International Journal of Computer Engineering and Applications, Volume XII, Issue III, March 18, ISSN

Size: px

Start display at page:

Download "International Journal of Computer Engineering and Applications, Volume XII, Issue III, March 18, ISSN"

Joseph Allen
5 years ago
Views:

International Journal of Computer Engineering and Applications, STRING SIMILARITY BASED ON PHONETIC IN GUJARATI LANGUAGE USING GUJSIM ALGORITHM Chandrakant D. Patel 1, Dr. Jayesh M.

ABSTRACT: Searching using top 10 search engine 1 to find ગ ધ જ or ગ ન ધ જ and surprised to see the result which far differs from one to another. As in the Gujarati language, both strings are correct.

1 International Journal of Computer Engineering and Applications, STRING SIMILARITY BASED ON PHONETIC IN GUJARATI LANGUAGE USING GUJSIM ALGORITHM Chandrakant D. Patel 1, Dr. Jayesh M. Patel 2 1 Research Scholar, Hemchandracharya North Gujarat University, Gujarat, India. 2 MCA Department, Acharya Motibhai Patel Institute of Computer Studies, Ganpat University, Gujarat, India. ABSTRACT: Searching using top 10 search engine 1 to find ગ ધ જ or ગ ન ધ જ and surprised to see the result which far differs from one to another. As in the Gujarati language, both strings are correct. Therefore, String similarity algorithm is useful for text mining applications while we generate index saving space and time, both. Basically, string similarity compares each character from both strings but it may not give the accurate result on highly rich Gujarati language due to different kinds of writing styles which depend on matras, reph, vatu, and diacritics on simple and compound alphabets. GUJSIM (GUJarati SIMilarity) algorithm is the hybrid approach to do strings similarity for the Gujarati language. Here, the author compares 70 strings pairs and GUJSIM algorithm which gives optimum percentage result. This algorithm also helps to reduce the percentage of index based on the unique string. Keywords: Gujarati Language, String Similarity, Phonetic, String Distance Algorithm [1] INTRODUCTION In General, a similarity is a comparison of commonality between different objects. An object can be two strings or corpus or knowledge. The first paper introduced by R. W. Hamming [1] for detecting & correcting codes while transforming the bits. Here, an author s concentrate on string similarity approach. [2] Strings which are generally needed to search while processing in the search engine. [3] [5] A research community already represented a 1 Chandrakant D. Patel and Dr. Jayesh M. Patel 82

2 STRING SIMILARITY BASED ON PHONETIC IN GUJARATI LANGUAGE USING GUJSIM ALGORITHM list of string similarity algorithms for non-indian language. For Indian languages, there is space to research in this domain. A string similarity algorithm working in NLP shown in [Figure 1] are useful, questions & answering system, text summarization, speech recognition, spelling correction, word sense disambiguation, plagiarism, short answer grading as well as for stemmer. [3] Generally, we consider the only comparison for strings with side by side characters. These kinds of comparison may give a different result due to matras & diacritics for Indian languages and it may not be fruitful for that language which is highly rich and free order such as Gujarati. Exa. ન દર (to sleep) and નનન દર (to sleep). [2] GUJARATI LANGUAGE India is a cosmopolitan country with languages. As per Indian government records, for scheduled language in descending order of speakers strength who wrote the Gujarati as mother tongue in 2001, a Gujarati language at Seven position with 4.48% of total population and as per record of CIA 54.6 million speaks Gujarati in India and 65.6 million speakers of Gujarati World Wide in 2011 census. By the way, Gujarati was the first language of the father of India called as Gandhi (Mohandas K. Gandhi) and father of Pakistan called Jinnah (Mohammed Ali Jinnah).[6] Figure 1 Overview of String Similarity Algorithm [1][16] [2.1] GUJARATI LANGUAGE FEATURES Gujarati is morphologically rich and free order language. It is the native language of Gujarat state and a member of Indo-Aryan family of languages. A hierarchical structure [Table 1] of Gujarati language could be like this: Sentences અય ધ ય મ ર મ ર જ ય ચ લત આવય છ (Ram state has been running in Ayodhya) Phrase અય ધ ય મ + ર મ ર જ ય + ચ લત આવય છ Word અય ધ ય મ + ર મ + ર જ ય + ચ લત + આવય + છ Morpheme અય ધ ય + મ + ર મ + ર જ ય + ચ લત + આવય + છ Phoneme ayodhyo+ma+ram+rajya+chaltu+aavyu+chhe Chandrakant D. Patel and Dr. Jayesh M. Patel 83

3 International Journal of Computer Engineering and Applications, Table 1 The hierarchy of Gujarati syntax The Gujarati alphabets have two kinds: Simple and Compound. Simple alphabet represented by a single letter while compound alphabets are conjunction with two or more letters with diacritics and/or matras. Through different diacritics, the words become richer than simple word while in the form of writing but phonic sound help to identify the difference between them. A list of conjunction letter studied from Bhagavadgomanal 2 and listed in [ Table 2]. થ Sr. No. Conjunction forms Example(s) 1 Half-form of consonants મ + + ય = મ ય 2 Upper-based form of Reph ર + + થ = 3 Lower-based form of Reph પ + + ર = પ ર 4 Vattu Variants ડ ર = ર + + ડ 5 Matra, Reph, vowel modifications ગ = + ગ + + ર Table 2 Sample of Compound Letter The features of Gujarati language [6] [8] are as follows: 1. It has no articles, capital letters and follows the Devanagari script and decedent from the Sanskrit language. 2. Its vocabulary contains four kinds of words: Tatsam (accept from the Sanskrit language), Tadbhav (adopt from Sanskrit with the change in phonetics), Native (original words of Gujarati language), Loanwords (adopt from others languages). 3. Gujarati alphabet has 34 consonants and 14 vowels with independent (અ,આ,એ...) and dependent (,,...). [3] LITERATURE REVIEW Through literature review in [Figure 1], we focused on two types of algorithm for this research paper: Edit Distance algorithm [9] [12] and Phonetic algorithm. [13] [16] We found that few algorithms developed in phonetic based string similarity and there is no directly applicable algorithm for the Gujarati language which highly rich and free order form. Harry which was developed by [17] more than 10 different algorithms to compare string similarity, measuring and its analysis with the matrix of distance values. The Soundex algorithm considering the first algorithm for matching the string based on sounds for the English language. It was represented by Robert C. Russell in It is a code which represents a letter with 3 digits. The variants of Soundex algorithm are reverse Soundex introduced by NYSIIS (New York State Identification and Intelligence System) in It is used a long time ago (no device were present) in the US. It has not enough accuracy. It mostly depends on initial letter. D-M Soundex proposed by Gary Mokotoff in It may fail to misspelled words and US Patent (issued ) Chandrakant D. Patel and Dr. Jayesh M. Patel 84

4 STRING SIMILARITY BASED ON PHONETIC IN GUJARATI LANGUAGE USING GUJSIM ALGORITHM slower than Soundex. Kölner phonetics proposed in 1969 for German words, only. The Soundex algorithm has little bit deficiencies which introduced a new algorithm in 1990 by Lawrence Phillips with two variants Double Metaphone in and Metaphone-3 which got high accuracy i.e 99% in the English language. The Caver phone algorithm [18] developed by David Hood in 2002 and modified in 2004 for English language only. So, this literature review, we couldn t apply these algorithms to Gujarati language words in [Table 4]. [4] PROPOSED ALGORITHM: GUJSIM (STRING1, STRING2) : FLOAT (0.0 TO 1.0) Using given basic model which is motivated by [13] in [Figure 2] which can easily applicable to Gujarati language and author hope it may work on other Indian languages, also. GUJSIM algorithm: Input: String_1 and String_2. Process: Phonetic algorithm (A), Distance algorithm (B) Output: Distance of String_1 and String_2. A. Phonetic Algorithm Step-01: Start. Figure 2 Flowchart of GUJSIM algorithm Step-02: Tokenization the strings using NLTK (Natural Language Tool Kit) in python. Step-03: This process replaces and generates an encoded string for both strings with following rules. Step-03.01: Repeat of each character for both strings. Step-03.02: Skip the character in both strings such as (Halant-હલન ત) Step-03.03: Replace the occurrence for each character with a code as [Table 3]. Step-03.04: Store both strings as encoded results. Step-04: Send encoded strings to distance algorithm. Chandrakant D. Patel and Dr. Jayesh M. Patel 85

5 International Journal of Computer Engineering and Applications, Step-05: Stop. Table 3 Encode character with Gujarati alphabets Characters code Characters Code અ આ A પ ફ બ ભ K ઇ ઈ B મ ન ણ L ઉ ઊ C ય M એ ઐ D ઋ ર N ઑ ઓ ઔ E લ ળ O ક ખ ગ ઘ F વ P ઙ ઞ G શ ષ સ Q ચ છ જ ઝ H હ R ટ ઠ ડ ઢ I ઍ ઽ ૐ 0 ત થ દ ધ J B. Distance Algorithm Step-01: Start Step-02: Input encoded strings. Step-03: Apply distance algorithm to find the distance between strings. Step-04: Calculate the string similarity using (1) Max( DA) Max( len( Str1, 2)..(1) D 1 Str Here, DA stands Distance Algorithm value such LEVENSHTEIN, SORENSEN, JACCARD, FAST_COMP, HAMMING Max() function provide maximum value from input numbers, Len() function provide the length of given string. Step-05: Store the result and review it. Step-06: Stop [5] RESULT ANALYSIS WITH FUTURE SCOPE The hybrid approach algorithm tested on Python language using NLTK, Distance and Jellyfish package for counting the string distance algorithm for Gujarati words. We got the optimum result which is displayed in [Table 5] and we tested 70 string pairs. Using, wrongly identification of stemming words are nearly 2% to 5% of the strings which need to improve the result using string similarity approach. With the help of this algorithm, we can reduce the size of index especially when we are going to search online or stemming the operations. The GUJSIM hybrid approach implemented, tested and got enough result on Gujarati words with similar phonetic words and differently written due to the rich morphology of Gujarati language. Thus, in a future algorithm can test on large corpus for better accuracy and developed universal string similarity algorithm for any languages. Using this algorithm, we can extend for stemming the words in Indian languages which are mostly rich in inflectional. Chandrakant D. Patel and Dr. Jayesh M. Patel 86

6 STRING SIMILARITY BASED ON PHONETIC IN GUJARATI LANGUAGE USING GUJSIM ALGORITHM The total strings in this research paper are 140. Using the python set function the strings number is going to down and is equal to 129. So, the percentage of string compression is Unique String / Total String * 100. Here, 129 / 140 *100 = But, as per the GUJSIM algorithm, only 70 pairs of the string are unique because the string already checked with similarity phonetic characteristics for the Gujarati words. So, the final string compression will down to 70 / 140 * 100 = 50%. Using, this GUJSIM algorithm, we got the best result while you consider the index of string based on unique words. [6] ACKNOWLEDGMENT The author extremely thanks, Acharya Motibhai Patel Institute of Computer Studies (MCA Department) of Ganpat University. The author thanks from the bottom of heart to Guide Dr. Jayesh M. Patel whose help when stuck to solve the issue regarding algorithms and students of MCA for language writing (Gujarati Words), also. Thanks to wife Anita and son Sarthak to provide a healthy environment. Chandrakant D. Patel and Dr. Jayesh M. Patel 87

7 International Journal of Computer Engineering and Applications, Table 4 Distance value of algorithms on Gujarati words STRING-1 STRING-1 JACCARD SORENSEN LEVENSHTEIN FASTCOMP HAMMING ન દર નનન દર અનસ ધ ન અનસન ધ ન ક ત ન કન ત ન સ સ ધન સ શ ધન ગ ધ જ ગ ન ધ જ પ રનતક પ રત ક ત લ મ ત નલમ સમજ ત સમજ ત અનકરણ ય અન કરણ ય ગનહણ ગહ ણ કનમટ કમ ટ આલ બમ આલબમ ખશનમ ખસનમ મ જ ર મન જ ર અ ક ર અન ક ર ક ર સ ન ક ર શ ન આજ ન ક આજ ક ફરજ ય ત ફરનજય ત સ ત ષ સ ત શ ચ દ રક ત ચન દ રક ન ત નદ ળ નદ લ પરમ ત મ પરમ તમ ઈશ વર ઈસ ર નજન દ જ દ ન દર નનન દર નનરપ ક ષ નનરપ કશ અનસ ધ ન અનસન ધ ન ક ત ન કન ત ન સ સ ધન સ શ ધન સ ર થક સ રર ક પત ન પતન ન શ વ સ ન સ સ પ રસ દ પરસ દ કણથ કરણ ઓમક ર શ વર ઓમ ક ર શ વર ર નપય ર પ ય Chandrakant D. Patel and Dr. Jayesh M. Patel 88

8 STRING SIMILARITY BASED ON PHONETIC IN GUJARATI LANGUAGE USING GUJSIM ALGORITHM STRING-1 STRING-1 JACCARD SORENSEN LEVENSHTEIN FASTCOMP HAMMING જ લ સ ર જ ળ શ વર ક ષન ય કશ ય નનરપ ક ષ ન રપ ક શ આ રદ આવ રદ ભ લ ભલ નદન દ ન ધમથ ધરમ કમથ કરમ ગ લ ન ગલ ન પ ણ પ ન ગણ ત ગન ત અન ત અનનત ક નત ક નન ત નદ સ દ સ ક ષ કષ અર થ અરર જ ઈન ટ જ ઈ ટ પ પતર પ નન સલ પ નનસલ પ થ પ ર નશ શ ક મ યટર ક મ યટર ગણ શજ ગન શનજ નશક ર સ ક ર શ મ ત સ મ ત જમર ખ જમ રખ શ નત સ ત નશ લ સ ળ લક ષ મ જ લકશમ જ મ ઘદ ત મ ગદ ત દ ત દ ન ત ગ ય ન ગય ન જ ઞ ન જ ઞ નન આ રદ આવ રદ Chandrakant D. Patel and Dr. Jayesh M. Patel 89

9 STRING-1 International Journal of Computer Engineering and Applications, Table 5 Distance value of GUJSIM on Gujarati words STRING- 2 Encoded-1 Encoded-2 JACCARD SORENSEN LEVENSHTEIN FASTCOMP HAMMING GUJSIM ન દર નનન દર LBLJN LBLJN અનસ ધ ન અનસન ધ ન ALCQLJAL ALCQLJAL ક ત ન કન ત ન FLJAL FLJAL સ સ ધન સ શ ધન QLQEJL QLQEJL ગ ધ જ ગ ન ધ જ FALJBHB FALJBHB પ રનતક પ રત ક KNJBF KNJBF ત લ મ ત નલમ JAOB JAOBL સમજ ત સમજ ત QLHCJ QLHCJB અનકરણ ય અન કરણ ય ALCFNLM ALCFNLBM ગનહણ ગહ ણ FCRBLB FCRBLB કનમટ કમ ટ FLBIB FLBIB આલ બમ આલબમ AOKL AOKL ખશનમ ખસનમ FCQLCLA FCQLCLA મ જ ર મન જ ર LLHCN LLHCN અ ક ર અન ક ર ALFCN ALFCN ક ર સ ન ક ર શ ન FDNEQBL FDNEQBL આજ ન ક આજ ક AHBPBFA AHBPBFA ફરજ ય ત ફરનજય ત KNHBMAJ KNHBMAJ સ ત ષ સ ત શ QLJEQ QLJEQ ચ દ રક ત ચન દ રક ન ત HLJNFALJ HLJNFALJ નદ ળ નદ લ JBPAOB JBPAOB પરમ ત મ પરમ તમ KNLAJLA KNLAJLA ઈશ વર ઈસ ર BQPN BQPN નજન દ જ દ HBLJA HBLJA ન દર નનન દર LBLJN LBLJN નનરપ ક ષ નનરપ કશ LBNKDFQ LBNKDFQ અનસ ધ ન અનસન ધ ન ALCQLJAL ALCQLJAL ક ત ન કન ત ન FLJAL FLJAL સ સ ધન સ શ ધન QLQEJL QLQEJL સ ર થક સ રર ક QANJF QANJF પત ન પતન KJLB KJLB ન શ વ સ ન સ સ PBQPAQC PBQPAQC પ રસ દ પરસ દ KNQAJ KNQAJB કણથ કરણ FNL FNL ઓમક ર શ વર ઓમ ક ર શ વર ELFANDPN ELFANDQPN ર નપય ર પ ય NCKBMA NCKBMA જ લ સ ર જ ળ શ વર HAODQPN HAODQPN ક ષન ય કશ ય FQJNBM FQJNBM નનરપ ક ષ ન રપ ક શ LBNKDFQ LBNKDFQ આ રદ આવ રદ APNJA APNJA ભ લ ભલ KCO KCO Chandrakant D. Patel and Dr. Jayesh M. Patel 90

10 STRING SIMILARITY BASED ON PHONETIC IN GUJARATI LANGUAGE USING GUJSIM ALGORITHM STRING-1 STRING- 2 Encoded-1 Encoded-2 JACCARD SORENSEN LEVENSHTEIN FASTCOMP HAMMING GUJSIM નદન દ ન JBL JBL ધમથ ધરમ JNL JNL કમથ કરમ FNL FNL ગ લ ન ગલ ન FOALB FOALB પ ણ પ ન KALB KALB ગણ ત ગન ત FCLPJA FCLPJA અન ત અનનત ALBJA ALBJA ક નત ક નન ત FNALJB FNALJB નદ સ દ સ JBPQ JBPQ ક ષ કષ FQDJN FQDJN અર થ અરર ANJ ANJ જ ઈન ટ જ ઈ ટ HEBLI HEBLI પ પતર KCJN KCJN પ નન સલ પ નનસલ KDLQBO KDLQBO પ થ પ ર KCNJPB KCNJPB નશ શ QBP QBP ક મ યટર ક મ યટર FELKMCIN FELKMCIN ગણ શજ ગન શનજ FLDQHB FLDQHB નશક ર સ ક ર QBFAN QBFAN શ મ ત સ મ ત QBLLJ QBLLJ જમર ખ જમ રખ HLNCF HLNCF શ નત સ ત QALJB QALJB નશ લ સ ળ QBPAOA QBPAOA લક ષ મ જ લકશમ જ OFQLBHB OFQLBHB મ ઘદ ત મ ગદ ત LDFJCJ LDFJCJ દ ત દ ન ત PDJALJ PDJALJ ગ ય ન ગય ન FMALB FMALB જ ઞ ન જ ઞ નન HGALB HGALB આ રદ આવ રદ APNJA APNJA Chandrakant D. Patel and Dr. Jayesh M. Patel 91

11 International Journal of Computer Engineering and Applications, REFERENCES [1] R. W. Hamming, Error Detecting and Error Correcting Codes, Bell System Technical Journal, vol. 29, no. 2, pp , [2] S. Wandelt, S. Gerdjikov, E. Siragusa, A. Tiskin, W. Wang, and J. Wang, State-of-the-art in String Similarity Search and Join, SIGMOD Record, vol. 43, no. 1, pp , [3] C. D. Patel and J. M. Patel, A Review of Indian and Non-Indian Stemming : A focus on Gujarati Stemming Algorithms, International Journal of Advanced Research, vol. 3, no. 12, pp , [4] C. D. Patel, GUJSTER: a Rule based stemmer using Dictionary Approach, in IEEE - International Conference on Inventive Communication and Computational Technologies, 2017, no. March-2017, pp [5] C. D. Patel and T. Modi, A mode of web page categorization of Gujarati language, Journal Of Information, Knowledge And Research In Computer Science And Applications, vol. 4, no. 1, pp , [6] C. K. et. al. Bhensdadia, Introduction to Gujarati Wordnet, Proc. Third National Workshop on IndoWordNet, pp. 1 5, [7] S. S. Panchal, Gujarati WordNet A Lexical Database, International Journal of Computer Applications, vol. 116, no. 20, pp. 6 8, [8] C. D. Patel and J. M. Patel, Improving a Lightweight Stemmer for Gujarati Language, International Journal of Information Sciences and Techniques, vol. 6, no. 1, pp , [9] V. Levenshtein, Levenshtein Distance, pp. 1 6, [10] W. J. Masek, A Faster Algorithm Computing String Edit Distances*, Journal of Computer And System Sciences, vol. 1, no. 20, pp , [11] W. B. Frakes and C. J. Fox, Strength and similarity of affix removal stemming algorithms, ACM SIGIR Forum, vol. 37, no. 1, pp , [12] E. Ukkonen, Algorithms for approximate string matching, Information and Control, vol. 64, no. 1 3, pp , [13] V. Abbasi, Phonetic Analysis and Searching with Google Glass API, [14] C. Project and O. Technical, Caverphone : Phonetic Matching algorithm, vol. 12, no. Table 1, [15] J. Sheth, Gujarati Phonetics and Levenshtein based String Similarity Measure for Gujarati Language, NCILC, [16] V. P. Parmar, Study Existing Various Phonetic Algorithms and Designing and Development of a working model for the New Developed Algorithm and Comparison by implementing it with Existing Algorithm ( s ), International Journal of Computer Applications, vol. 98, no. 19, pp , [17] K. Rieck KONRADRIECK and C. Christian Wressnegger, Harry: A Tool for Measuring String Similarity, Journal of Machine Learning Research, vol. 17, pp. 1 5, [18] D. Hood, Caverphone : Phonetic Matching algorithm, Caversham Project Occasional Technical Paper, vol. 12, no. CTP060902, pp. 1 11, Chandrakant D. Patel and Dr. Jayesh M. Patel 92

GUJARAT TECHNOLOGICAL UNIVERSITY Diploma Engineering - SEMESTER II (CtoD) EXAMINATION WINTER 2014

Seat No.: Enrolment No. GUJARAT TECHNOLOGICAL UNIVERSITY Diploma Engineering - SEMESTER II (CtoD) EXAMINATION WINTER 2014 Subject Code: C320903 Date: 29-12-2014 Subject Name: D. C. CIRCUIT Time: 10:30