International Journal of Computer Engineering and Applications, Volume XII, Issue III, March 18, ISSN

Size: px
Start display at page:

Download "International Journal of Computer Engineering and Applications, Volume XII, Issue III, March 18, ISSN"

Transcription

1 International Journal of Computer Engineering and Applications, STRING SIMILARITY BASED ON PHONETIC IN GUJARATI LANGUAGE USING GUJSIM ALGORITHM Chandrakant D. Patel 1, Dr. Jayesh M. Patel 2 1 Research Scholar, Hemchandracharya North Gujarat University, Gujarat, India. 2 MCA Department, Acharya Motibhai Patel Institute of Computer Studies, Ganpat University, Gujarat, India. ABSTRACT: Searching using top 10 search engine 1 to find ગ ધ જ or ગ ન ધ જ and surprised to see the result which far differs from one to another. As in the Gujarati language, both strings are correct. Therefore, String similarity algorithm is useful for text mining applications while we generate index saving space and time, both. Basically, string similarity compares each character from both strings but it may not give the accurate result on highly rich Gujarati language due to different kinds of writing styles which depend on matras, reph, vatu, and diacritics on simple and compound alphabets. GUJSIM (GUJarati SIMilarity) algorithm is the hybrid approach to do strings similarity for the Gujarati language. Here, the author compares 70 strings pairs and GUJSIM algorithm which gives optimum percentage result. This algorithm also helps to reduce the percentage of index based on the unique string. Keywords: Gujarati Language, String Similarity, Phonetic, String Distance Algorithm [1] INTRODUCTION In General, a similarity is a comparison of commonality between different objects. An object can be two strings or corpus or knowledge. The first paper introduced by R. W. Hamming [1] for detecting & correcting codes while transforming the bits. Here, an author s concentrate on string similarity approach. [2] Strings which are generally needed to search while processing in the search engine. [3] [5] A research community already represented a 1 Chandrakant D. Patel and Dr. Jayesh M. Patel 82

2 STRING SIMILARITY BASED ON PHONETIC IN GUJARATI LANGUAGE USING GUJSIM ALGORITHM list of string similarity algorithms for non-indian language. For Indian languages, there is space to research in this domain. A string similarity algorithm working in NLP shown in [Figure 1] are useful, questions & answering system, text summarization, speech recognition, spelling correction, word sense disambiguation, plagiarism, short answer grading as well as for stemmer. [3] Generally, we consider the only comparison for strings with side by side characters. These kinds of comparison may give a different result due to matras & diacritics for Indian languages and it may not be fruitful for that language which is highly rich and free order such as Gujarati. Exa. ન દર (to sleep) and નનન દર (to sleep). [2] GUJARATI LANGUAGE India is a cosmopolitan country with languages. As per Indian government records, for scheduled language in descending order of speakers strength who wrote the Gujarati as mother tongue in 2001, a Gujarati language at Seven position with 4.48% of total population and as per record of CIA 54.6 million speaks Gujarati in India and 65.6 million speakers of Gujarati World Wide in 2011 census. By the way, Gujarati was the first language of the father of India called as Gandhi (Mohandas K. Gandhi) and father of Pakistan called Jinnah (Mohammed Ali Jinnah).[6] Figure 1 Overview of String Similarity Algorithm [1][16] [2.1] GUJARATI LANGUAGE FEATURES Gujarati is morphologically rich and free order language. It is the native language of Gujarat state and a member of Indo-Aryan family of languages. A hierarchical structure [Table 1] of Gujarati language could be like this: Sentences અય ધ ય મ ર મ ર જ ય ચ લત આવય છ (Ram state has been running in Ayodhya) Phrase અય ધ ય મ + ર મ ર જ ય + ચ લત આવય છ Word અય ધ ય મ + ર મ + ર જ ય + ચ લત + આવય + છ Morpheme અય ધ ય + મ + ર મ + ર જ ય + ચ લત + આવય + છ Phoneme ayodhyo+ma+ram+rajya+chaltu+aavyu+chhe Chandrakant D. Patel and Dr. Jayesh M. Patel 83

3 International Journal of Computer Engineering and Applications, Table 1 The hierarchy of Gujarati syntax The Gujarati alphabets have two kinds: Simple and Compound. Simple alphabet represented by a single letter while compound alphabets are conjunction with two or more letters with diacritics and/or matras. Through different diacritics, the words become richer than simple word while in the form of writing but phonic sound help to identify the difference between them. A list of conjunction letter studied from Bhagavadgomanal 2 and listed in [ Table 2]. થ Sr. No. Conjunction forms Example(s) 1 Half-form of consonants મ + + ય = મ ય 2 Upper-based form of Reph ર + + થ = 3 Lower-based form of Reph પ + + ર = પ ર 4 Vattu Variants ડ ર = ર + + ડ 5 Matra, Reph, vowel modifications ગ = + ગ + + ર Table 2 Sample of Compound Letter The features of Gujarati language [6] [8] are as follows: 1. It has no articles, capital letters and follows the Devanagari script and decedent from the Sanskrit language. 2. Its vocabulary contains four kinds of words: Tatsam (accept from the Sanskrit language), Tadbhav (adopt from Sanskrit with the change in phonetics), Native (original words of Gujarati language), Loanwords (adopt from others languages). 3. Gujarati alphabet has 34 consonants and 14 vowels with independent (અ,આ,એ...) and dependent (,,...). [3] LITERATURE REVIEW Through literature review in [Figure 1], we focused on two types of algorithm for this research paper: Edit Distance algorithm [9] [12] and Phonetic algorithm. [13] [16] We found that few algorithms developed in phonetic based string similarity and there is no directly applicable algorithm for the Gujarati language which highly rich and free order form. Harry which was developed by [17] more than 10 different algorithms to compare string similarity, measuring and its analysis with the matrix of distance values. The Soundex algorithm considering the first algorithm for matching the string based on sounds for the English language. It was represented by Robert C. Russell in It is a code which represents a letter with 3 digits. The variants of Soundex algorithm are reverse Soundex introduced by NYSIIS (New York State Identification and Intelligence System) in It is used a long time ago (no device were present) in the US. It has not enough accuracy. It mostly depends on initial letter. D-M Soundex proposed by Gary Mokotoff in It may fail to misspelled words and US Patent (issued ) Chandrakant D. Patel and Dr. Jayesh M. Patel 84

4 STRING SIMILARITY BASED ON PHONETIC IN GUJARATI LANGUAGE USING GUJSIM ALGORITHM slower than Soundex. Kölner phonetics proposed in 1969 for German words, only. The Soundex algorithm has little bit deficiencies which introduced a new algorithm in 1990 by Lawrence Phillips with two variants Double Metaphone in and Metaphone-3 which got high accuracy i.e 99% in the English language. The Caver phone algorithm [18] developed by David Hood in 2002 and modified in 2004 for English language only. So, this literature review, we couldn t apply these algorithms to Gujarati language words in [Table 4]. [4] PROPOSED ALGORITHM: GUJSIM (STRING1, STRING2) : FLOAT (0.0 TO 1.0) Using given basic model which is motivated by [13] in [Figure 2] which can easily applicable to Gujarati language and author hope it may work on other Indian languages, also. GUJSIM algorithm: Input: String_1 and String_2. Process: Phonetic algorithm (A), Distance algorithm (B) Output: Distance of String_1 and String_2. A. Phonetic Algorithm Step-01: Start. Figure 2 Flowchart of GUJSIM algorithm Step-02: Tokenization the strings using NLTK (Natural Language Tool Kit) in python. Step-03: This process replaces and generates an encoded string for both strings with following rules. Step-03.01: Repeat of each character for both strings. Step-03.02: Skip the character in both strings such as (Halant-હલન ત) Step-03.03: Replace the occurrence for each character with a code as [Table 3]. Step-03.04: Store both strings as encoded results. Step-04: Send encoded strings to distance algorithm. Chandrakant D. Patel and Dr. Jayesh M. Patel 85

5 International Journal of Computer Engineering and Applications, Step-05: Stop. Table 3 Encode character with Gujarati alphabets Characters code Characters Code અ આ A પ ફ બ ભ K ઇ ઈ B મ ન ણ L ઉ ઊ C ય M એ ઐ D ઋ ર N ઑ ઓ ઔ E લ ળ O ક ખ ગ ઘ F વ P ઙ ઞ G શ ષ સ Q ચ છ જ ઝ H હ R ટ ઠ ડ ઢ I ઍ ઽ ૐ 0 ત થ દ ધ J B. Distance Algorithm Step-01: Start Step-02: Input encoded strings. Step-03: Apply distance algorithm to find the distance between strings. Step-04: Calculate the string similarity using (1) Max( DA) Max( len( Str1, 2)..(1) D 1 Str Here, DA stands Distance Algorithm value such LEVENSHTEIN, SORENSEN, JACCARD, FAST_COMP, HAMMING Max() function provide maximum value from input numbers, Len() function provide the length of given string. Step-05: Store the result and review it. Step-06: Stop [5] RESULT ANALYSIS WITH FUTURE SCOPE The hybrid approach algorithm tested on Python language using NLTK, Distance and Jellyfish package for counting the string distance algorithm for Gujarati words. We got the optimum result which is displayed in [Table 5] and we tested 70 string pairs. Using, wrongly identification of stemming words are nearly 2% to 5% of the strings which need to improve the result using string similarity approach. With the help of this algorithm, we can reduce the size of index especially when we are going to search online or stemming the operations. The GUJSIM hybrid approach implemented, tested and got enough result on Gujarati words with similar phonetic words and differently written due to the rich morphology of Gujarati language. Thus, in a future algorithm can test on large corpus for better accuracy and developed universal string similarity algorithm for any languages. Using this algorithm, we can extend for stemming the words in Indian languages which are mostly rich in inflectional. Chandrakant D. Patel and Dr. Jayesh M. Patel 86

6 STRING SIMILARITY BASED ON PHONETIC IN GUJARATI LANGUAGE USING GUJSIM ALGORITHM The total strings in this research paper are 140. Using the python set function the strings number is going to down and is equal to 129. So, the percentage of string compression is Unique String / Total String * 100. Here, 129 / 140 *100 = But, as per the GUJSIM algorithm, only 70 pairs of the string are unique because the string already checked with similarity phonetic characteristics for the Gujarati words. So, the final string compression will down to 70 / 140 * 100 = 50%. Using, this GUJSIM algorithm, we got the best result while you consider the index of string based on unique words. [6] ACKNOWLEDGMENT The author extremely thanks, Acharya Motibhai Patel Institute of Computer Studies (MCA Department) of Ganpat University. The author thanks from the bottom of heart to Guide Dr. Jayesh M. Patel whose help when stuck to solve the issue regarding algorithms and students of MCA for language writing (Gujarati Words), also. Thanks to wife Anita and son Sarthak to provide a healthy environment. Chandrakant D. Patel and Dr. Jayesh M. Patel 87

7 International Journal of Computer Engineering and Applications, Table 4 Distance value of algorithms on Gujarati words STRING-1 STRING-1 JACCARD SORENSEN LEVENSHTEIN FASTCOMP HAMMING ન દર નનન દર અનસ ધ ન અનસન ધ ન ક ત ન કન ત ન સ સ ધન સ શ ધન ગ ધ જ ગ ન ધ જ પ રનતક પ રત ક ત લ મ ત નલમ સમજ ત સમજ ત અનકરણ ય અન કરણ ય ગનહણ ગહ ણ કનમટ કમ ટ આલ બમ આલબમ ખશનમ ખસનમ મ જ ર મન જ ર અ ક ર અન ક ર ક ર સ ન ક ર શ ન આજ ન ક આજ ક ફરજ ય ત ફરનજય ત સ ત ષ સ ત શ ચ દ રક ત ચન દ રક ન ત નદ ળ નદ લ પરમ ત મ પરમ તમ ઈશ વર ઈસ ર નજન દ જ દ ન દર નનન દર નનરપ ક ષ નનરપ કશ અનસ ધ ન અનસન ધ ન ક ત ન કન ત ન સ સ ધન સ શ ધન સ ર થક સ રર ક પત ન પતન ન શ વ સ ન સ સ પ રસ દ પરસ દ કણથ કરણ ઓમક ર શ વર ઓમ ક ર શ વર ર નપય ર પ ય Chandrakant D. Patel and Dr. Jayesh M. Patel 88

8 STRING SIMILARITY BASED ON PHONETIC IN GUJARATI LANGUAGE USING GUJSIM ALGORITHM STRING-1 STRING-1 JACCARD SORENSEN LEVENSHTEIN FASTCOMP HAMMING જ લ સ ર જ ળ શ વર ક ષન ય કશ ય નનરપ ક ષ ન રપ ક શ આ રદ આવ રદ ભ લ ભલ નદન દ ન ધમથ ધરમ કમથ કરમ ગ લ ન ગલ ન પ ણ પ ન ગણ ત ગન ત અન ત અનનત ક નત ક નન ત નદ સ દ સ ક ષ કષ અર થ અરર જ ઈન ટ જ ઈ ટ પ પતર પ નન સલ પ નનસલ પ થ પ ર નશ શ ક મ યટર ક મ યટર ગણ શજ ગન શનજ નશક ર સ ક ર શ મ ત સ મ ત જમર ખ જમ રખ શ નત સ ત નશ લ સ ળ લક ષ મ જ લકશમ જ મ ઘદ ત મ ગદ ત દ ત દ ન ત ગ ય ન ગય ન જ ઞ ન જ ઞ નન આ રદ આવ રદ Chandrakant D. Patel and Dr. Jayesh M. Patel 89

9 STRING-1 International Journal of Computer Engineering and Applications, Table 5 Distance value of GUJSIM on Gujarati words STRING- 2 Encoded-1 Encoded-2 JACCARD SORENSEN LEVENSHTEIN FASTCOMP HAMMING GUJSIM ન દર નનન દર LBLJN LBLJN અનસ ધ ન અનસન ધ ન ALCQLJAL ALCQLJAL ક ત ન કન ત ન FLJAL FLJAL સ સ ધન સ શ ધન QLQEJL QLQEJL ગ ધ જ ગ ન ધ જ FALJBHB FALJBHB પ રનતક પ રત ક KNJBF KNJBF ત લ મ ત નલમ JAOB JAOBL સમજ ત સમજ ત QLHCJ QLHCJB અનકરણ ય અન કરણ ય ALCFNLM ALCFNLBM ગનહણ ગહ ણ FCRBLB FCRBLB કનમટ કમ ટ FLBIB FLBIB આલ બમ આલબમ AOKL AOKL ખશનમ ખસનમ FCQLCLA FCQLCLA મ જ ર મન જ ર LLHCN LLHCN અ ક ર અન ક ર ALFCN ALFCN ક ર સ ન ક ર શ ન FDNEQBL FDNEQBL આજ ન ક આજ ક AHBPBFA AHBPBFA ફરજ ય ત ફરનજય ત KNHBMAJ KNHBMAJ સ ત ષ સ ત શ QLJEQ QLJEQ ચ દ રક ત ચન દ રક ન ત HLJNFALJ HLJNFALJ નદ ળ નદ લ JBPAOB JBPAOB પરમ ત મ પરમ તમ KNLAJLA KNLAJLA ઈશ વર ઈસ ર BQPN BQPN નજન દ જ દ HBLJA HBLJA ન દર નનન દર LBLJN LBLJN નનરપ ક ષ નનરપ કશ LBNKDFQ LBNKDFQ અનસ ધ ન અનસન ધ ન ALCQLJAL ALCQLJAL ક ત ન કન ત ન FLJAL FLJAL સ સ ધન સ શ ધન QLQEJL QLQEJL સ ર થક સ રર ક QANJF QANJF પત ન પતન KJLB KJLB ન શ વ સ ન સ સ PBQPAQC PBQPAQC પ રસ દ પરસ દ KNQAJ KNQAJB કણથ કરણ FNL FNL ઓમક ર શ વર ઓમ ક ર શ વર ELFANDPN ELFANDQPN ર નપય ર પ ય NCKBMA NCKBMA જ લ સ ર જ ળ શ વર HAODQPN HAODQPN ક ષન ય કશ ય FQJNBM FQJNBM નનરપ ક ષ ન રપ ક શ LBNKDFQ LBNKDFQ આ રદ આવ રદ APNJA APNJA ભ લ ભલ KCO KCO Chandrakant D. Patel and Dr. Jayesh M. Patel 90

10 STRING SIMILARITY BASED ON PHONETIC IN GUJARATI LANGUAGE USING GUJSIM ALGORITHM STRING-1 STRING- 2 Encoded-1 Encoded-2 JACCARD SORENSEN LEVENSHTEIN FASTCOMP HAMMING GUJSIM નદન દ ન JBL JBL ધમથ ધરમ JNL JNL કમથ કરમ FNL FNL ગ લ ન ગલ ન FOALB FOALB પ ણ પ ન KALB KALB ગણ ત ગન ત FCLPJA FCLPJA અન ત અનનત ALBJA ALBJA ક નત ક નન ત FNALJB FNALJB નદ સ દ સ JBPQ JBPQ ક ષ કષ FQDJN FQDJN અર થ અરર ANJ ANJ જ ઈન ટ જ ઈ ટ HEBLI HEBLI પ પતર KCJN KCJN પ નન સલ પ નનસલ KDLQBO KDLQBO પ થ પ ર KCNJPB KCNJPB નશ શ QBP QBP ક મ યટર ક મ યટર FELKMCIN FELKMCIN ગણ શજ ગન શનજ FLDQHB FLDQHB નશક ર સ ક ર QBFAN QBFAN શ મ ત સ મ ત QBLLJ QBLLJ જમર ખ જમ રખ HLNCF HLNCF શ નત સ ત QALJB QALJB નશ લ સ ળ QBPAOA QBPAOA લક ષ મ જ લકશમ જ OFQLBHB OFQLBHB મ ઘદ ત મ ગદ ત LDFJCJ LDFJCJ દ ત દ ન ત PDJALJ PDJALJ ગ ય ન ગય ન FMALB FMALB જ ઞ ન જ ઞ નન HGALB HGALB આ રદ આવ રદ APNJA APNJA Chandrakant D. Patel and Dr. Jayesh M. Patel 91

11 International Journal of Computer Engineering and Applications, REFERENCES [1] R. W. Hamming, Error Detecting and Error Correcting Codes, Bell System Technical Journal, vol. 29, no. 2, pp , [2] S. Wandelt, S. Gerdjikov, E. Siragusa, A. Tiskin, W. Wang, and J. Wang, State-of-the-art in String Similarity Search and Join, SIGMOD Record, vol. 43, no. 1, pp , [3] C. D. Patel and J. M. Patel, A Review of Indian and Non-Indian Stemming : A focus on Gujarati Stemming Algorithms, International Journal of Advanced Research, vol. 3, no. 12, pp , [4] C. D. Patel, GUJSTER: a Rule based stemmer using Dictionary Approach, in IEEE - International Conference on Inventive Communication and Computational Technologies, 2017, no. March-2017, pp [5] C. D. Patel and T. Modi, A mode of web page categorization of Gujarati language, Journal Of Information, Knowledge And Research In Computer Science And Applications, vol. 4, no. 1, pp , [6] C. K. et. al. Bhensdadia, Introduction to Gujarati Wordnet, Proc. Third National Workshop on IndoWordNet, pp. 1 5, [7] S. S. Panchal, Gujarati WordNet A Lexical Database, International Journal of Computer Applications, vol. 116, no. 20, pp. 6 8, [8] C. D. Patel and J. M. Patel, Improving a Lightweight Stemmer for Gujarati Language, International Journal of Information Sciences and Techniques, vol. 6, no. 1, pp , [9] V. Levenshtein, Levenshtein Distance, pp. 1 6, [10] W. J. Masek, A Faster Algorithm Computing String Edit Distances*, Journal of Computer And System Sciences, vol. 1, no. 20, pp , [11] W. B. Frakes and C. J. Fox, Strength and similarity of affix removal stemming algorithms, ACM SIGIR Forum, vol. 37, no. 1, pp , [12] E. Ukkonen, Algorithms for approximate string matching, Information and Control, vol. 64, no. 1 3, pp , [13] V. Abbasi, Phonetic Analysis and Searching with Google Glass API, [14] C. Project and O. Technical, Caverphone : Phonetic Matching algorithm, vol. 12, no. Table 1, [15] J. Sheth, Gujarati Phonetics and Levenshtein based String Similarity Measure for Gujarati Language, NCILC, [16] V. P. Parmar, Study Existing Various Phonetic Algorithms and Designing and Development of a working model for the New Developed Algorithm and Comparison by implementing it with Existing Algorithm ( s ), International Journal of Computer Applications, vol. 98, no. 19, pp , [17] K. Rieck KONRADRIECK and C. Christian Wressnegger, Harry: A Tool for Measuring String Similarity, Journal of Machine Learning Research, vol. 17, pp. 1 5, [18] D. Hood, Caverphone : Phonetic Matching algorithm, Caversham Project Occasional Technical Paper, vol. 12, no. CTP060902, pp. 1 11, Chandrakant D. Patel and Dr. Jayesh M. Patel 92

GUJARAT TECHNOLOGICAL UNIVERSITY Diploma Engineering - SEMESTER II (CtoD) EXAMINATION WINTER 2014

GUJARAT TECHNOLOGICAL UNIVERSITY Diploma Engineering - SEMESTER II (CtoD) EXAMINATION WINTER 2014 Seat No.: Enrolment No. GUJARAT TECHNOLOGICAL UNIVERSITY Diploma Engineering - SEMESTER II (CtoD) EXAMINATION WINTER 2014 Subject Code: C320903 Date: 29-12-2014 Subject Name: D. C. CIRCUIT Time: 10:30

More information

Gujarat Post office (MTS) Main Exam Study material part 2

Gujarat Post office (MTS) Main Exam Study material part 2 Gujarat Post office (MTS) Main Exam Study material part 2 1) Map શબ દ ક ન પરથ બન લ છ? નકશ મ ટ અ ગ જ શબ દ Map લ ટટન શબ દ Mappa અથવ Mappa mundi પરથ બન લ છ. ત ન અ થ હ થર મ લ કદન ક પડન ટ કડ થ ય છ. ઈ.સ.840મ

More information

જ.ક ર. ૮૮/૨૦૧૬-૧૭ જગ મ ન ન ભ : વયક ય ક ર જ ભ બ ત કળ સ ત ર તલમન ભદદન ળ પ ર ધ મ ક, લ ગ-૨

જ.ક ર. ૮૮/૨૦૧૬-૧૭ જગ મ ન ન ભ : વયક ય ક ર જ ભ બ ત કળ સ ત ર તલમન ભદદન ળ પ ર ધ મ ક, લ ગ-૨ જ.ક ર. ૮૮/૨૦૧૬-૧૭ જગ મ ન ન ભ : વયક ય ક ર જ ભ બ ત કળ સ ત ર તલમન ભદદન ળ પ ર ધ મ ક, લ ગ-૨ બ ગ-૧ અન બ ગ-૨ ન ૧૫૦ તભતનટન વ યક પ રશ નત રન પ ર થતભક કવ ટ ન અભ મ વક રભ પ ર થતભક કવ ટ ન અભ મ વક રભ બ ગ-૧ ભ ધ મભ: ગજય

More information

Physical Setting & Earth Science. Glossary. High School Level. English / Gujarati

Physical Setting & Earth Science. Glossary. High School Level. English / Gujarati High School Level Glossary Physical Setting & Earth Science Glossary English / Gujarati Translation of Physical Setting & Earth Science terms based on the Coursework for Physical Setting & Earth Science

More information

GUJARAT VIDYAPEETH : AHMEDABAD M.D.

GUJARAT VIDYAPEETH : AHMEDABAD M.D. GUJARAT VIDYAPEETH : AHMEDABAD MIC-101:Introduction and Historical Developments in Microbiology (Syllabus of theoretical portion) (With Effect From 1 st July, 2016) (Total Teaching Hours=30, Credit=02)

More information

fresh bond and on usual terms and conditions. Rule is made absolute accordingly. Direct service is permitted.

fresh bond and on usual terms and conditions. Rule is made absolute accordingly. Direct service is permitted. 242 ACQUITTAL AND BAIL CASES ABC 2013 (I) rs on n on usul trms n onitions. Rul is m solut orinly. Dirt srvi is prmitt. Rsult: - Applition sus. ABC 2013 (I) 242GUJ ACQUITTAL & BAIL CASES HIGH COURT OF GUJARAT

More information

(b) Answer the following Explain addition polymerization with example. 2. Give the difference between paint and varnish.

(b) Answer the following Explain addition polymerization with example. 2. Give the difference between paint and varnish. Seat No.: Enrolment No. GUJARAT TECHNOLOGICAL UNIVERSITY Diploma Engineering Semester I/II Examination WINTER 2013 Subject code: 320002 Date: 20/12/2013 Subject Name: Applied Science - II (CHEMISTRY) Time:

More information

ABC 2011 (II) 111 GUJ

ABC 2011 (II) 111 GUJ Arvini Mnkll Ptl Vs. Stt o Gujrt &Anr (Guj) 111 s not on to t otr ontntions nvss y lrn vots pprin on l o t rsptiv prtis s impun omplint/ Criminl Cs is rquir to qus n st si on t orsi roun lon. 11. In viw

More information

V;ZSFZS JU"B\0 lx1f6 શ ક ષક ત લ મ મ ટ ન MT ત લ મ વર ગ સ નલ ક. ચ હ ણ લ ક ચરર ડ ય ટ - ર જક ટ L/O/G/O 5/29/2014 સ નલ ક. ચ હ ણ, લ કચરર, ડ ય ટ- ર જક ટ

V;ZSFZS JUB\0 lx1f6 શ ક ષક ત લ મ મ ટ ન MT ત લ મ વર ગ સ નલ ક. ચ હ ણ લ ક ચરર ડ ય ટ - ર જક ટ L/O/G/O 5/29/2014 સ નલ ક. ચ હ ણ, લ કચરર, ડ ય ટ- ર જક ટ 1 V;ZSFZS JU"B\0 lx1f6 L/O/G/O શ ક ષક ત લ મ મ ટ ન MT ત લ મ વર ગ સ નલ ક. ચ હ ણ લ ક ચરર ડ ય ટ - JU"jIJCFZ jilstut VlEUD lx1f6 5 ls IFGM GFGM EFU 5 tifig 5 ls IF 2 asrkirk vgd (SxN n)c[n) bibti[ pr aifir)t

More information

SAURASHTRA UNIVERSITY STRUCTURE OF THE COURSES. SUBJECT : GEOGRAPHY B.A. ALL SEMESTER (Revised Syllabus in Force from: June-2016)

SAURASHTRA UNIVERSITY STRUCTURE OF THE COURSES. SUBJECT : GEOGRAPHY B.A. ALL SEMESTER (Revised Syllabus in Force from: June-2016) SAURASHTRA UNIVERSITY STRUCTURE OF THE COURSES 0F SUBJECT : GEOGRAPHY B.A. ALL SEMESTER (Revised Syllabus in Force from: June-2016) Page 1 of 49 Sr No Level Sem Course Group SAURASHTRA UNIVERSITY ARTS

More information

STD 5 th F.A 1 EXAMINATION MARKS 30 SUB CAMBRIDGE

STD 5 th F.A 1 EXAMINATION MARKS 30 SUB CAMBRIDGE STD 5 th F.A 1 EXAMINATION MARKS 30 SUB CAMBRIDGE DATE Q1 MATCHING. (5) 1 Pack of Beas 2 Suite of Drawers 3 Swarm of Actor 4 Chest of Furniture 5 Company of Wolves Q2 FILL IN THE BLANKS WITH PROPERNONN

More information

Neo-Brahmi Generation Panel Meeting

Neo-Brahmi Generation Panel Meeting 1 Neo-Brahmi Generation Panel Meeting Overall Progress Ajay Data, Mahesh Kulkarni, Udaya Narayana Singh, NBGP Co-Chairs Presented by Akshat Joshi ICANN 60 31 October 2017 2 Why the name Neo-Brāhmī? Background:

More information

1. 5ZL1FF 5]l:TSFGF VF 5FGF 5Z E}ZL q SF/L AM, 5M.g85[G YL ljutm T]Z\T EZJLP 5[lg;,GM p5imu ;J"YF lglqfâ K[P

1. 5ZL1FF 5]l:TSFGF VF 5FGF 5Z E}ZL q SF/L AM, 5M.g85[G YL ljutm T]Z\T EZJLP 5[lg;,GM p5imu ;JYF lglqfâ K[P This booklet contains 28 printed pages. VF 5]l:TSFDF\ 28 5FGF D]lãT K[P PAPER 1 : PHYSICS, MATHEMATICS & CHEMISTRY 5ZL1FF 5]l:TSF v 1 o EF{lTS XF:+4 Ul6T VG[ Z;FI6 XF:+ Do not open this Test Booklet until

More information

NEW ERA SENIOR SECONDARY SCHOOL, NIZAMPURA SUMMER HOLIDAY ASSIGNMENT CLASS: IX

NEW ERA SENIOR SECONDARY SCHOOL, NIZAMPURA SUMMER HOLIDAY ASSIGNMENT CLASS: IX (P.T.O.) NEW ERA SENIOR SECONDARY SCHOOL, NIZAMPURA SUMMER HOLIDAY ASSIGNMENT CLASS: IX NAME: ROLL NO.: DIV.: ENGLISH 1) Read the book Three men in a boat.write any three funny incidents of the story.

More information

CLRG Biocreative V

CLRG Biocreative V CLRG ChemTMiner @ Biocreative V Sobha Lalitha Devi., Sindhuja Gopalan., Vijay Sundar Ram R., Malarkodi C.S., Lakshmi S., Pattabhi RK Rao Computational Linguistics Research Group, AU-KBC Research Centre

More information

FROM QUERIES TO TOP-K RESULTS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

FROM QUERIES TO TOP-K RESULTS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS FROM QUERIES TO TOP-K RESULTS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Retrieval evaluation Link

More information

Preliminary proposal to encode Devanagari letter numerals

Preliminary proposal to encode Devanagari letter numerals Srinidhi A and Sridatta A Tumakuru, India srinidhi.pinkpetals24@gmail.com, sridatta.jamadagni@gmail.com 1 Introduction January 17, 2017 This is preliminary proposal to encode letter numerals found in Devanagari

More information

Paterson Public Schools

Paterson Public Schools A. Concepts About Print Understand how print is organized and read. (See LAL Curriculum Framework Grade KR Page 1 of 12) Understand that all print materials in English follow similar patterns. (See LAL

More information

Feature Extraction Using Zernike Moments

Feature Extraction Using Zernike Moments Feature Extraction Using Zernike Moments P. Bhaskara Rao Department of Electronics and Communication Engineering ST.Peter's Engineeing college,hyderabad,andhra Pradesh,India D.Vara Prasad Department of

More information

Optimum parameter selection for K.L.D. based Authorship Attribution for Gujarati

Optimum parameter selection for K.L.D. based Authorship Attribution for Gujarati Optimum parameter selection for K.L.D. based Authorship Attribution for Gujarati Parth Mehta DA-IICT, Gandhinagar parth.mehta126@gmail.com Prasenjit Majumder DA-IICT, Gandhinagar prasenjit.majumder@gmail.com

More information

Applying Phonetic Matching Algorithm to Tongue Twister Retrieval in Japanese

Applying Phonetic Matching Algorithm to Tongue Twister Retrieval in Japanese 1 1 n-gram 2 Applying Phonetic Matching Algorithm to Tongue Twister Retrieval in Japanese Michiko Yasukawa 1 and Hidetoshi Yokoo 1 In this paper, we propose a Japanese phonetic matching algorithm for tongue

More information

The effect of speaking rate and vowel context on the perception of consonants. in babble noise

The effect of speaking rate and vowel context on the perception of consonants. in babble noise The effect of speaking rate and vowel context on the perception of consonants in babble noise Anirudh Raju Department of Electrical Engineering, University of California, Los Angeles, California, USA anirudh90@ucla.edu

More information

An Improved Stemming Approach Using HMM for a Highly Inflectional Language

An Improved Stemming Approach Using HMM for a Highly Inflectional Language An Improved Stemming Approach Using HMM for a Highly Inflectional Language Navanath Saharia 1, Kishori M. Konwar 2, Utpal Sharma 1, and Jugal K. Kalita 3 1 Department of CSE, Tezpur University, India {nava

More information

Count-Min Tree Sketch: Approximate counting for NLP

Count-Min Tree Sketch: Approximate counting for NLP Count-Min Tree Sketch: Approximate counting for NLP Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand and Abdul Mouhamadsultane exensa firstname.lastname@exensa.com arxiv:64.5492v [cs.ir] 9 Apr 26

More information

Lecture 1b: Text, terms, and bags of words

Lecture 1b: Text, terms, and bags of words Lecture 1b: Text, terms, and bags of words Trevor Cohn (based on slides by William Webber) COMP90042, 2015, Semester 1 Corpus, document, term Body of text referred to as corpus Corpus regarded as a collection

More information

Second Grade First Nine Week ELA Study Guide 2016

Second Grade First Nine Week ELA Study Guide 2016 Second Grade First Nine Week ELA Study Guide 2016 Second grade students will take their nine-week tests the week of October 17, 2016. For the ELA Test, students will read 3 passages (a fictional selection,

More information

Concepts: Understand the meaning of words related to dinosaurs and paleontologists; describe dinosaurs; describe the work of paleontologists.

Concepts: Understand the meaning of words related to dinosaurs and paleontologists; describe dinosaurs; describe the work of paleontologists. DY 1 Concept & Vocabulary Development Grade 5 Unit Question of the Week: How can paleontologists help us understand the past? To introduce and discuss concepts and vocabulary related to dinosaurs and paleontology.

More information

perplexity = 2 cross-entropy (5.1) cross-entropy = 1 N log 2 likelihood (5.2) likelihood = P(w 1 w N ) (5.3)

perplexity = 2 cross-entropy (5.1) cross-entropy = 1 N log 2 likelihood (5.2) likelihood = P(w 1 w N ) (5.3) Chapter 5 Language Modeling 5.1 Introduction A language model is simply a model of what strings (of words) are more or less likely to be generated by a speaker of English (or some other language). More

More information

Deep Learning for Natural Language Processing. Sidharth Mudgal April 4, 2017

Deep Learning for Natural Language Processing. Sidharth Mudgal April 4, 2017 Deep Learning for Natural Language Processing Sidharth Mudgal April 4, 2017 Table of contents 1. Intro 2. Word Vectors 3. Word2Vec 4. Char Level Word Embeddings 5. Application: Entity Matching 6. Conclusion

More information

Paterson Public Schools

Paterson Public Schools By this marking period students should be applying skills and strategies learned in the first three marking periods. (See LAL Curriculum frameworks Grade KR page 1 of 12) Identifying front cover, back

More information

VEER NARMAD SOUTH GUJARAT UNIVERSITY, SURAT. SYLLABUS FOR B.Sc. (MATHEMATICS) Semester: I, II Effective from December 2013

VEER NARMAD SOUTH GUJARAT UNIVERSITY, SURAT. SYLLABUS FOR B.Sc. (MATHEMATICS) Semester: I, II Effective from December 2013 Semester: I, II Effective from December 2013 Semester Paper Name of the Paper Hours Credit Marks I II MTH-101 Trigonometry 3 3 MTH-102 Differential Calculus 3 3 MTH-201 Theory of Matrices 3 3 MTH-202 Integral

More information

N-grams. Motivation. Simple n-grams. Smoothing. Backoff. N-grams L545. Dept. of Linguistics, Indiana University Spring / 24

N-grams. Motivation. Simple n-grams. Smoothing. Backoff. N-grams L545. Dept. of Linguistics, Indiana University Spring / 24 L545 Dept. of Linguistics, Indiana University Spring 2013 1 / 24 Morphosyntax We just finished talking about morphology (cf. words) And pretty soon we re going to discuss syntax (cf. sentences) In between,

More information

Natural Language Processing SoSe Words and Language Model

Natural Language Processing SoSe Words and Language Model Natural Language Processing SoSe 2016 Words and Language Model Dr. Mariana Neves May 2nd, 2016 Outline 2 Words Language Model Outline 3 Words Language Model Tokenization Separation of words in a sentence

More information

OXFORD DICTIONARY OF CHEMISTRY EPUB

OXFORD DICTIONARY OF CHEMISTRY EPUB 14 March, 2018 OXFORD DICTIONARY OF CHEMISTRY EPUB Document Filetype: PDF 88.69 KB 0 OXFORD DICTIONARY OF CHEMISTRY EPUB Virus and Malware free No extra costs. Download the 1 Oxford Dictionary of Chemistry

More information

Hierarchical Bayesian Nonparametric Models of Language and Text

Hierarchical Bayesian Nonparametric Models of Language and Text Hierarchical Bayesian Nonparametric Models of Language and Text Gatsby Computational Neuroscience Unit, UCL Joint work with Frank Wood *, Jan Gasthaus *, Cedric Archambeau, Lancelot James SIGIR Workshop

More information

Hidden Markov Models, I. Examples. Steven R. Dunbar. Toy Models. Standard Mathematical Models. Realistic Hidden Markov Models.

Hidden Markov Models, I. Examples. Steven R. Dunbar. Toy Models. Standard Mathematical Models. Realistic Hidden Markov Models. , I. Toy Markov, I. February 17, 2017 1 / 39 Outline, I. Toy Markov 1 Toy 2 3 Markov 2 / 39 , I. Toy Markov A good stack of examples, as large as possible, is indispensable for a thorough understanding

More information

Spatial Role Labeling CS365 Course Project

Spatial Role Labeling CS365 Course Project Spatial Role Labeling CS365 Course Project Amit Kumar, akkumar@iitk.ac.in Chandra Sekhar, gchandra@iitk.ac.in Supervisor : Dr.Amitabha Mukerjee ABSTRACT In natural language processing one of the important

More information

Soundex distance metric

Soundex distance metric Text Algorithms (4AP) Lecture: Time warping and sound Jaak Vilo 008 fall Jaak Vilo MTAT.03.90 Text Algorithms Soundex distance metric Soundex is a coarse phonetic indexing scheme, widely used in genealogy.

More information

NLTK and Lexical Information

NLTK and Lexical Information NLTK and Lexical Information Marina Sedinkina - Folien von Desislava Zhekova - CIS, LMU marina.sedinkina@campus.lmu.de December 12, 2017 Marina Sedinkina- Folien von Desislava Zhekova - Language Processing

More information

Effective Business Writing Crossword Puzzle The Writing Center, Inc. All rights reserved.

Effective Business Writing Crossword Puzzle The Writing Center, Inc. All rights reserved. Effective Business Writing Crossword Puzzle 2009 The Writing Center, Inc. All rights reserved. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39

More information

Language Arts. Kindergarten. First. Second. The Lower School provides a balanced literacy program that includes:

Language Arts. Kindergarten. First. Second. The Lower School provides a balanced literacy program that includes: Language Arts The Lower School provides a balanced literacy program that includes: modeled reading (read aloud), guided reading (small group instruction), shared reading, independent reading modeled writing,

More information

arxiv: v2 [cs.ds] 9 Nov 2017

arxiv: v2 [cs.ds] 9 Nov 2017 Replace or Retrieve Keywords In Documents At Scale Vikash Singh Belong.co Bangalore, India vikash@belong.co arxiv:1711.00046v2 [cs.ds] 9 Nov 2017 Abstract In this paper we introduce, the FlashText 1 algorithm

More information

The Two Time Pad Encryption System

The Two Time Pad Encryption System Hardware Random Number Generators This document describe the use and function of a one-time-pad style encryption system for field and educational use. You may download sheets free from www.randomserver.dyndns.org/client/random.php

More information

Cross-Lingual Language Modeling for Automatic Speech Recogntion

Cross-Lingual Language Modeling for Automatic Speech Recogntion GBO Presentation Cross-Lingual Language Modeling for Automatic Speech Recogntion November 14, 2003 Woosung Kim woosung@cs.jhu.edu Center for Language and Speech Processing Dept. of Computer Science The

More information

Probabilistic Spelling Correction CE-324: Modern Information Retrieval Sharif University of Technology

Probabilistic Spelling Correction CE-324: Modern Information Retrieval Sharif University of Technology Probabilistic Spelling Correction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan lectures

More information

Gujarat Subordinate Service Selection Board (GSSSB) Syllabus

Gujarat Subordinate Service Selection Board (GSSSB) Syllabus GSSSB SI Syllabus - English Sentence Arrangement Sentence Improvement Spotting Errors Antonyms Sentence Completion Degrees of comparison Direct & Indirect speech Joining Sentences Passage Completion Substitution

More information

Data Compression Techniques

Data Compression Techniques Data Compression Techniques Part 2: Text Compression Lecture 5: Context-Based Compression Juha Kärkkäinen 14.11.2017 1 / 19 Text Compression We will now look at techniques for text compression. These techniques

More information

1 Ex. 1 Verify that the function H(p 1,..., p n ) = k p k log 2 p k satisfies all 8 axioms on H.

1 Ex. 1 Verify that the function H(p 1,..., p n ) = k p k log 2 p k satisfies all 8 axioms on H. Problem sheet Ex. Verify that the function H(p,..., p n ) = k p k log p k satisfies all 8 axioms on H. Ex. (Not to be handed in). looking at the notes). List as many of the 8 axioms as you can, (without

More information

Probabilistic Language Modeling

Probabilistic Language Modeling Predicting String Probabilities Probabilistic Language Modeling Which string is more likely? (Which string is more grammatical?) Grill doctoral candidates. Regina Barzilay EECS Department MIT November

More information

Language Modeling. Michael Collins, Columbia University

Language Modeling. Michael Collins, Columbia University Language Modeling Michael Collins, Columbia University Overview The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques: Linear interpolation Discounting

More information

Overview. 1. Introduction to Propositional Logic. 2. Operations on Propositions. 3. Truth Tables. 4. Translating Sentences into Logical Expressions

Overview. 1. Introduction to Propositional Logic. 2. Operations on Propositions. 3. Truth Tables. 4. Translating Sentences into Logical Expressions Note 01 Propositional Logic 1 / 10-1 Overview 1. Introduction to Propositional Logic 2. Operations on Propositions 3. Truth Tables 4. Translating Sentences into Logical Expressions 5. Preview: Propositional

More information

Comments on Proposal to Encode the Saurashtra Script in the UCS Vide: ISO/IEC JTC1/SC2/EG2 N3607 LN/ by Peri Bhaskararao

Comments on Proposal to Encode the Saurashtra Script in the UCS Vide: ISO/IEC JTC1/SC2/EG2 N3607 LN/ by Peri Bhaskararao Comments on Proposal to Encode the Saurashtra Script in the UCS Vide: ISO/IEC JTC1/SC2/EG2 N3607 LN/03-231 by Peri Bhaskararao bhaskar@aa.tufs.ac.jp [Section numbers below refer to the section numbers

More information

The Language Modeling Problem (Fall 2007) Smoothed Estimation, and Language Modeling. The Language Modeling Problem (Continued) Overview

The Language Modeling Problem (Fall 2007) Smoothed Estimation, and Language Modeling. The Language Modeling Problem (Continued) Overview The Language Modeling Problem We have some (finite) vocabulary, say V = {the, a, man, telescope, Beckham, two, } 6.864 (Fall 2007) Smoothed Estimation, and Language Modeling We have an (infinite) set of

More information

Ngram Review. CS 136 Lecture 10 Language Modeling. Thanks to Dan Jurafsky for these slides. October13, 2017 Professor Meteer

Ngram Review. CS 136 Lecture 10 Language Modeling. Thanks to Dan Jurafsky for these slides. October13, 2017 Professor Meteer + Ngram Review October13, 2017 Professor Meteer CS 136 Lecture 10 Language Modeling Thanks to Dan Jurafsky for these slides + ASR components n Feature Extraction, MFCCs, start of Acoustic n HMMs, the Forward

More information

Cape Henlopen School District Curriculum Map Grade 1. September

Cape Henlopen School District Curriculum Map Grade 1. September September Look Who s Recognize letters and relate sound to the symbol Identify initial and final consonants Blend sounds orally Listen for vowel sounds Recognize parts of a book Identify and generate rhymes

More information

Latent Dirichlet Allocation Based Multi-Document Summarization

Latent Dirichlet Allocation Based Multi-Document Summarization Latent Dirichlet Allocation Based Multi-Document Summarization Rachit Arora Department of Computer Science and Engineering Indian Institute of Technology Madras Chennai - 600 036, India. rachitar@cse.iitm.ernet.in

More information

LANGUAGE ARTS Junior Elementary

LANGUAGE ARTS Junior Elementary LANGUAGE ARTS History of Writing Oral Language Handwriting Development Development of manuscript Development of cursive Word Study A Vowels and consonants Building words phonetically Silent e phonograms

More information

Possibility of Existence and Identification of Diphthongs and Triphthongs in Urdu Language

Possibility of Existence and Identification of Diphthongs and Triphthongs in Urdu Language 16 Kiran Khurshid, Salman Ahmad Usman and Nida Javaid Butt Possibility of Existence and Identification of Diphthongs and Triphthongs in Urdu Language Abstract: This paper gives an account of possible diphthongs

More information

Native Hawaiian Dry Forest Plants. Developed by: Nick DeBoer and Sylvie Bright

Native Hawaiian Dry Forest Plants. Developed by: Nick DeBoer and Sylvie Bright Grade Level: Sixth grade Native Hawaiian Dry Forest Plants Developed by: Nick DeBoer and Sylvie Bright Purpose: This curriculum is designed to communicate: I. How dry forest ecosystems are affected by

More information

Bhakta Kavi Narsinh Mehta University, Junagadh

Bhakta Kavi Narsinh Mehta University, Junagadh Bhakta Kavi Narsinh Mehta University, Junagadh FACULTY OF ARTS SYLLABUS FOR M.A. (GEOGRAPHY ) SEMESTER - 1 to 4 (FOR REGULAR CANDIDATES ) Effective from JUNE 2016 - 100-100 - 100-100 100 100 100 Bhakta

More information

The OntoNL Semantic Relatedness Measure for OWL Ontologies

The OntoNL Semantic Relatedness Measure for OWL Ontologies The OntoNL Semantic Relatedness Measure for OWL Ontologies Anastasia Karanastasi and Stavros hristodoulakis Laboratory of Distributed Multimedia Information Systems and Applications Technical University

More information

Request for editorial updates to various Indic scripts. 1. Generic Indic

Request for editorial updates to various Indic scripts. 1. Generic Indic Request for editorial updates to various Indic scripts Shriramana Sharma, jamadagni-at-gmail-dot-com, India 2012-Mar-17 1. Generic Indic The phonological sequence /r vocalic_r/ occurs now and then in Sanskrit.

More information

Recurrent Neural Networks. deeplearning.ai. Why sequence models?

Recurrent Neural Networks. deeplearning.ai. Why sequence models? Recurrent Neural Networks deeplearning.ai Why sequence models? Examples of sequence data The quick brown fox jumped over the lazy dog. Speech recognition Music generation Sentiment classification There

More information

Optical Character Recognition of Jutakshars within Devanagari Script

Optical Character Recognition of Jutakshars within Devanagari Script Optical Character Recognition of Jutakshars within Devanagari Script Sheallika Singh Shreesh Ladha Supervised by : Dr. Harish Karnick, Dr. Amit Mitra UGP Presentation, 10 April 2016 OCR of Jutakshars within

More information

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Maximum Entropy Models I Welcome back for the 3rd module

More information

TnT Part of Speech Tagger

TnT Part of Speech Tagger TnT Part of Speech Tagger By Thorsten Brants Presented By Arghya Roy Chaudhuri Kevin Patel Satyam July 29, 2014 1 / 31 Outline 1 Why Then? Why Now? 2 Underlying Model Other technicalities 3 Evaluation

More information

Information Retrieval CS Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 03 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Statistical Properties of Text Zipf s Law models the distribution of terms

More information

Syntax versus Semantics:

Syntax versus Semantics: Syntax versus Semantics: Analysis of Enriched Vector Space Models Benno Stein and Sven Meyer zu Eissen and Martin Potthast Bauhaus University Weimar Relevance Computation Information retrieval aims at

More information

Hierarchical Bayesian Nonparametric Models of Language and Text

Hierarchical Bayesian Nonparametric Models of Language and Text Hierarchical Bayesian Nonparametric Models of Language and Text Gatsby Computational Neuroscience Unit, UCL Joint work with Frank Wood *, Jan Gasthaus *, Cedric Archambeau, Lancelot James August 2010 Overview

More information

Department of Computer Science and Engineering Indian Institute of Technology, Kanpur. Spatial Role Labeling

Department of Computer Science and Engineering Indian Institute of Technology, Kanpur. Spatial Role Labeling Department of Computer Science and Engineering Indian Institute of Technology, Kanpur CS 365 Artificial Intelligence Project Report Spatial Role Labeling Submitted by Satvik Gupta (12633) and Garvit Pahal

More information

Exercise 1: Basics of probability calculus

Exercise 1: Basics of probability calculus : Basics of probability calculus Stig-Arne Grönroos Department of Signal Processing and Acoustics Aalto University, School of Electrical Engineering stig-arne.gronroos@aalto.fi [21.01.2016] Ex 1.1: Conditional

More information

Prenominal Modifier Ordering via MSA. Alignment

Prenominal Modifier Ordering via MSA. Alignment Introduction Prenominal Modifier Ordering via Multiple Sequence Alignment Aaron Dunlop Margaret Mitchell 2 Brian Roark Oregon Health & Science University Portland, OR 2 University of Aberdeen Aberdeen,

More information

Toponym Disambiguation using Ontology-based Semantic Similarity

Toponym Disambiguation using Ontology-based Semantic Similarity Toponym Disambiguation using Ontology-based Semantic Similarity David S Batista 1, João D Ferreira 2, Francisco M Couto 2, and Mário J Silva 1 1 IST/INESC-ID Lisbon, Portugal {dsbatista,msilva}@inesc-id.pt

More information

Second Grade Pre Post Test. Teacher s Edition

Second Grade Pre Post Test. Teacher s Edition P r e - P o s t T e s t 2 n d G r a d e T e a c h e r s E d i t i o n 1 Second Grade Pre Post Test Teacher s Edition English Pre Post Test Teacher s Edition 2 nd Grade P r e - P o s t T e s t 2 n d G r

More information

DT2118 Speech and Speaker Recognition

DT2118 Speech and Speaker Recognition DT2118 Speech and Speaker Recognition Language Modelling Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 56 Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language

More information

Natural Language Processing SoSe Language Modelling. (based on the slides of Dr. Saeedeh Momtazi)

Natural Language Processing SoSe Language Modelling. (based on the slides of Dr. Saeedeh Momtazi) Natural Language Processing SoSe 2015 Language Modelling Dr. Mariana Neves April 20th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline 2 Motivation Estimation Evaluation Smoothing Outline 3 Motivation

More information

Accelerated Natural Language Processing Lecture 3 Morphology and Finite State Machines; Edit Distance

Accelerated Natural Language Processing Lecture 3 Morphology and Finite State Machines; Edit Distance Accelerated Natural Language Processing Lecture 3 Morphology and Finite State Machines; Edit Distance Sharon Goldwater (based on slides by Philipp Koehn) 20 September 2018 Sharon Goldwater ANLP Lecture

More information

Phonological Correspondence as a Tool for Historical Analysis: An Algorithm for Word Alignment

Phonological Correspondence as a Tool for Historical Analysis: An Algorithm for Word Alignment Phonological Correspondence as a Tool for Historical Analysis: An Algorithm for Word Alignment Daniel M. Albro March 11, 1997 1 Introduction In historical linguistics research it is often necessary to

More information

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information. L65 Dept. of Linguistics, Indiana University Fall 205 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission rate

More information

Dept. of Linguistics, Indiana University Fall 2015

Dept. of Linguistics, Indiana University Fall 2015 L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 28 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission

More information

Source Coding Techniques

Source Coding Techniques Source Coding Techniques. Huffman Code. 2. Two-pass Huffman Code. 3. Lemple-Ziv Code. 4. Fano code. 5. Shannon Code. 6. Arithmetic Code. Source Coding Techniques. Huffman Code. 2. Two-path Huffman Code.

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Language Models. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Language Models. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Language Models Tobias Scheffer Stochastic Language Models A stochastic language model is a probability distribution over words.

More information

SYNTHER A NEW M-GRAM POS TAGGER

SYNTHER A NEW M-GRAM POS TAGGER SYNTHER A NEW M-GRAM POS TAGGER David Sündermann and Hermann Ney RWTH Aachen University of Technology, Computer Science Department Ahornstr. 55, 52056 Aachen, Germany {suendermann,ney}@cs.rwth-aachen.de

More information

Information Extraction from Biomedical Text. BMI/CS 776 Mark Craven

Information Extraction from Biomedical Text. BMI/CS 776  Mark Craven Information Extraction from Biomedical Text BMI/CS 776 www.biostat.wisc.edu/bmi776/ Mark Craven craven@biostat.wisc.edu Spring 2012 Goals for Lecture the key concepts to understand are the following! named-entity

More information

Presented By: Omer Shmueli and Sivan Niv

Presented By: Omer Shmueli and Sivan Niv Deep Speaker: an End-to-End Neural Speaker Embedding System Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, Zhenyao Zhu Presented By: Omer Shmueli and Sivan

More information

Machine Learning for natural language processing

Machine Learning for natural language processing Machine Learning for natural language processing N-grams and language models Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 25 Introduction Goals: Estimate the probability that a

More information

Conceptual Similarity: Why, Where, How

Conceptual Similarity: Why, Where, How Conceptual Similarity: Why, Where, How Michalis Sfakakis Laboratory on Digital Libraries & Electronic Publishing, Department of Archives and Library Sciences, Ionian University, Greece First Workshop on

More information

Speech Recognition Lecture 5: N-gram Language Models. Eugene Weinstein Google, NYU Courant Institute Slide Credit: Mehryar Mohri

Speech Recognition Lecture 5: N-gram Language Models. Eugene Weinstein Google, NYU Courant Institute Slide Credit: Mehryar Mohri Speech Recognition Lecture 5: N-gram Language Models Eugene Weinstein Google, NYU Courant Institute eugenew@cs.nyu.edu Slide Credit: Mehryar Mohri Components Acoustic and pronunciation model: Pr(o w) =

More information

Sequences and Information

Sequences and Information Sequences and Information Rahul Siddharthan The Institute of Mathematical Sciences, Chennai, India http://www.imsc.res.in/ rsidd/ Facets 16, 04/07/2016 This box says something By looking at the symbols

More information

Fibonacci Coding for Lossless Data Compression A Review

Fibonacci Coding for Lossless Data Compression A Review RESEARCH ARTICLE OPEN ACCESS Fibonacci Coding for Lossless Data Compression A Review Ezhilarasu P Associate Professor Department of Computer Science and Engineering Hindusthan College of Engineering and

More information

Text and multimedia languages and properties

Text and multimedia languages and properties Text and multimedia languages and properties (Modern IR, Ch. 6) I VP R 1 Introduction Metadata Text Markup Languages Multimedia Trends and Research Issues Bibliographical Discussion I VP R 2 Introduction

More information

Learning Textual Entailment using SVMs and String Similarity Measures

Learning Textual Entailment using SVMs and String Similarity Measures Learning Textual Entailment using SVMs and String Similarity Measures Prodromos Malakasiotis and Ion Androutsopoulos Department of Informatics Athens University of Economics and Business Patision 76, GR-104

More information

AP Human Geography. Course Materials

AP Human Geography. Course Materials AP Human Geography This is a syllabus for a two semester Advanced Placement Human Geography course that has been offered for several years at this school. The material covered is based on the AP Human

More information

Magic Words And Language Patterns The Hypnotists Essential Guide To Crafting Irresistible Suggestions Handbook For Scriptless Hypnosis

Magic Words And Language Patterns The Hypnotists Essential Guide To Crafting Irresistible Suggestions Handbook For Scriptless Hypnosis Magic Words And Language Patterns The Hypnotists Essential Guide To Crafting Irresistible Suggestions We have made it easy for you to find a PDF Ebooks without any digging. And by having access to our

More information

Course 495: Advanced Statistical Machine Learning/Pattern Recognition

Course 495: Advanced Statistical Machine Learning/Pattern Recognition Course 495: Advanced Statistical Machine Learning/Pattern Recognition Lecturer: Stefanos Zafeiriou Goal (Lectures): To present discrete and continuous valued probabilistic linear dynamical systems (HMMs

More information

An Evolution Strategy for the Induction of Fuzzy Finite-state Automata

An Evolution Strategy for the Induction of Fuzzy Finite-state Automata Journal of Mathematics and Statistics 2 (2): 386-390, 2006 ISSN 1549-3644 Science Publications, 2006 An Evolution Strategy for the Induction of Fuzzy Finite-state Automata 1,2 Mozhiwen and 1 Wanmin 1 College

More information

Regular expressions and automata

Regular expressions and automata Regular expressions and automata Introduction Finite State Automaton (FSA) Finite State Transducers (FST) Regular expressions(i) (Res) Standard notation for characterizing text sequences Specifying text

More information

Lecture 2: N-gram. Kai-Wei Chang University of Virginia Couse webpage:

Lecture 2: N-gram. Kai-Wei Chang University of Virginia Couse webpage: Lecture 2: N-gram Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS 6501: Natural Language Processing 1 This lecture Language Models What are

More information

Introduction to Theoretical Computer Science. Motivation. Automata = abstract computing devices

Introduction to Theoretical Computer Science. Motivation. Automata = abstract computing devices Introduction to Theoretical Computer Science Motivation Automata = abstract computing devices Turing studied Turing Machines (= computers) before there were any real computers We will also look at simpler

More information

Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet)

Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet) Compression Motivation Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet) Storage: Store large & complex 3D models (e.g. 3D scanner

More information