A Geometric Method to Obtain the Generation Probability of a Sentence


Chen Lijiang
Nanjing Normal University
ljchen97@6.com

Abstract

"How to generate a sentence" is the most critical and difficult problem in natural language processing. In this paper, we present a new approach that explains the generation process of a sentence from the perspective of mathematics. Our method is based on the premise that, in our brain, a sentence is part of a word network formed by many word nodes. Experiments show that the probability of an entire sentence can be obtained from the probabilities of single words and the probabilities of co-occurrence of word pairs, which indicates that humans use a synthesis method to generate a sentence.

How the human brain realizes natural language understanding is still an open question. With the development of statistical natural language processing technology, people gradually realized that we should use possibility to determine whether a sentence is reasonable; that is, use probability to measure the grammatical well-formedness of a sentence. Suppose the probability of a first sentence is 10^-5, the probability of a second sentence is 10^-8, and the probability of a third sentence is 10^-10. Then the first sentence is most likely the correct sentence: it is 1,000 times more likely to be correct than the second sentence, and 100,000 times more likely than the third.

We use P(x_1 x_2 … x_{n-1} x_n) to represent the probability of the sentence, where n is the number of words in the sentence and x_i and x_j (1 ≤ i, j ≤ n) are words in the

sentence. Using the conditional probability formula, P(x_1 x_2 … x_{n-1} x_n) can be expanded as follows:

P(x_1 x_2 … x_{n-1} x_n) = P(x_1) · P(x_2 | x_1) · P(x_3 | x_1, x_2) ⋯ P(x_n | x_1, x_2, …, x_{n-1})

If we have enough data, the probability of the first word, P(x_1), and the conditional probability of the second word, P(x_2 | x_1), should be very easy to calculate. However, because the sequence x_1 x_2 x_3 occurs less often in the corpus, it is already very difficult to calculate the conditional probability of the third word, P(x_3 | x_1, x_2). When n is large enough, even if we have plenty of data, the number of times x_1 x_2 … x_{n-1} x_n appears may be very small, and the sequence may never appear in the data files at all. Estimating P(x_n | x_1, x_2, …, x_{n-1}) thus becomes an impossible task. We call this the data sparsity problem.

To overcome the difficulty of data sparsity, the famous mathematician Andrey Markov found a solution: assume that the probability of any word depends only on the word immediately before it. Finding the probability of a sentence then becomes simple:

P(x_1 x_2 … x_{n-1} x_n) = P(x_1) · P(x_2 | x_1) · P(x_3 | x_2) ⋯ P(x_n | x_{n-1})

But this method ignores many long-distance relationships between words in a sentence, such as P(x_3 | x_1), P(x_5 | x_2), etc. Therefore, to determine the exact probability of a sentence, it is necessary to find a new approach.

We use a network structure of words to show the relationships between words (Figure 1), in which each dot represents a word and each edge represents a relationship between two words. An edge between two dots indicates that there is an association between these two words; otherwise, they are independent of each other. Weights can be added to the edges, for instance the probability of the simultaneous occurrence of two words. The dashed line in Figure 1 encloses the area that contains all the words forming a complete sentence. We found that Figure 1 resembles the brain's neural network.
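The Markov (bigram) factorization described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the toy corpus and function names are my own assumptions.

```python
from collections import Counter

def train_bigram_counts(corpus):
    """Count unigrams and adjacent word pairs in a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return unigrams, bigrams

def sentence_probability(sent, unigrams, bigrams, total):
    """P(x1 x2 ... xn) ~= P(x1) * product of P(x_i | x_{i-1}),
    i.e. the Markov (bigram) approximation from the text."""
    if unigrams[sent[0]] == 0:
        return 0.0
    p = unigrams[sent[0]] / total
    for prev, cur in zip(sent, sent[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

# Tiny illustrative corpus (three pre-tokenized sentences).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
unigrams, bigrams = train_bigram_counts(corpus)
total = sum(unigrams.values())
```

On this corpus, P("the cat sat") = P(the) · P(cat | the) · P(sat | cat) = 3/9 · 2/3 · 1/2 = 1/9. The sketch also shows the sparsity problem concretely: any sentence containing an unseen bigram gets probability zero.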
If each neuron in the neural network stands for a word (or another linguistic unit), the marked area in the figure can be viewed as an activated sentence. Generating a sentence can be seen as all the words (or other linguistic units) in the sentence being activated simultaneously in the neural network. If the probability of a sentence can be obtained from the probability of each word (or other linguistic unit) being activated and the coincidence probability of an activated

word pair (or other linguistic units), then as long as the language data is sufficient, we can find the probability of a sentence being generated in any language environment.

Figure 1: activated words in the network forming a sentence

Figure 2: the relationship of three words represented as a network or as a set diagram

In probability theory, we have the following two addition equations:

P(A ∪ B) = P(A) + P(B) − P(AB)    (1)

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(AB) − P(BC) − P(AC) + P(ABC)    (2)

Formula (2) tells us that if we want to know P(ABC), we need to know not only P(A), P(B), P(C) and P(AB), P(BC), P(AC), but also P(A ∪ B ∪ C). Now, we propose a way to obtain P(ABC) under the premise that we know only P(A), P(B), P(C) and P(AB), P(BC), P(AC). In the Venn diagram of three events (see Figure 3), an intuitive feeling is that there is a link between the central
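Formula (2) is a set identity, so it holds exactly even on empirical frequencies. A quick Python check illustrates this; the event probabilities below are arbitrary made-up values, not data from the paper.

```python
import random

random.seed(0)
n = 100_000
# Each trial: independent occurrence of events A, B, C with arbitrary probabilities.
trials = [(random.random() < 0.5, random.random() < 0.4, random.random() < 0.6)
          for _ in range(n)]

def freq(pred):
    """Empirical frequency of trials satisfying pred."""
    return sum(1 for t in trials if pred(t)) / n

# Left-hand side of formula (2): P(A u B u C).
p_union = freq(lambda t: t[0] or t[1] or t[2])

# Right-hand side: inclusion-exclusion over singles, pairs, and the triple.
rhs = (freq(lambda t: t[0]) + freq(lambda t: t[1]) + freq(lambda t: t[2])
       - freq(lambda t: t[0] and t[1]) - freq(lambda t: t[1] and t[2])
       - freq(lambda t: t[0] and t[2]) + freq(lambda t: all(t)))
```

Because inclusion-exclusion is an identity on counts, `p_union` and `rhs` agree to floating-point precision regardless of the chosen probabilities.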

portion of the figure and P(ABC). By the Venn diagram, we hope to derive the joint probability of three events when the probabilities of the single events P(A), P(B), P(C) and the joint probabilities of pairs of events P(AB), P(BC), P(AC) are known. Here we validate our intuition experimentally.

Figure 3: whether the central part of the Venn diagram can represent P(ABC)

To test our hypothesis, we draw a Venn diagram under the premise that we know P(A), P(B), P(C) and P(AB), P(BC), P(AC). Let the areas of the three circles equal P(A), P(B), P(C); then, according to P(AB), P(AC) and P(BC), adjust the positions of the three circles so that each intersection area of two circles equals the corresponding pairwise probability. The details are as follows.

Calculating the Area of the Central Part of the Venn Diagram

We construct a Venn diagram representing the relationship between the three events based on the known probabilities P(A), P(B), P(C) and P(AB), P(BC), P(AC). First we size the circle radii so that the areas of the circles equal

P(A), P(B), P(C), and then adjust the distances between the circle centers so that the pairwise intersection areas equal P(AB), P(BC), P(AC).

Step 1: Let A, B, C respectively be the centers of the three circles (see Figure 4). Set L_AB = r, L_BC = t, L_AC = s. By the calculation methods of plane geometry, we get:

a²·arccos((a² + r² − b²)/(2ar)) + b²·arccos((b² + r² − a²)/(2br)) − (1/2)·√((a + b + r)(a + b − r)(a + r − b)(b + r − a)) = P(AB)

b²·arccos((b² + t² − c²)/(2bt)) + c²·arccos((c² + t² − b²)/(2ct)) − (1/2)·√((b + c + t)(b + c − t)(b + t − c)(c + t − b)) = P(BC)

a²·arccos((a² + s² − c²)/(2as)) + c²·arccos((c² + s² − a²)/(2cs)) − (1/2)·√((a + c + s)(a + c − s)(a + s − c)(c + s − a)) = P(AC)

where a, b, c are the radii of the three circles, with πa² = P(A), πb² = P(B), πc² = P(C).

Based on the three equations above, we cannot get precise closed-form results for r, s, t, but we can use the bisection (split-half) method to obtain approximate solutions of these equations.
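Step 1 can be sketched directly in Python: the standard two-circle lens-area formula above, plus a bisection search for the center distance that produces a given overlap. The function names and the example probabilities P(A) = 0.5, P(B) = 0.4, P(AB) = 0.2 are illustrative assumptions.

```python
import math

def lens_area(a, b, d):
    """Intersection area of two circles with radii a and b whose centers are d apart."""
    if d >= a + b:                      # circles disjoint
        return 0.0
    if d <= abs(a - b):                 # smaller circle fully inside the larger
        return math.pi * min(a, b) ** 2
    t1 = a * a * math.acos((d * d + a * a - b * b) / (2 * d * a))
    t2 = b * b * math.acos((d * d + b * b - a * a) / (2 * d * b))
    t3 = 0.5 * math.sqrt((a + b + d) * (a + b - d) * (a + d - b) * (b + d - a))
    return t1 + t2 - t3

def solve_distance(a, b, target, tol=1e-10):
    """Bisection ('split-half') search for the center distance whose overlap
    equals target; the overlap decreases monotonically on [|a-b|, a+b]."""
    lo, hi = abs(a - b), a + b
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lens_area(a, b, mid) > target:
            lo = mid                    # overlap too large -> move circles apart
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Radii chosen so that circle areas equal the probabilities: pi*a^2 = P(A), etc.
a = math.sqrt(0.5 / math.pi)
b = math.sqrt(0.4 / math.pi)
d = solve_distance(a, b, 0.2)           # distance giving overlap P(AB) = 0.2
```

Applying `solve_distance` to each of the three pairs yields the approximate r, s, t of Step 1.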

Figure 4: calculating the distances between the circle centers

Figure 5: calculating the angle α (1)

Figure 6: calculating the angle α (2)

Step 2: calculate α (see Figures 5 and 6). Knowing L_AB = r, L_BC = t, L_AC = s and the radii a, b, c, we can determine the angle α at the center A according to the formulas below:

cos∠₁ = (a² + s² − c²)/(2as)   (the angle at A between AC and the ray from A to an intersection point of circles A and C)

cos∠₂ = (a² + r² − b²)/(2ar)   (the angle at A between AB and the ray from A to an intersection point of circles A and B)

cos∠BAC = (r² + s² − t²)/(2rs)

Combining these, α = ∠₁ + ∠₂ − ∠BAC, so:

α = arccos((a² + s² − c²)/(2as)) + arccos((a² + r² − b²)/(2ar)) − arccos((r² + s² − t²)/(2rs))

Similarly:

β = arccos((b² + t² − c²)/(2bt)) + arccos((b² + r² − a²)/(2br)) − arccos((r² + t² − s²)/(2rt))

γ = arccos((c² + s² − a²)/(2cs)) + arccos((c² + t² − b²)/(2ct)) − arccos((s² + t² − r²)/(2st))

Figure 7: calculating the area of the central part

Step 3: determine the areas of the three circular segments, each bounded by an arc and a chord (S₁, S₂, S₃; see Figure 7):

S₁ = (α/360°)·πa² − a²·sin(α/2)·cos(α/2)

S₂ = (β/360°)·πb² − b²·sin(β/2)·cos(β/2)

S₃ = (γ/360°)·πc² − c²·sin(γ/2)·cos(γ/2)

Step 4: calculate the area of the triangle formed by the three intersection points. Its sides are the chords L₁ = 2a·sin(α/2), L₂ = 2b·sin(β/2), L₃ = 2c·sin(γ/2). According to Heron's formula:

S₄ = (1/4)·√((L₁ + L₂ + L₃)(L₁ + L₂ − L₃)(L₁ − L₂ + L₃)(L₂ + L₃ − L₁))

Step 5: S_central = S₁ + S₂ + S₃ + S₄

Experiment Analysis

We simulate the random trial by generating random numbers. Let event A be "word A appears in a sentence", event B be "word B appears in a sentence", and event C be "word C appears in a sentence". To calculate P(A), P(B), P(C), P(AB), P(BC), P(AC) and P(ABC), we must generate enough test results randomly. Each result is simplified as a binary triple, such as (1, 0, 1), which means word A appears in the sentence, word B does not appear, and word C appears (Table 1 shows 5 examples).
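Steps 2 through 5 above can be combined into one function. This is a sketch under the assumption that the three circles pairwise intersect and their common region is bounded by one arc of each circle; the segment areas use the equivalent radian form (α/360°)·πa² − a²·sin(α/2)·cos(α/2) = ½a²(α − sin α). As a sanity check, three unit circles centered on an equilateral triangle of side 1 share a Reuleaux triangle of known area (π − √3)/2, which the function reproduces.

```python
import math

def central_area(a, b, c, r, s, t):
    """Area of the common intersection of three circles with radii a, b, c and
    pairwise center distances r = L_AB, s = L_AC, t = L_BC (Steps 2-5).
    Assumes the common region is bounded by one arc of each circle."""
    # Step 2: central angles subtended by the arcs of the common region.
    alpha = (math.acos((a*a + s*s - c*c) / (2*a*s))
             + math.acos((a*a + r*r - b*b) / (2*a*r))
             - math.acos((r*r + s*s - t*t) / (2*r*s)))
    beta = (math.acos((b*b + t*t - c*c) / (2*b*t))
            + math.acos((b*b + r*r - a*a) / (2*b*r))
            - math.acos((r*r + t*t - s*s) / (2*r*t)))
    gamma = (math.acos((c*c + s*s - a*a) / (2*c*s))
             + math.acos((c*c + t*t - b*b) / (2*c*t))
             - math.acos((s*s + t*t - r*r) / (2*s*t)))
    # Step 3: three circular segments, 0.5 * R^2 * (angle - sin(angle)).
    s1 = 0.5 * a*a * (alpha - math.sin(alpha))
    s2 = 0.5 * b*b * (beta - math.sin(beta))
    s3 = 0.5 * c*c * (gamma - math.sin(gamma))
    # Step 4: Heron's formula on the chords joining the intersection points.
    l1 = 2 * a * math.sin(alpha / 2)
    l2 = 2 * b * math.sin(beta / 2)
    l3 = 2 * c * math.sin(gamma / 2)
    p = 0.5 * (l1 + l2 + l3)
    s4 = math.sqrt(p * (p - l1) * (p - l2) * (p - l3))
    # Step 5: total central area.
    return s1 + s2 + s3 + s4
```

For example, `central_area(1, 1, 1, 1, 1, 1)` gives (π − √3)/2 ≈ 0.7048, the classical Reuleaux-triangle area.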

n   A   B   C
1   1   0   1
2   0   1   0
3   1   0   1
4   1   0   1
5   1   0   0

Table 1: five results of the random trial

First, we randomly generated 1,000 copies of probabilistic data. Each copy is obtained from 10,000 independent test results. We obtain P(A), P(B), P(C), P(AB), P(BC), P(AC) and P(ABC) by analyzing the 10,000 results statistically, and calculate the area S of the central part of the Venn diagram. All the data are then sorted by the size of P(ABC). The diagram below shows the relationship between the area S of the central portion and P(ABC):

Figure 8: when the number of tests is 10,000, the curves of the area S of the central part of the Venn diagram and P(ABC)
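The random-trial setup can be sketched as follows. The occurrence probabilities, and the correlation between words A and B, are arbitrary illustrative choices of mine, not the paper's parameters.

```python
import random

random.seed(42)

def generate_trials(n):
    """Simulate n sentences as binary triples (A, B, C): 1 = the word appears.
    Probabilities are illustrative; B co-occurs with A more often than not."""
    trials = []
    for _ in range(n):
        a = int(random.random() < 0.5)
        b = int(random.random() < (0.6 if a else 0.3))
        c = int(random.random() < 0.4)
        trials.append((a, b, c))
    return trials

def empirical_probs(trials):
    """Empirical P(A), P(AB), P(ABC) from the simulated triples."""
    n = len(trials)
    p_a = sum(a for a, _, _ in trials) / n
    p_ab = sum(1 for a, b, _ in trials if a and b) / n
    p_abc = sum(1 for t in trials if all(t)) / n
    return p_a, p_ab, p_abc

trials = generate_trials(10_000)
p_a, p_ab, p_abc = empirical_probs(trials)
```

Each copy of probabilistic data in the experiment corresponds to one such batch of trials; the pairwise frequencies feed the Venn-diagram construction, while P(ABC) is held out for comparison with the central area S.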

Then, we randomly generated 1,000 copies of probabilistic data, each obtained from 1,000,000 independent test results. The diagram below shows the relationship between the area S of the central portion and P(ABC):

Figure 9: when the number of tests is 1,000,000, the curves of the area S of the central part of the Venn diagram and P(ABC)

Finally, the number of tests is increased to 100,000,000 per copy. The diagram below shows the relationship between the area S of the central portion and P(ABC):

Figure 10: when the number of tests is 100,000,000, the curves of the area S of the central part of the Venn diagram and P(ABC)

We found that the area S changes in the same way as P(ABC): as P(ABC) increases, S increases regularly, and the two curves are almost the same. Although S fluctuates somewhat as P(ABC) increases, the fluctuation becomes smaller as the number of tests increases. In Figure 8, the curve of the area S is very broad, indicating that the data fluctuate strongly. In Figure 9, the curve becomes narrower, indicating that the volatility decreases. In Figure 10, the curve is narrower still, indicating that the amplitude of the fluctuation is further reduced. To find the true relationship between P(ABC) and S, we performed a linear regression analysis and conclude that:

P(ABC) ≈ kS² + kS

When n = 100,000,000 and k = 0.6, the curve of kS² + kS almost exactly coincides with the curve of P(ABC) (Figure 11):

Figure 11: when k = 0.6, the curves of kS² + kS and P(ABC) almost coincide
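Because the model has a single coefficient, the least-squares fit of k in P(ABC) ≈ kS² + kS has a closed form. The sketch below uses synthetic sample values, not the experiment's data, and the functional form itself is the empirical guess stated above.

```python
def fit_k(s_values, p_values):
    """Least-squares estimate of k in the one-parameter model P ~= k*(S**2 + S):
    k = sum(P_i * f_i) / sum(f_i**2), where f_i = S_i**2 + S_i."""
    num = sum(p * (s * s + s) for s, p in zip(s_values, p_values))
    den = sum((s * s + s) ** 2 for s in s_values)
    return num / den

# Synthetic check: data generated exactly from the model with k = 0.6.
s_values = [0.05, 0.10, 0.20, 0.35, 0.50]
p_values = [0.6 * (s * s + s) for s in s_values]
```

On noisy (S, P(ABC)) pairs from the actual experiment, the same estimator would return the best-fitting k in the least-squares sense.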

Therefore, we offer an important hypothesis: as the number of tests increases without bound, S and P(ABC) follow the same law of change, which means the area S can represent P(ABC). So we come to an important conclusion: although S and P(ABC) are not exactly the same, their sizes are positively correlated, and the more tests there are, the more identical their trends become. Unfortunately, we are unable to prove that as the number of tests goes to infinity, S and P(ABC) become completely identical with no fluctuation. Moreover, we cannot give a precise theoretical explanation for kS² + kS; we only conjecture that the larger the central part is, the more data belong to P(ABC) and the more likely the simultaneous occurrence of the data points. The experiments show that P(ABC) may depend only on P(A), P(B), P(C) and P(AB), P(BC), P(AC). So, without knowing P(A ∪ B ∪ C), we can still obtain the relative size of P(ABC).

The above method for three events can be extended to four events, five events and even more. The quadruple probability P(ABCD) can be seen as the joint probability of the three probabilities P(AB), P(C) and P(D). Knowing P(AB), P(C), P(D) and P(ABC), P(ABD), P(CD), and using the above method, we can figure out P(ABCD). Therefore, the overall probability of an event can be obtained from the probabilities of its parts; for example, the quality rate of a product depends on the quality rates of all its parts, and the probability for a lottery ticket to win depends on the probability of each number being correct.

Reference

[1] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och and Jeffrey Dean. 2007. Large Language Models in Machine Translation. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), pages 858-867.

[2] William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery. 2007. Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press.

[3] Xiaojin Zhu and Ronald Rosenfeld. 2001. Improving Trigram Language Modeling with the World Wide Web. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pages 533-536.

[4] Derick Wood. 1987. Theory of Computation. Harper & Row Publishers.

[5] Adam Lopez. 2008. Statistical Machine Translation. ACM Computing Surveys, 40(3):1-49.