Models of Language Evolution: Part II

Size: px

Start display at page:

Download "Models of Language Evolution: Part II"

Alexander Wilkerson
6 years ago
Views:

1 Models of Language Evolution: Part II Matilde Marcolli CS101: Mathematical and Computational Linguistics Winter 2015

2 Main Reference Partha Niyogi, The computational nature of language learning and evolution, MIT Press, 2006.

3 Multiple Language Models H = {G 1,..., G n } a space with n grammars corresponding languages L 1,..., L n A on alphabet A probability measure P on A according to which sentences are drawn in data set D speakers of language L i draw sentences according to probability P i with support on L i A (positive examples) distribution P is a weighted combination P = i α i P i where α i = α i,t are the fractions of population (time/generation t) that speak L i, with i α i = 1 learning algorithm A : D H computable function

4 probabilistic convergence: grammar G i is learnable if lim P(A(τ m) = G i ) = 1 m when τ m examples in L i drawn according to P i Population Dynamics State Space: S = space of all possible linguistic compositions of the population identify states s S with possible probability distributions P = (P G ) on H...identify with previous α = (α i ) Example: in 3-parameter model (with 8 possible grammars) have S = {P = (P i ) i=1,...,8 P i 0, i P i = 1} = 7 then distribution on A is P(x) = i P i P i (x)

5 finite sample: probability of formulating a certain hypothesis after a sample of size m p m (G i ) := P(A(τ m ) = G i ) limiting sample: limiting behavior for sample size m p(g i ) := lim m P(A(τ m) = G i ) given at generation/time t a distribution P t S of speakers of the different languages get recursion relation P t+1 = F (P t ) take P t+1,i = p m (G i ) in finite sample case, or P t+1,i = p(g i ) in limiting sample case (P t here determines P)

6 Markov Chain Model specify grammars G i through their syntactic parameters (n = 2 N number of possible settings of parameters) trigger learning algorithm as Markov Chain with 2 N nodes T = transition matrix of the Markov Chain then probabilities p m (G i ) = P(A(τ m ) = G i ) p m (G i ) = (2 N 1 2 N T m ) i i-th component of vector obtained by applying (on the right) matrix T m to normalized row vector 2 N 1 2 N (all components 2 N ) assuming starting with uniform distribution 2 N 1 2 N limiting distribution p(g i ) with T S

7 Interpreting limiting behavior seen in case of Markov Chain Model, if non-unique closed class in Markov Chain decomposition, limiting T matrix can have initial states with different probabilities of reaching different targets Example of matrix T in 3-parameter model seen before 2/5 3/5 1 2/5 3/5 T = starting at s 1 or s 3 will reach s 5 with probability 3/5 and s 2 with probability 2/5 interpret these probabilities as limiting composition of speakers population

8 3-parameter model with homogeneous initial population assume initial population consists only of speakers of one of the languages L i (that is, initial P has P i = 1 and all other P j = 0) after a number of generations observe drift towards other languages: population no longer homogeneous (or remains homogeneous, depending on initial state) at next generation p m (G i ) = (PT m ) i or p(g i ) = (PT ) i with P the initial distribution then this gives new P = (P i ) and recompute p m and p with this P for next generation, etc.

9 Shortcomings of the model simulations in case of 3-parameter model show: +V2-languages remain stable; -V2-languages all drift towards +V2-languages contrary to what known from historical linguistics: languages tend to lose the +V2-parameter (switching to -V2): both English and French historically switched from +V2 to -V2 model relies on many ad hoc assumptions about the state space, the learning algorithm, the probability distributions... some of these are clearly not realistic/adequate to actual language evolution Problem: identify which assumptions are inadequate and modify them and improve the model

10 Can model the S-shape? An observation of Historical Linguistics: S-shape of Language Change a certain change in a language begins very gradually, then proceeds at a much faster rate, then tails off slowly again in evolutionary biology, a logistic shaped curve governs replacement of organisms and of genetic alleles that differ in Darwinian fitness in the case of linguistic change, what mechanism would produce this kind of shape? in the model of language evolution have dependence on maturation time K and probability distribution P (a, b parameters in the 2-languages case; distribution P in multilingual case)

11 simulations in 2-languages model finds: initial rate of language change is highest when K small curves not S-shaped: if start with homogeneous L 1 population, percentage of L 2 speakers with generations first grows to above limiting value, then decreases slopes and location of the peak and of asymptotic value depend on K not clear is these are spurious phenomena due to assumptions in the model, or they say something more about the S-shape empirical observation of historical linguists

13 Effect of the dependence on P i probability distributions P i with which sentences in L i are generated these are the underlying parameters of the dynamical system seen already in 2-languages model, as dependence on a, b parameters if think of these parameters as changing in time, can affect a language change by (gradual) modification of the parameters can use for reverse engineering : if know a certain change occurs, find what diachronic modification of the parameters is needed in order to affect that change, then use to model other changes

14 3-parameter model with non-homogeneous initial population P = (P i ) i=1,...,8 probability distribution on the 8 possible grammars of the 3-parameter model P 7 point in 7-dimensional simplex in R 8 + T = transition matrix of the Markov Chain of 3-parameter model given P t (with P 0 = P) this determines P = P t on A which determines entries of T = T t non-homogrneous Markov Chain with T = T t for finite-sample size m use matrix T m t recursion step: next generation distribution P t+1 = P t T m t

15 Stability Analysis given distributions P i used in computing transition matrix T T ij = P(s i s j ) = x A : A(s i,x)=s j P(x) A(s, x) determines algorithm s next hypothesis, given current state of Markov Chain s and input x let T i be transition matrix when P = P i (target language is L i ) data D all coming from L i drawn according to P i (instead of mixture of languages with combination P) at generation/time t new distribution will be P = P t = i P t,i P i, where P t distribution of languages at generation t

16 get T t transition matrix by (T t ) ab = P t,i P i (x) = i i x A : A(s a,x)=s b P t,i (T i ) ab fixed points are solutions of P t+1 = P t finite-sample case size m: P = (P i ) satisfying P = ( i P i T i ) m limiting case: can show it becomes solution P = (P i ) of P = 1 8 (I 8 8 i P i T i ) 1 where I N N identity and 1 N N matrix with all entries 1 assuming initial distribution 1/8 1 8

17 Multilingual Learners realistically, leaners in a multilingual population do not zoom in on a unique grammar, but become multilingual themselves (even if in previous generation each individual speaks only one language) a more realistic model should take this into account instead of learning algorithm A : D H consider as a map A : D M(H) with M(H) = set of probability measures on H if H has n elements, identify M(H) = n 1 simplex

18 Bilingualism learning algorithm A : D [0, 1] a bilingual speaker (languages L 1 and L 2 ) produces sentences x A according to probability with P i supported on L i A P λ (x) = λp 1 (x) + (1 λ)p 2 (x) for a single speaker a particular value λ is fixed over the entire population it varies according to a distribution P(λ) over [0, 1]

19 probability of a sentence x A being produced, when input comes from entire population P(x) = P 1 (x) P(x) = 1 with expectation value P λ (x) P(λ) dλ λ P(λ) dλ + P 2 (x) 1 0 (1 λ) P(λ) dλ P(x) = E P (λ) P 1 (x) + (1 E P (λ)) P 2 (x) E P (λ) = 1 0 λ P(λ) dλ so input from population with probabilities P λ distributed according to P(λ) same as input from single speaker with probability P EP (λ)

20 learner (next generation) will also acquire bilingualism with a factor λ which is deduced (via the learning algorithm) from the incoming data again subdivide m input data sentences into groups 1 n 1 sentences in L 1 L 2 2 n 2 sentences in L 1 L 2 3 n 3 sentences in L 2 L 1 with n 1 + n 2 + n 3 = m three possible learning algorithms 1 A 1 sets λ = n 1 n 1 +n 3 (ignoring ambiguous sentences) 2 n A 1 sets λ = 1 n 1 +n 2 +n 3 (ambiguous sentences read as in L 2 ) 3 A 3 sets λ = n 1+n 2 n 1 +n 2 +n 3 (ambiguous sentences read as in L 1 )

21 at time/generation t set u t := E P (λ) average over population in that generation recursive relation u t+1 from u t in terms of parameters a = P 1 (L 1 L 2 ) b = P 2 (L 1 L 2 ) different weight put on the ambiguous sentences by the probability distributions P i of the two languages Result: recursion for the three algorithms A 1 : u t+1 = A 2 : u t+1 = u t (1 a) A 3 : u t+1 = u t (1 b) + b u t (1 a) u t (1 a) + (1 u t )(1 b)

22 Explanation case of A 2 : n 1 u t+1 = E( ) = E(n 1) n 1 + n 2 + n 3 m = P 1(L 1 L 2 )E P (λ) = (1 a)u t case of A 3 : n 1 + n 2 u t+1 = E( ) = E(n 1) n 1 + n 2 + n 3 m + E(n 2) m = u t (1 a) + u t a + (1 u t )b

23 case of A 1 : n 1 u t+1 = E( ) = ( ) m α n 1 β n 2 γ n n 3 1 n 1 + n 3 n 1 n 2 n 3 n 1 + n 3 with α = (1 a)u t, β = au t + b(1 u t ), γ = (1 b)(1 u t ) u t+1 = = m k=0 m k=0 n 1 =0 = ( m k k ( k n 1 )( ) m α n 1 β m k γ k n n 1 1 k k ) β m k (1 β) k α 1 β = α 1 β u t (1 a) u t (1 a) + (1 u t )(1 b)

Models of Language Evolution

Models of Language Evolution Models of Matilde Marcolli CS101: Mathematical and Computational Linguistics Winter 2015 Main Reference Partha Niyogi, The computational nature of language learning and evolution, MIT Press, 2006. From