Models of Language Evolution


Models of Language Evolution
Matilde Marcolli
CS101: Mathematical and Computational Linguistics, Winter 2015

Main Reference: Partha Niyogi, The Computational Nature of Language Learning and Evolution, MIT Press, 2006.

From Language Acquisition to Models of Language Evolution
language acquisition is the mechanism behind language transmission (how language gets transmitted to the next generation of learners)
perfect language acquisition would imply perfect language transmission... but evolution requires imperfect transmission
phonological, morphological, syntactic and semantic changes are observed
points of view imported from evolutionary biology: population dynamics, genetic drift, dynamical systems models
language learning at the individual level versus the population level

Learning algorithm: a computable function $A : D \to H$ from primary linguistic data to a space of grammars
1. Grammatical Theory: determines $H$
2. Acquisition Model: determines $A$
Possibilities for change in language transmission:
1. the data $D$ are changed
2. the data $D$ are insufficient
First case: presence of a mixed population of speakers of different languages (not all data in $D$ consistent with the same language)
Second case: after a finite sample $\tau_m$ the algorithm produces a hypothesis $h_m = A(\tau_m)$ at some distance from the target

Toy Model: two competing languages $L_1, L_2 \subset \mathcal{A}$ (the ambient set of sentences) with $L_1 \cap L_2 \neq \emptyset$
sentences in $L_1 \cap L_2$ are ambiguous and can be parsed by both grammars $G_1$ and $G_2$
assume each individual in the population is monolingual
$\alpha_t$ = percentage of the population at time $t$ (number of generations) that speaks $L_1$, and $1 - \alpha_t$ = percentage that speaks $L_2$
$P_i$ = probability distribution for sentences of $L_i$
learning algorithm $A : D \to H = \{G_1, G_2\}$ computable
data are drawn according to a probability distribution $P$ on $\mathcal{A}$

probability that the learning algorithm guesses $L_1$ after $m$ inputs: $p_m = \mathbb{P}(A(\tau_m) = L_1)$
$p_m = p_m(A, P)$ depends on the learning algorithm and on the distribution $P$
if $P$ on $\mathcal{A}$ is supported on $L_1$ (so $P = P_1$):
$$\lim_{m \to \infty} p_m(A, P = P_1) = 1$$
in this case $G_1$ is the target grammar to which the learning algorithm converges

Population Dynamics of the two-languages model
assume a size $K$ of data $\tau_K$ after which the linguistic hypothesis stabilizes (locking set)
with probability $p_K(A, P_1)$ the language acquired will be $L_1$; with probability $1 - p_K(A, P_1)$ it will be $L_2$
so the new generation will have a fraction $p_K(A, P_1)$ of speakers of language $L_1$ and a fraction $1 - p_K(A, P_1)$ of speakers of $L_2$

assume the proportions for a given generation are $\alpha$ and $1 - \alpha$
the following generation of learners will then receive examples generated with probability distribution $P = \alpha P_1 + (1 - \alpha) P_2$
that generation will then result in a population of speakers with distribution $\lambda$ and $1 - \lambda$, where $\lambda = p_K(A, \alpha P_1 + (1 - \alpha) P_2)$
this gives the recursive dependence $\lambda = \lambda(\alpha)$ in language transmission to the following generation

Assumptions made in this model
new learners (the new generation) receive input from the entire community of speakers (the previous generation) in proportion to the language distribution across the population
the probabilities $P_1, P_2$ of drawing sentences in $L_1, L_2$ do not change in time
the learning algorithm constructs a single hypothesis after each input
populations can have unlimited growth

Memoryless Learner with two languages model
Initialize: randomly choose initial hypothesis $G_1$ or $G_2$
Receive input $s_i$: if the current hypothesis parses $s_i$, get a new input; if not, go to the next step
Single-Step Hill Climbing: switch to the other hypothesis (in the space of two languages) and receive a new input
how does the population evolve if this Triggering Learning Algorithm (TLA) is used?

set values (given $P_1$ and $P_2$):
$a = P_1(L_1 \cap L_2)$, $\ 1 - a = P_1(L_1 \setminus L_2)$
$b = P_2(L_1 \cap L_2)$, $\ 1 - b = P_2(L_2 \setminus L_1)$
$a$ and $b$ are the probabilities that users of languages $L_1$ and $L_2$ generate ambiguous sentences
assume a very short maturation time: $K = 2$
Result: the fraction $\alpha_{t+1}$ of the $(t+1)$-st generation satisfies
$$\alpha_{t+1} = A\,\alpha_t^2 + B\,\alpha_t + C$$
$$A = \tfrac{1}{2}\big((1-b)^2 - (1-a)^2\big), \qquad B = b(1-b) + (1-a), \qquad C = \tfrac{b^2}{2}$$
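
A minimal sketch (Python; not from the lecture) of iterating this $K = 2$ recursion; the constants $a, b$ and the coefficients $A, B, C$ are exactly those defined above, while the function names and example parameter values are illustrative assumptions.

```python
# Sketch: iterate the K = 2 memoryless-learner recursion
# alpha_{t+1} = A*alpha_t^2 + B*alpha_t + C, with a, b, A, B, C as defined above.

def quadratic_map(alpha, a, b):
    """One generation of the two-language model with maturation time K = 2."""
    A = 0.5 * ((1 - b) ** 2 - (1 - a) ** 2)
    B = b * (1 - b) + (1 - a)
    C = b ** 2 / 2
    return A * alpha ** 2 + B * alpha + C

def iterate(alpha0, a, b, generations=50):
    """Return the trajectory alpha_0, alpha_1, ..., alpha_{generations}."""
    trajectory = [alpha0]
    for _ in range(generations):
        trajectory.append(quadratic_map(trajectory[-1], a, b))
    return trajectory

if __name__ == "__main__":
    # Illustrative values with a < b: most of the population ends up speaking L_1.
    print(iterate(0.1, a=0.2, b=0.6)[-1])
```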

Explanation: start with a proportion $\alpha_t$ of $L_1$-users
compute the probability of a learner acquiring $L_1$ in two steps ($K = 2$)
probabilities for a random example:
in $L_1 \setminus L_2$ with probability $\alpha_t(1-a)$
in $L_1 \cap L_2$ with probability $\alpha_t a + (1-\alpha_t)b$
in $L_2 \setminus L_1$ with probability $(1-\alpha_t)(1-b)$
also probability $1/2$ of choosing $L_1$ as the initial hypothesis

if started with $L_1$, to have $L_1$ after two steps: either $L_1$ is retained in both steps, or the learner switches from $L_1$ to $L_2$ at the first step and back from $L_2$ to $L_1$ at the second step
the first case happens with probability $(\alpha_t + (1-\alpha_t)b)^2$
the second case happens with probability $\alpha_t(1-a)\,(1-\alpha_t)(1-b)$

if started with $L_2$, to have $L_1$ in two steps: either switch to $L_1$ at the first step and retain $L_1$ at the second, or retain $L_2$ at the first and switch to $L_1$ at the second
the first case happens with probability $\alpha_t(1-a)\,(\alpha_t + (1-\alpha_t)b)$
the second case happens with probability $\big((1-\alpha_t) + \alpha_t a\big)\,\alpha_t(1-a)$
putting all these possibilities together, each weighted by the probability $1/2$ of the corresponding initial hypothesis, gives the quadratic recursion above
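
As a sanity check on this counting, one can simulate the two-step memoryless learner directly and compare the empirical acquisition probability with the quadratic formula above; this is an illustrative Python sketch, with names and parameter values chosen for the example rather than taken from the lecture.

```python
import random

def simulate_learner(alpha, a, b, K=2):
    """Simulate one memoryless (TLA) learner on K examples; return True if it ends on L_1."""
    hyp = random.choice([1, 2])  # random initial hypothesis G_1 or G_2
    for _ in range(K):
        r = random.random()
        if r < alpha * (1 - a):                                    # sentence in L_1 \ L_2
            region = 1
        elif r < alpha * (1 - a) + alpha * a + (1 - alpha) * b:    # ambiguous sentence
            region = 12
        else:                                                      # sentence in L_2 \ L_1
            region = 2
        # switch only if the current hypothesis cannot parse the sentence
        if hyp == 1 and region == 2:
            hyp = 2
        elif hyp == 2 and region == 1:
            hyp = 1
    return hyp == 1

alpha, a, b = 0.3, 0.2, 0.6
trials = 200_000
empirical = sum(simulate_learner(alpha, a, b) for _ in range(trials)) / trials
A = 0.5 * ((1 - b) ** 2 - (1 - a) ** 2)
B = b * (1 - b) + (1 - a)
C = b ** 2 / 2
print(empirical, A * alpha ** 2 + B * alpha + C)  # the two values should agree closely
```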

Long term behavior
if $a = b$: simple exponential growth
for $a \neq b$: behavior similar to the logistic map, which in particular has a regime with chaotic behavior
the chaotic regime is avoided here because of the constraints $a, b \leq 1$
the fact that the recursion is a quadratic function reflects the choice $K = 2$; for higher values of $K$ one would get higher order polynomials

Result: for an arbitrary $K$
$$\alpha_{t+1} = \frac{B + \tfrac{1}{2}(A - B)(1 - A - B)^K}{A + B}$$
$$A = (1 - \alpha_t)(1 - b), \qquad B = \alpha_t(1 - a)$$
Explanation: Markov chain with two states describing the TLA
transition matrix $T$ with $T_{12} = (1 - \alpha_t)(1 - b) = A$, $T_{21} = \alpha_t(1 - a) = B$, and $T_{11} = 1 - T_{12}$, $T_{22} = 1 - T_{21}$
after $m$ examples the state distribution is moved by the transition matrix $T^m$

the probability of acquiring language $L_1$ after $m$ examples is $\tfrac{1}{2}(T^m_{11} + T^m_{21})$
recursively $T^m = T^{m-1}\,T$, so $T^m_{11} = (1 - A)\,T^{m-1}_{11} + B\,T^{m-1}_{12}$
solving the recursion:
$$T^m_{11} = \frac{B}{A + B} + \frac{A(1 - A - B)^m}{A + B}$$
similarly one obtains inductively
$$T^m_{21} = \frac{B}{A + B} - \frac{B(1 - A - B)^m}{A + B}$$
putting these together at $m = K$ gives the succession rule above
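
A short sketch (Python, assuming numpy is available) comparing the closed-form expression for $f_{a,b,K}$ with a direct computation via powers of the $2 \times 2$ transition matrix; function names and the example values are illustrative.

```python
import numpy as np

def f_closed_form(alpha, a, b, K):
    """Closed form (B + (A - B)(1 - A - B)^K / 2) / (A + B), as above."""
    A = (1 - alpha) * (1 - b)
    B = alpha * (1 - a)
    return (B + 0.5 * (A - B) * (1 - A - B) ** K) / (A + B)

def f_markov(alpha, a, b, K):
    """Probability of ending on L_1 after K examples, via the 2-state Markov chain."""
    A = (1 - alpha) * (1 - b)   # T_12: switch L_1 -> L_2
    B = alpha * (1 - a)         # T_21: switch L_2 -> L_1
    T = np.array([[1 - A, A],
                  [B, 1 - B]])
    TK = np.linalg.matrix_power(T, K)
    return 0.5 * (TK[0, 0] + TK[1, 0])  # uniform random initial hypothesis

print(f_closed_form(0.3, 0.2, 0.6, K=5), f_markov(0.3, 0.2, 0.6, K=5))  # should coincide
```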

Population behavior in the model
the function $f(\alpha) = f_{a,b,K}(\alpha)$:
$$f_{a,b,K}(\alpha) = \frac{B(\alpha) + \tfrac{1}{2}\big(A(\alpha) - B(\alpha)\big)\big(1 - A(\alpha) - B(\alpha)\big)^K}{A(\alpha) + B(\alpha)}$$
$$A(\alpha) = (1 - \alpha)(1 - b), \qquad B(\alpha) = \alpha(1 - a)$$
only one stable fixed point in the interval $\alpha \in [0, 1]$:
$f(0) = b^K/2$ and $f(1) = 1 - a^K/2$, $f$ continuous; one finds only one $\alpha^* = f(\alpha^*)$ and can check that at that point $f'(\alpha^*) < 1$
if $a = b = 1/2$ the fixed point is at $\alpha = 1/2$ (the population converges to this mix from all initial conditions):
$$f_{\frac{1}{2},\frac{1}{2},K}(\alpha) = \alpha\,(1 - b^K) + \frac{b^K}{2}, \qquad b = \tfrac{1}{2}$$

if $a \neq b$ with $a > b$: the fixed point is close to $\alpha = 0$, so most of the population speaks $L_2$
if $a \neq b$ with $a < b$: the fixed point is close to $\alpha = 1$, so most of the population speaks $L_1$
the transition of the fixed point from a value close to zero to a value close to one is very sharp for small values of $a, b$, and more gradual for larger values of $a, b$ (close to one)

Limiting behavior when $K \to \infty$
limiting function and recursion:
$$f_{a,b,\infty}(\alpha) = \frac{\alpha(1 - a)}{\alpha(1 - a) + (1 - \alpha)(1 - b)}, \qquad f'_{a,b,\infty}(\alpha) = \frac{(1 - a)(1 - b)}{\big((1 - b) + \alpha(b - a)\big)^2}$$
$$\alpha_{t+1} = f_{a,b,\infty}(\alpha_t)$$
if $a = b$ one just has $\alpha_{t+1} = \alpha_t$: the population distribution is preserved, no change
fixed point behavior: if $a > b$ there are two fixed points $\alpha = f_{a,b,\infty}(\alpha)$, at $\alpha = 0$ (stable, since $f'(0) = \frac{1-a}{1-b} < 1$) and $\alpha = 1$ (unstable); if $a < b$ the same two fixed points but with switched stability
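
The stability claims can be checked numerically from the derivative formula above; a small sketch (Python; the example values of $a, b$ are illustrative assumptions):

```python
def f_inf(alpha, a, b):
    """Limiting map f_{a,b,infinity}."""
    num = alpha * (1 - a)
    return num / (num + (1 - alpha) * (1 - b))

def f_inf_prime(alpha, a, b):
    """Derivative of the limiting map."""
    return (1 - a) * (1 - b) / ((1 - b) + alpha * (b - a)) ** 2

# With a > b: derivative < 1 at alpha = 0 (stable) and > 1 at alpha = 1 (unstable).
a, b = 0.7, 0.3
print(f_inf_prime(0.0, a, b), f_inf_prime(1.0, a, b))  # approx. 0.43 and 2.33
```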

Batch Error-Based Learner in the two-languages model
replace the memoryless learner (trigger learning algorithm, TLA) with a batch error-based learner: it waits until the whole set of $K$ samples $\tau_K = (s_1, \ldots, s_K)$ is collected before choosing a hypothesis, then picks the one that best fits the entire set
for each $L_i$ the error measure is $e(L_i) = k_i/K$, with $k_i$ = number of sentences in $\tau_K$ that cannot be parsed by $L_i$
the hypothesis is then chosen as $A(\tau_K) = \arg\min_i e(L_i)$

Procedure
group together the sentences in $\tau_K = (s_1, \ldots, s_K)$ into
1. $n_1$ sentences in $L_1 \setminus L_2$
2. $n_2$ sentences in $L_1 \cap L_2$
3. $n_3$ sentences in $L_2 \setminus L_1$
with $n_1 + n_2 + n_3 = K$
choose $L_1$ if $n_1 > n_3$; choose $L_2$ if $n_3 > n_1$; if $n_1 = n_3$ use a deterministic or randomized way of choosing either $L_i$
Example: choose $L_1$ if $n_1 \geq n_3$

Result: population dynamics $\alpha_{t+1} = f_{a,b,K}(\alpha_t)$ with
$$f_{a,b,K}(\alpha) = \sum \binom{K}{n_1\, n_2\, n_3}\, p_1(\alpha)^{n_1}\, p_2(\alpha)^{n_2}\, p_3(\alpha)^{n_3}$$
with the sum over $(n_1, n_2, n_3) \in \mathbb{Z}^3_+$ with $n_1 + n_2 + n_3 = K$ and $n_1 \geq n_3$, and
$$p_1(\alpha) = \alpha(1 - a), \qquad p_2(\alpha) = \alpha a + (1 - \alpha) b, \qquad p_3(\alpha) = (1 - \alpha)(1 - b)$$
Properties of the Dynamics
$b = 1 \Rightarrow p_3(\alpha) = 0$; $\ a = 1 \Rightarrow p_1(\alpha) = 0$
indeed $1 - a = P_1(L_1 \setminus L_2)$ and $1 - b = P_2(L_2 \setminus L_1)$, so $b = 1$ implies $n_3 = 0$ and $a = 1$ implies $n_1 = 0$
so for $b = 1$ always $n_1 \geq n_3$, hence $L_1$ is always chosen
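
A minimal sketch (Python) of one generation of this batch-learner dynamics, summing the multinomial probabilities over the triples with $n_1 \geq n_3$; the function name and example parameters are illustrative.

```python
from math import comb

def f_batch(alpha, a, b, K):
    """Batch error-based learner: probability of choosing L_1 (ties broken in favor of L_1)."""
    p1 = alpha * (1 - a)              # sentence in L_1 \ L_2
    p2 = alpha * a + (1 - alpha) * b  # ambiguous sentence
    p3 = (1 - alpha) * (1 - b)        # sentence in L_2 \ L_1
    total = 0.0
    for n1 in range(K + 1):
        for n3 in range(K - n1 + 1):
            if n1 >= n3:              # tie n1 == n3 counted as choosing L_1
                n2 = K - n1 - n3
                coeff = comb(K, n1) * comb(K - n1, n3)  # multinomial coefficient K!/(n1! n2! n3!)
                total += coeff * p1 ** n1 * p2 ** n2 * p3 ** n3
    return total

# One generation of the population dynamics: alpha_{t+1} = f_batch(alpha_t, a, b, K)
print(f_batch(0.4, a=0.3, b=0.5, K=10))
```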

for $a = 1$ one has $n_1 \geq n_3$ only when $n_3 = 0$, so
$$f_{1,b,K}(\alpha) = \big(1 - (1 - \alpha)(1 - b)\big)^K$$
then $\alpha = 0$ is not a fixed point but $\alpha = 1$ is fixed
the stability of the $\alpha = 1$ fixed point depends on $K$ and $b$: it is stable iff $b > 1 - \frac{1}{K}$
when it passes to unstable, a bifurcation occurs and a new (stable) fixed point appears in the interior of the interval $(0, 1)$
when $a \neq 1$ and $b \neq 1$: for most parameter values $\alpha = 1$ is a stable fixed point, and there are two fixed points $\alpha_1 < \alpha_2$ in $(0, 1)$, the first stable and the second unstable

Asymptotic Behavior when $K \to \infty$
assume $K \to \infty$ with $\frac{n_1}{K} \to p_1$ and $\frac{n_3}{K} \to p_3$
then if $p_1 > p_3$, i.e. $\alpha(1 - a) > (1 - \alpha)(1 - b)$, one has $\alpha_t \to 1$
when $K = \infty$ the points $\alpha = 0$ and $\alpha = 1$ are stable fixed points, with an unstable fixed point at
$$\alpha = \frac{1 - b}{(1 - b) + (1 - a)}$$
Note: the asymmetry of behavior when $n_1 = n_3$ (choosing $L_1$) becomes less and less noticeable in the large $K$ limit

Cue-Based Learner in the two-languages model
the learner examines the data for indications of how to set linguistic parameters
a set $C \subset L_1 \setminus L_2$ of examples that are cues to the target being $L_1$
if elements from $C$ occur sufficiently frequently in the data the learning algorithm chooses $L_1$, if not it chooses $L_2$
the learner receives $K$ samples of input $\tau_K = (s_1, \ldots, s_K)$; $k/K$ = fraction of the input that is in the cue set
probability that a user of language $L_1$ produces a cue: $p = P_1(C)$
probability that the learner receives a cue as input: $\alpha_t\, p$
threshold $t$: the condition $k/K > t$ is achieved with probability
$$\sum_{K t \le i \le K} \binom{K}{i} (p\alpha_t)^i (1 - p\alpha_t)^{K - i}$$

Population Dynamics with the cue-based learner
recursion relation for the fractions of the population speaking the two languages:
$$\alpha_{t+1} = f_{p,K}(\alpha_t) = \sum_{K t \le i \le K} \binom{K}{i} (p\alpha_t)^i (1 - p\alpha_t)^{K - i}$$
when $p = 0$ cues are never produced: the only stable equilibrium is $\alpha = 0$ (reached in one step)
for $p$ small, $\alpha = 0$ stays the unique stable fixed point
as $p$ increases a bifurcation occurs: two new fixed points arise, $\alpha_1 < \alpha_2$; $\alpha = 0$ remains stable, $\alpha_1$ is unstable, $\alpha_2$ is stable
at $p = 1$ the stable fixed points are $\alpha = 0$ and $\alpha = 1$, with one unstable fixed point in between
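
A rough sketch (Python) of this cue-based recursion and of the bifurcation in $p$: iterating from several initial conditions shows a single equilibrium at $\alpha = 0$ for small $p$ and bistability for larger $p$. The choices $K = 20$, $t = 0.3$ and the use of the smallest integer $\geq Kt$ as the summation cutoff are illustrative assumptions.

```python
from math import comb, ceil

def f_cue(alpha, p, K, t):
    """Cue-based learner: probability that at least a fraction t of the K inputs are cues."""
    q = p * alpha  # probability that a single input is a cue
    return sum(comb(K, i) * q ** i * (1 - q) ** (K - i)
               for i in range(ceil(K * t), K + 1))

def long_run(alpha0, p, K=20, t=0.3, generations=200):
    """Iterate alpha_{t+1} = f_cue(alpha_t) and return the long-run value."""
    alpha = alpha0
    for _ in range(generations):
        alpha = f_cue(alpha, p, K, t)
    return alpha

# Small p: every initial condition collapses to alpha = 0;
# larger p: a second stable equilibrium appears at high alpha.
for p in (0.2, 0.6, 1.0):
    print(p, [round(long_run(a0, p), 3) for a0 in (0.1, 0.5, 0.9)])
```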

Fixed Point Analysis (more details)
fixed points $\alpha = f(\alpha)$ of the function
$$f_{p,K}(\alpha) = \sum_{K t \le i \le K} \binom{K}{i} (p\alpha)^i (1 - p\alpha)^{K - i}$$
for all $p$ and $K$ one has $f_{p,K}(0) = 0$; for stability check $f'_{p,K}(0) < 1$:
$f'_{p,K}(\alpha) = p\, F'(p\alpha)$, with $f_{p,K}(\alpha) = F(p\alpha)$ and
$$F(\alpha) = \sum_{K t \le i \le K} \binom{K}{i} \alpha^i (1 - \alpha)^{K - i}$$

differentiating term by term gives $F'(\alpha)$:
$$\sum_{K_t \le k \le K-1} \binom{K}{k} \Big(k\,\alpha^{k-1}(1 - \alpha)^{K-k} - (K - k)\,\alpha^{k}(1 - \alpha)^{K-k-1}\Big) + K\alpha^{K-1}$$
with $K_t$ the smallest integer larger than $K t$
expanding and grouping terms:
$$F'(\alpha) = K \sum_{K_t \le k \le K-1} \frac{(K-1)!}{(K-k)!\,(k-1)!}\, \alpha^{k-1}(1 - \alpha)^{K-k} \;-\; K \sum_{K_t \le k \le K-1} \frac{(K-1)!}{k!\,(K-k-1)!}\, \alpha^{k}(1 - \alpha)^{K-k-1} \;+\; K\alpha^{K-1}$$

the cancellations leave
$$F'(\alpha) = K \binom{K - 1}{K_t - 1}\, \alpha^{K_t - 1} (1 - \alpha)^{K - K_t}$$
$f'_{p,K}(0) = p\,F'(0) = 0$, hence the stability of $\alpha = 0$
$f_{p,K}(1) = F(p) < 1$ (for $p < 1$), so $\alpha = 1$ is not a fixed point
since $f_{p,K}(0) = 0$ with $f'_{p,K}(0) = 0$ and $f_{p,K}$ continuous: the graph of $f_{p,K}$ crosses the diagonal an even number of times in $(0, 1]$
if there are $2m$ such points $\alpha_1, \ldots, \alpha_{2m}$, the slope $f'_{p,K}(\alpha_j)$ alternates between larger and smaller than $1$ (the slope of the diagonal)
in each interval $(\alpha_{2j-1}, \alpha_{2j+1})$ the derivative $f'_{p,K}$ changes from larger to smaller to larger than $1$, so $f'_{p,K}(\alpha) - 1$ changes sign twice and the second derivative $f''_{p,K}$ has a zero there; the same holds for every interval $(\alpha_{2j-2}, \alpha_{2j})$
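
The closed form for $F'(\alpha)$ obtained after the cancellations can be checked against a numerical derivative of $F$; a small sketch (Python), where the values of $K$, $t$ and the reading of $K_t$ as the smallest integer $\geq Kt$ are illustrative assumptions.

```python
from math import comb, ceil

K, t = 15, 0.4
Kt = ceil(K * t)  # threshold index K_t used in the sums above

def F(alpha):
    """Binomial tail sum F(alpha) = sum_{i >= K_t} C(K, i) alpha^i (1 - alpha)^(K - i)."""
    return sum(comb(K, i) * alpha ** i * (1 - alpha) ** (K - i) for i in range(Kt, K + 1))

def F_prime_closed(alpha):
    """Closed form after telescoping: K * C(K-1, K_t-1) * alpha^(K_t-1) * (1-alpha)^(K-K_t)."""
    return K * comb(K - 1, Kt - 1) * alpha ** (Kt - 1) * (1 - alpha) ** (K - Kt)

alpha, h = 0.37, 1e-6
numeric = (F(alpha + h) - F(alpha - h)) / (2 * h)  # central finite difference
print(numeric, F_prime_closed(alpha))              # should agree to several digits
```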

the second derivative is $f''_{p,K}(\alpha) = p^2\, F''(p\alpha)$
one shows that $f''_{p,K}$ vanishes at most once in $(0, 1)$, hence (by the argument above) there are at most two fixed points in $(0, 1]$
in fact one has
$$F''(\alpha) = K \binom{K - 1}{K_t - 1}\, \alpha^{K_t - 2} (1 - \alpha)^{K - K_t - 1} \big(K_t - 1 - (K - 1)\alpha\big)$$
Limiting Behavior for $K \to \infty$ (with $k/K \to p\alpha$):
if $p\alpha < t$ all learners choose $L_2$; if $p\alpha > t$ all learners choose $L_1$
for $p < t$ (hence $p\alpha < t$ for all $\alpha \in [0, 1]$) the only stable fixed point is $\alpha = 0$
for $p > t$ there are two stable fixed points $\alpha = 0$ and $\alpha = 1$, with basins of attraction $\alpha_0 \in [0, t/p)$ and $\alpha_0 \in (t/p, 1]$
in this model a change from $L_1$ to $L_2$ (or vice versa) is achieved by moving $p$ across the threshold