Language Acquisition and Parameters: Part II
Matilde Marcolli
CS101: Mathematical and Computational Linguistics
Winter 2015
Transition Matrices in the Markov Chain Model

- absorbing states correspond to local maxima (a unique absorbing state at the target gives learnability)
- probability matrix of a Markov Chain: T = (T_{ij}) with T_{ij} = probability P(s_i → s_j) of moving from state s_i to state s_j
- absorbing states: rows of T with only one entry 1 and all other entries 0
- powers T^m of the probability matrix: entry (T^m)_{ij} = probability of going from state s_i to state s_j in exactly m steps
- probabilities of reaching s_j in the limit, with initial state s_i: T^∞ = lim_{m→∞} T^m
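As a quick numerical illustration of these definitions, the sketch below builds a small absorbing chain and approximates T^∞ by a large matrix power (the 3-state chain and its entries are toy stand-ins, not the parameter-space example):

    import numpy as np

    # Toy 3-state chain: state 0 is absorbing (its row has a single
    # entry 1 and all other entries 0); every row sums to 1.
    T = np.array([
        [1.0, 0.0, 0.0],
        [0.3, 0.5, 0.2],
        [0.0, 0.4, 0.6],
    ])

    # (T^m)_{ij} = probability of going from s_i to s_j in m steps
    print((T @ T)[1, 0])                             # from s_1 to s_0 in 2 steps
    print(np.linalg.matrix_power(T, 1000).round(6))  # numerical stand-in for T^infinity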
Transition Matrix for the 3-parameter example

[8 × 8 stochastic matrix T, states s_1, ..., s_8, target s_5: the rows of s_2 and s_5 each have a single entry 1 and all other entries 0, so both states are absorbing]

local maxima problem: two absorbing states (the false target s_2 as well as the true target s_5)
Limiting probabilities matrix for the 3-parameter example:

T^∞ =
( 0  2/5  0  0  3/5  0  0  0 )
( 0   1   0  0   0   0  0  0 )
( 0  2/5  0  0  3/5  0  0  0 )
( 0   1   0  0   0   0  0  0 )
( 0   0   0  0   1   0  0  0 )
( 0   0   0  0   1   0  0  0 )
( 0   0   0  0   1   0  0  0 )
( 0   0   0  0   1   0  0  0 )

- for learnability irrespective of the initial state one would need a column of 1s at the target state
- here if starting at s_2 or s_4 the learner ends up at s_2 (local maximum) instead of the target s_5; initial states s_5, s_6, s_7, s_8 converge to the correct target s_5
- starting at s_1 or s_3 the learner reaches the true target s_5 with probability 3/5 and the false target s_2 with probability 2/5
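For absorbing chains the limit can also be computed exactly, without taking large powers, by solving a linear system for the absorption probabilities (the standard fundamental-matrix formula B = (I − Q)^{-1} R). A minimal sketch, with a hypothetical 4-state chain standing in for the 8-state example:

    import numpy as np

    def limiting_matrix(T, absorbing):
        """T^infinity for an absorbing chain: T row-stochastic,
        absorbing = indices of the absorbing states."""
        n = T.shape[0]
        transient = [i for i in range(n) if i not in set(absorbing)]
        Q = T[np.ix_(transient, transient)]   # transient -> transient block
        R = T[np.ix_(transient, absorbing)]   # transient -> absorbing block
        B = np.linalg.solve(np.eye(len(transient)) - Q, R)  # absorption probabilities
        Tinf = np.zeros_like(T)
        for a in absorbing:
            Tinf[a, a] = 1.0
        for row, i in enumerate(transient):
            Tinf[i, absorbing] = B[row]
        return Tinf

    # hypothetical chain with two absorbing states (indices 1 and 3),
    # mimicking a false target and a true target:
    T = np.array([
        [0.5, 0.2, 0.2, 0.1],
        [0.0, 1.0, 0.0, 0.0],
        [0.3, 0.0, 0.4, 0.3],
        [0.0, 0.0, 0.0, 1.0],
    ])
    print(limiting_matrix(T, [1, 3]))  # learnable from everywhere iff the
                                       # target column consists entirely of 1s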
- the last case in the example shows that there are initial states for which there is a triggered path to the target, but the learner does not take that path with probability 1, only with a smaller probability
- if one takes the same 3-parameter model but with target state s_1, the transition matrix becomes

[8 × 8 stochastic matrix T for target s_1, with nonzero entries including 5/6, 2/3, 23/36, 8/9, 7/8]

- then there are no other local maxima (s_1 is the unique absorbing state) and T^∞ has a first column of 1s: the learner converges to the target from every initial state
Eigenvalues and eigenvectors of transition matrices

- matrix T is stochastic: T_{ij} ≥ 0 and Σ_j T_{ij} = 1 for all i
- Perron–Frobenius theorem: if T is irreducible (for every pair i, j some power T^m has (T^m)_{ij} > 0) then
- spectral radius ρ(T) = 1 = PF eigenvalue
- PF eigenvalue is simple
- PF eigenvector with all entries positive; for a stochastic matrix the right PF eigenvector is uniform, w_i = 1 (since the rows sum to 1), and the left PF eigenvector is the stationary distribution
- period h = gcd{m : (T^m)_{ii} > 0} = number of eigenvalues with |λ| = 1
- but the irreducibility condition means the graph is strongly connected: every vertex is reachable from every other vertex... in general this does not happen with these Markov chains: in general T is not irreducible (e.g. absorbing states break irreducibility)
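These spectral statements are easy to verify numerically; a sketch with a toy irreducible stochastic matrix (assumed numbers):

    import numpy as np

    # toy irreducible chain: every state reachable from every other
    T = np.array([
        [0.1, 0.6, 0.3],
        [0.4, 0.2, 0.4],
        [0.5, 0.5, 0.0],
    ])

    lam = np.linalg.eigvals(T)
    print(np.max(np.abs(lam)))   # spectral radius rho(T) = 1 (the PF eigenvalue)
    print(T @ np.ones(3))        # uniform vector is fixed: T 1 = 1 (right PF eigenvector)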
non-irreducible transition matrices T of a Markov Chain

- λ = 1 is always an eigenvalue; all other eigenvalues have |λ| ≤ 1
- the multiplicity of λ = 1 is the number of closed classes C_i in the decomposition of the Markov Chain
- if T has a basis of linearly independent left eigenvectors v_i with v_i T = λ_i v_i (and right eigenvectors w_i with T w_i = λ_i w_i, normalized so that v_i w_j = δ_{ij}), then

T^m = Σ_i λ_i^m w_i v_i

a linear combination of the matrices w_i v_i (independent of m) with coefficients λ_i^m
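The decomposition can be checked numerically with an eigendecomposition: the columns of W below are right eigenvectors, and the rows of W^{-1} are the dual left eigenvectors, automatically normalized so that v_i w_j = δ_{ij} (a sketch assuming T is diagonalizable; same toy chain as in the earlier sketch):

    import numpy as np

    T = np.array([
        [1.0, 0.0, 0.0],
        [0.3, 0.5, 0.2],
        [0.0, 0.4, 0.6],
    ])

    lam, W = np.linalg.eig(T)   # T W = W diag(lam): columns w_i
    V = np.linalg.inv(W)        # rows v_i satisfy v_i T = lam_i v_i

    m = 10
    # T^m as a combination of the outer products w_i v_i with coefficients lam_i^m
    Tm = sum(lam[i]**m * np.outer(W[:, i], V[i, :]) for i in range(len(lam)))
    print(np.allclose(Tm, np.linalg.matrix_power(T, m)))   # True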
- initial probabilities π_i ≥ 0, with Σ_i π_i = 1, of having state s_i as initial state
- after m steps: π^(m)_i = Σ_j π_j (T^m)_{ji}
- limiting distribution: π^(∞)_i = Σ_j π_j (T^∞)_{ji} = lim_{m→∞} Σ_j π_j (T^m)_{ji} = probability of the learner approaching state s_i in the limit
- if the target state (say s_1) is learnable, then π^(∞)_1 = 1 and π^(∞)_i = 0 for i ≠ 1
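In the same toy chain, the evolution of an initial distribution toward its limit can be watched directly (a sketch; the initial probabilities are arbitrary assumed numbers):

    import numpy as np

    T = np.array([
        [1.0, 0.0, 0.0],
        [0.3, 0.5, 0.2],
        [0.0, 0.4, 0.6],
    ])

    pi = np.array([0.2, 0.5, 0.3])   # pi_i >= 0 with sum_i pi_i = 1
    for m in (1, 10, 100):
        print(m, pi @ np.linalg.matrix_power(T, m))   # pi^(m) = pi T^m
    # state 0 is the unique absorbing state, so pi^(inf) = (1, 0, 0)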
Rate of convergence

- the rate of convergence of π^(m)_i to π^(∞)_i is the rate of convergence of T^m to T^∞, which is the rate of convergence of λ_i^m → 0 for the eigenvalues with |λ_i| < 1

||π^(m) − π^(∞)|| = ||Σ_{i≥2} λ_i^m π w_i v_i|| ≤ max_{i≥2}{|λ_i|^m} Σ_{i≥2} ||π w_i v_i||

- estimate: the rate of decay of the second largest eigenvalue
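The bound suggests measuring the error against |λ_2|^m; in the toy chain the ratio indeed stabilizes (a numerical sketch):

    import numpy as np

    T = np.array([
        [1.0, 0.0, 0.0],
        [0.3, 0.5, 0.2],
        [0.0, 0.4, 0.6],
    ])
    pi = np.array([0.2, 0.5, 0.3])
    pi_inf = np.array([1.0, 0.0, 0.0])

    lam2 = np.sort(np.abs(np.linalg.eigvals(T)))[-2]   # second largest |lambda|
    for m in (5, 10, 20, 40):
        err = np.linalg.norm(pi @ np.linalg.matrix_power(T, m) - pi_inf)
        print(m, err, err / lam2**m)   # ratio ~ constant: err decays like |lambda_2|^m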
Summary

1. construct the Markov Chain for the parameter space
2. compute the transition matrix T
3. compute the eigenvalues
4. if the multiplicity of the eigenvalue λ = 1 is more than one: the target is unlearnable (local maxima problem)
5. if the multiplicity is one, check whether there is a basis of linearly independent eigenvectors
6. if yes, find the rate of decay of the second largest eigenvalue: learnability at that speed
7. if not, project onto subspaces of lower dimension
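Steps 3, 4 and 6 of this recipe are mechanical enough to script; a sketch (the function name is hypothetical, and a numerical tolerance stands in for exact eigenvalue multiplicity):

    import numpy as np

    def learnability_report(T, tol=1e-9):
        """Steps 3-6: eigenvalue at 1, its multiplicity, and the decay rate."""
        lam = np.linalg.eigvals(T)
        mult_one = int(np.sum(np.abs(lam - 1.0) < tol))  # multiplicity of lambda = 1
        if mult_one > 1:
            return "unlearnable: local maxima problem (multiple closed classes)"
        second = np.sort(np.abs(lam))[-2]                # second largest |lambda|
        return f"learnable; convergence rate governed by |lambda_2| = {second:.4f}"

    T = np.array([
        [1.0, 0.0, 0.0],
        [0.3, 0.5, 0.2],
        [0.0, 0.4, 0.6],
    ])
    print(learnability_report(T))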
Markov Chains and Learning Algorithms

- how broad is the Markov Chain method in modeling learning algorithms?
- suppose given an algorithm A : D → H; hypotheses h_n = G and h_{n+1} = G′
- the probability of passing from A(τ_n) = h_n = G to A(τ_{n+1}) = h_{n+1} = G′ at the (n+1)-st input (τ_n = the first n inputs of the text τ) is the measure of the set

A_{n,G} ∩ A_{n+1,G′} = {τ : A(τ_n) = G} ∩ {τ : A(τ_{n+1}) = G′}

- the measure is µ̄ on the space A^ω of texts, determined by the measure µ on A (supported on the target language L: positive examples only)

P(h_{n+1} = G′ | h_n = G) = µ̄(A_{n,G} ∩ A_{n+1,G′}) / µ̄(A_{n,G})

assuming µ̄(A_{n,G}) > 0
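In simulation, these conditional probabilities can be estimated by running the algorithm on many sampled texts. A Monte Carlo sketch, where the learner A_step, the three sentences, and the measure µ are all hypothetical stand-ins:

    import numpy as np

    rng = np.random.default_rng(0)
    SENTENCES = [0, 1, 2]        # stand-in for sentences of the target language L
    PROBS = [0.5, 0.3, 0.2]      # measure mu on single sentences

    def A_step(h, x):
        """Toy deterministic learner: only moves to a larger-indexed hypothesis."""
        return x if x > h else h

    G, G2 = 1, 2                 # estimate P(h_2 = G2 | h_1 = G)
    num = den = 0
    for _ in range(100_000):
        tau = rng.choice(SENTENCES, size=2, p=PROBS)
        h1 = A_step(0, tau[0])   # hypothesis after the first input
        h2 = A_step(h1, tau[1])  # hypothesis after the second input
        if h1 == G:
            den += 1
            num += (h2 == G2)
    print(num / den)   # ~ mu(A_{1,G} and A_{2,G2}) / mu(A_{1,G}), about 0.2 here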
Inhomogeneous Markov Chain

- state space = set of possible grammars H = set of possible (binary) syntactic parameter settings
- transition matrix at the n-th step:

T_n(s, s′) = P(s → s′) = P(A(τ_{n+1}) = G_{s′} | A(τ_n) = G_s) = µ̄(A_{n,G_s} ∩ A_{n+1,G_{s′}}) / µ̄(A_{n,G_s})

- these satisfy Σ_{s′} T_n(s, s′) = 1 for all s
- to define T_n(s, s′) also when µ̄(A_{n,G_s}) = 0, take a set of α_{s′} > 0 with Σ_{s′} α_{s′} = 1 and set T_n(s, s′) = α_{s′} when µ̄(A_{n,G_s}) = 0
- the transition matrix T = T_n is time dependent
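For an inhomogeneous chain the m-step distribution is governed by the ordered product T_1 T_2 · · · T_m rather than by a power of a single matrix; a small sketch with hypothetical time-dependent matrices:

    import numpy as np

    def T_n(n):
        """Hypothetical n-dependent transition matrix on two states."""
        eps = 1.0 / (n + 2)
        return np.array([
            [1.0 - eps, eps],
            [eps, 1.0 - eps],
        ])

    pi = np.array([1.0, 0.0])
    for n in range(1, 51):
        pi = pi @ T_n(n)   # pi^(m) = pi^(0) T_1 T_2 ... T_m, not a power of one T
    print(pi)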
Conclusion: any deterministic learning algorithm A : D → H can be modeled by an inhomogeneous Markov Chain

- the inhomogeneous Markov Chain depends on A, on the target language L_G, and on the measure µ
- memoryless learner hypothesis: A(τ_{n+1}) = F(A(τ_n), τ(n+1)); then

T_n(s, s′) = T(s, s′) = µ({x ∈ A : F(s, x) = s′})

independent of n, so the chain is homogeneous

- memory limited learning algorithms: m-memory limited if A(τ_n) only depends on the last m sentences of the text τ and on the previous grammatical hypothesis
- Fact: if H is learnable by a memory limited algorithm, then it is in fact learnable by a memoryless algorithm
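Under the memoryless hypothesis the whole transition matrix can be tabulated directly from F and µ, and it no longer depends on n. A sketch with a hypothetical parse relation (which sentences each hypothesis can parse) driving a TLA-like update:

    import numpy as np

    STATES = [0, 1, 2]                    # hypotheses s
    SENTENCES = [0, 1, 2]
    PROBS = [0.5, 0.3, 0.2]               # measure mu on single sentences
    PARSES = {0: {0}, 1: {0, 1}, 2: {2}}  # hypothetical: sentences parsed by each s

    def F(s, x):
        """Memoryless update: keep s if it parses x, else adopt hypothesis x."""
        return s if x in PARSES[s] else x

    # T(s, s') = mu({x : F(s, x) = s'}): one fixed (homogeneous) transition matrix
    T = np.zeros((len(STATES), len(STATES)))
    for s in STATES:
        for x, p in zip(SENTENCES, PROBS):
            T[s, F(s, x)] += p
    print(T)   # each row sums to 1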