Language Acquisition and Parameters: Part II


Language Acquisition and Parameters: Part II
Matilde Marcolli, CS101: Mathematical and Computational Linguistics, Winter 2015

Transition Matrices in the Markov Chain Model
- absorbing states correspond to local maxima (a unique absorbing state at the target gives learnability)
- probability matrix of a Markov chain: T = (T_ij), with T_ij = probability P(s_i → s_j) of moving from state s_i to state s_j
- absorbing states: rows of T with a single entry 1 and all other entries 0
- powers T^m of the probability matrix: entry (T^m)_ij = probability of going from state s_i to state s_j in exactly m steps
- probabilities of reaching s_j in the limit, with initial state s_i: T^∞ = lim_{m→∞} T^m
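A minimal numerical sketch of these facts (with made-up entries, not the matrices of the 3-parameter example below): absorbing states are the rows of T equal to a standard basis vector, and the rows of a high power T^m approximate the limiting probabilities.

```python
import numpy as np

# Hypothetical 3-state chain: s_0 and s_2 are absorbing, s_1 can move to either.
T = np.array([
    [1.0,  0.0, 0.0 ],   # absorbing state: single entry 1, all others 0
    [0.25, 0.5, 0.25],
    [0.0,  0.0, 1.0 ],   # absorbing state
])

# (T^m)_ij = probability of going from s_i to s_j in exactly m steps
Tm = np.linalg.matrix_power(T, 200)   # large m approximates T^infinity

# detect absorbing states: diagonal entry equal to 1
absorbing = [i for i in range(3) if np.isclose(T[i, i], 1.0)]
print(absorbing)           # [0, 2]: two absorbing states (a local maxima problem)
print(np.round(Tm[1], 3))  # from s_1, the mass splits evenly between them
```

With two absorbing states, the limit rows show the chain getting trapped: starting from s_1 the learner reaches each absorbing state with probability 1/2.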

Transition Matrix for the 3-parameter example
[8×8 transition matrix T, with entries including 1/2, 1/6, 1/3, 3/4, 5/6, 5/8, 2/3, 1/8, 1/36, 8/9]
- local maxima problem: two absorbing states

Limiting probabilities matrix T^∞ for the 3-parameter example
[8×8 limiting matrix T^∞, with entries including 2/5 and 3/5]
- for learnability irrespective of the initial state, T^∞ would need a column of 1s at the target state
- here, starting at s_2 or s_4 the learner ends up at s_2 (a local maximum) instead of the target s_5; the initial states s_5, s_6, s_7, s_8 converge to the correct target s_5
- starting at s_1 or s_3, the learner reaches the true target s_5 with probability 3/5 and the false target s_2 with probability 2/5

- the last case in the example shows that there are initial states from which there is a triggered path to the target, but the learner takes that path not with probability 1, only with a smaller probability
- taking the same 3-parameter model, but with target state s_1, gives a transition matrix
  [8×8 matrix T, with entries including 1/6, 5/6, 5/8, 2/3, 1/8, 8/9, 23/36, 5/36, 7/8]
- then there are no other local maxima, and T^∞ has a first column of 1s

Eigenvalues and eigenvectors of transition matrices
- the matrix T is stochastic: T_ij ≥ 0 and ∑_j T_ij = 1 for all i
- Perron-Frobenius theorem: if T is irreducible (some power T^m has all entries (T^m)_ij > 0), then:
  - the spectral radius ρ(T) = 1 is the PF eigenvalue
  - the PF eigenvalue is simple
  - there is a PF (left) eigenvector v with all v_i > 0 (the right PF eigenvector is uniform: v_i = 1)
  - the period h = gcd{m : (T^m)_ii > 0} is the number of eigenvalues with |λ| = 1
- but the irreducibility condition means the graph is strongly connected (every vertex is reachable from every other vertex)... this in general does not happen with these Markov chains: in general T is not irreducible
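These Perron-Frobenius facts can be checked numerically; a sketch on a hypothetical irreducible 3-state stochastic matrix (all entries made up for illustration):

```python
import numpy as np

# Hypothetical irreducible stochastic matrix: T^2 already has all entries > 0.
T = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
])
assert (np.linalg.matrix_power(T, 2) > 0).all()   # irreducibility witness

# Spectral radius equals the PF eigenvalue 1, and it is simple.
lam = np.linalg.eigvals(T)
rho = max(abs(lam))
assert np.isclose(rho, 1.0)
assert sum(np.isclose(abs(lam), 1.0)) == 1        # one eigenvalue on the circle: period h = 1

# Left PF eigenvector: eigenvector of T transposed for eigenvalue 1, all entries > 0.
w, vl = np.linalg.eig(T.T)
v = np.real(vl[:, np.argmin(abs(w - 1.0))])
v = v / v.sum()
print(v)   # all positive; uniform here because this T happens to be doubly stochastic
```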

non-irreducible transition matrices T of a Markov Chain
- λ = 1 is always an eigenvalue; all other eigenvalues have |λ| ≤ 1
- the multiplicity of λ = 1 is the number of closed classes C_i in the decomposition of the Markov chain
- if T has a basis of linearly independent left eigenvectors v_i, with v_i T = λ_i v_i (and right eigenvectors w_i, with T w_i = λ_i w_i), then
  T^m = ∑_i λ_i^m w_i v_i
  a linear combination of the matrices w_i v_i (independent of m) with coefficients λ_i^m
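The decomposition T^m = ∑_i λ_i^m w_i v_i is easy to verify numerically; a sketch on a hypothetical diagonalizable 2-state chain:

```python
import numpy as np

# Hypothetical diagonalizable 2-state chain.
T = np.array([
    [0.9, 0.1],
    [0.4, 0.6],
])
lam, W = np.linalg.eig(T)   # columns of W: right eigenvectors w_i
V = np.linalg.inv(W)        # rows of V: matching left eigenvectors v_i (v_i T = lam_i v_i)

# T^m = sum_i lam_i^m w_i v_i: fixed rank-one pieces w_i v_i, coefficients lam_i^m
m = 5
Tm = sum(lam[i]**m * np.outer(W[:, i], V[i, :]) for i in range(2))
assert np.allclose(Tm, np.linalg.matrix_power(T, m))
```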

- initial probabilities π_i ≥ 0, with ∑_i π_i = 1, of having state s_i as the initial state
- after m steps: π^(m)_i = ∑_j π_j (T^m)_ji
- limiting distribution: π^(∞)_i = ∑_j π_j (T^∞)_ji = lim_{m→∞} ∑_j π_j (T^m)_ji, the probability of the learner approaching state s_i in the limit
- if the target state (say s_1) is learnable, then π^(∞)_1 = 1 and π^(∞)_i = 0 for i ≠ 1

Rate of convergence
- the rate of convergence of π^(m)_i to π^(∞)_i is the rate of convergence of T^m to T^∞, which is the rate of convergence λ_i^m → 0 for the eigenvalues with |λ_i| < 1
- ||π^(m) − π^(∞)|| = ||∑_{i≥2} λ_i^m π w_i v_i|| ≤ max_{i≥2} {|λ_i|^m} ∑_{i≥2} ||π w_i v_i||
- so estimate the rate of decay of the second largest eigenvalue
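A sketch of this estimate on a hypothetical 2-state chain (eigenvalues 1 and 0.5): the distance from the limiting distribution shrinks by a factor |λ_2| per step.

```python
import numpy as np

# Hypothetical 2-state chain with eigenvalues 1 and 0.5.
T = np.array([
    [0.9, 0.1],
    [0.4, 0.6],
])
lam = np.linalg.eigvals(T)
lam2 = float(sorted(abs(lam))[-2])    # second largest eigenvalue modulus: 0.5

pi0 = np.array([1.0, 0.0])            # deterministic initial state s_1
pi_inf = pi0 @ np.linalg.matrix_power(T, 200)   # limiting distribution

# ||pi^(m) - pi^(inf)|| decays like lam2^m: compare errors 5 steps apart
err = lambda m: np.abs(pi0 @ np.linalg.matrix_power(T, m) - pi_inf).sum()
ratio = err(10) / err(5)
assert np.isclose(ratio, lam2**5)     # the second eigenvalue governs the rate
```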

Summary
1. construct the Markov chain for the parameter space
2. compute the transition matrix T
3. compute the eigenvalues
4. if the multiplicity of the eigenvalue λ = 1 is more than one: the target is unlearnable (local maxima problem)
5. if the multiplicity is one, check whether there is a basis of independent eigenvectors
6. if yes, find the rate of decay of the second largest eigenvalue: learnability at that speed
7. if not, project onto subspaces of lower dimension
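The eigenvalue steps of this checklist can be sketched as a small helper (hypothetical function name and toy matrices, assuming a basis of eigenvectors exists):

```python
import numpy as np

# Multiplicity of lambda = 1 decides learnability; the second largest
# |lambda| gives the speed of convergence.
def learnability_report(T, tol=1e-6):
    lam = np.linalg.eigvals(T)
    mult_one = int(np.sum(np.abs(lam - 1.0) < tol))
    rate = float(sorted(np.abs(lam))[-2])
    return {"learnable": mult_one == 1, "rate": rate}

# two absorbing states -> lambda = 1 has multiplicity 2 -> local maxima problem
T_bad = np.array([[1.0,  0.0, 0.0 ],
                  [0.25, 0.5, 0.25],
                  [0.0,  0.0, 1.0 ]])
print(learnability_report(T_bad)["learnable"])   # False

# unique absorbing state at the target -> multiplicity 1, rate |lambda_2| = 0.5
T_good = np.array([[1.0, 0.0],
                   [0.5, 0.5]])
print(learnability_report(T_good))
```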

Markov Chains and Learning Algorithms
- how broad is the Markov chain method in modeling learning algorithms?
- suppose given an algorithm A : D → H, with hypotheses h_n = G and h_{n+1} = G'
- the probability of passing from A(τ_n) = h_n = G to A(τ_{n+1}) = h_{n+1} = G' at the (n+1)-st input is the measure of the set
  A_{n,G} ∩ A_{n+1,G'} = {τ : A(τ_n) = G} ∩ {τ : A(τ_{n+1}) = G'}
- the measure is taken with respect to the measure μ on A^ω determined by a measure μ on A (supported on the target language L: positive examples only)
- P(h_{n+1} = G' | h_n = G) = μ(A_{n,G} ∩ A_{n+1,G'}) / μ(A_{n,G}), assuming μ(A_{n,G}) > 0

Inhomogeneous Markov Chain
- state space = set of possible grammars H = set of possible (binary) syntactic parameters
- transition matrix at the n-th step:
  T_n(s, s') = P(s → s') = P(A(τ_{n+1}) = G_{s'} | A(τ_n) = G_s) = μ(A_{n,G_s} ∩ A_{n+1,G_{s'}}) / μ(A_{n,G_s})
- these satisfy ∑_{s'} T_n(s, s') = 1 for all s
- to define T_n(s, s') also when μ(A_{n,G_s}) = 0, take a set of α_{s'} > 0 with ∑_{s'} α_{s'} = 1 and set T_n(s, s') = α_{s'} when μ(A_{n,G_s}) = 0
- the transition matrix T = T_n is time dependent

Conclusion: any deterministic learning algorithm A : D → H can be modeled by an inhomogeneous Markov chain
- the inhomogeneous Markov chain depends on A, on the target language L_G, and on the measure μ
- memoryless learner hypothesis: A(τ_{n+1}) = F(A(τ_n), τ(n+1)), so that
  T_n(s, s') = T(s, s') = μ({x ∈ A : F(s, x) = s'})
  independent of n
- memory limited learning algorithms: m-memory limited if A(τ_n) depends only on the last m sentences of the text τ and on the previous grammatical hypothesis
- Fact: if H is learnable by a memory limited algorithm, then it is in fact learnable by a memoryless algorithm
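For a memoryless learner, the transition matrix T(s, s') = μ({x : F(s, x) = s'}) can be tabulated directly; a toy sketch with a hypothetical update rule F and a made-up measure μ on three sentence types:

```python
import numpy as np

# Toy memoryless learner: 2 grammars, 3 sentence types with measure mu,
# and a hypothetical deterministic update rule F (all numbers made up).
states = [0, 1]
sentences = [0, 1, 2]
mu = {0: 0.5, 1: 0.3, 2: 0.2}

def F(s, x):
    # sentence 0 keeps the current hypothesis; 1 triggers grammar 0; 2 triggers grammar 1
    return {0: s, 1: 0, 2: 1}[x]

# T(s, s') = mu({x in A : F(s, x) = s'}): independent of n, so the chain is homogeneous
T = np.zeros((len(states), len(states)))
for s in states:
    for x in sentences:
        T[s, F(s, x)] += mu[x]

print(T)   # rows [0.8, 0.2] and [0.3, 0.7]: each row a probability distribution
```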