Models of Language Acquisition: Part II

1 Models of Language Acquisition: Part II
Matilde Marcolli
CS101: Mathematical and Computational Linguistics
Winter 2015

2 Probably Approximately Correct Model of Language Learning
- General setting of Statistical Learning Theory: the objects of learning are functions
- concept class: set F of possible target functions
- hypothesis class: set H of functions f : X → Y
- typically assume F = H
- for the language case: X = A, the set of all possible strings on an alphabet, and Y = {0, 1}
- given a language L ⊆ A, consider the associated indicator function (characteristic function) χ_L : A → {0, 1}, with χ_L(x) = 1 if x ∈ L and χ_L(x) = 0 if x ∉ L

3 Language L can be seen as
1. a recursively enumerable subset L ⊆ A
2. a Turing machine (program) that recognizes L
3. its indicator function χ_L
Distance function on the space of languages (something better than the discrete 0/1 metric): the L¹(P)-distance
- use a probability measure P on A (Bernoulli, Markov, ...)
- define the L¹(P)-distance as d_P(L, L′) = Σ_{s ∈ A} |χ_L(s) − χ_{L′}(s)| P(s)
- ɛ-neighborhood of a language L: N_ɛ(L) = {L′ : d_P(L, L′) < ɛ}
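A minimal Python sketch (not from the slides) of the indicator function and the L¹(P)-distance, assuming toy languages given as finite sets of strings and a measure P supported on finitely many strings, so the sum is finite:

```python
def chi(L):
    """Indicator function chi_L of a language L given as a set of strings."""
    return lambda s: 1 if s in L else 0

def d_P(L1, L2, P):
    """d_P(L1, L2) = sum_s |chi_L1(s) - chi_L2(s)| * P(s), with P a dict string -> prob."""
    c1, c2 = chi(L1), chi(L2)
    return sum(abs(c1(s) - c2(s)) * p for s, p in P.items())

# Hypothetical example over the alphabet {a, b}, P concentrated on a few short strings.
P = {"a": 0.3, "b": 0.3, "aa": 0.2, "ab": 0.1, "ba": 0.1}
L1 = {"a", "aa", "ab"}
L2 = {"a", "ab", "ba"}
print(d_P(L1, L2, P))  # approximately 0.3: the languages disagree only on "aa" and "ba"
```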

4 Examples are randomly presented to a learner according to the probability distribution P
- Note: both positive and negative examples
- view examples as pairs (x, y) with x ∈ A and y = χ_L(x)
- data set: D = ∪_k D_k, where D_k = {(z_1, ..., z_k) : z_i = (x_i, y_i), x_i ∈ A, y_i ∈ {0, 1}}
- learning algorithm A : D → H
- after k data points the learner conjectures a function ĥ_k ∈ H
- procedure that minimizes the empirical risk:
  ĥ_k = arg min_{h ∈ H} (1/k) Σ_{j=1}^{k} |y_j − h(x_j)|
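For concreteness, a small sketch of empirical risk minimization over a finite hypothesis class; the hypotheses, sample, and implicit target language here are made up purely for illustration:

```python
def empirical_risk(L, data):
    """(1/k) * sum over the labeled sample of |y_j - chi_L(x_j)|."""
    return sum(abs(y - (1 if x in L else 0)) for x, y in data) / len(data)

def erm(H, data):
    """Return a hypothesis in H with minimal empirical risk on the sample."""
    return min(H, key=lambda L: empirical_risk(L, data))

# Labeled examples (x, y) with y = chi_L(x) for a hypothetical target language {"a", "ab"}.
data = [("a", 1), ("b", 0), ("ab", 1), ("ba", 0), ("aa", 0)]
H = [frozenset({"a"}), frozenset({"a", "ab"}), frozenset({"a", "ab", "aa"})]
print(sorted(erm(H, data)))  # ['a', 'ab'] -- this hypothesis achieves zero empirical risk
```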

5 ĥ_k = A(δ_k), with δ_k ∈ D_k a random element
- successful learning: the hypothesis ĥ_k converges to the target χ_L as k → ∞
- the hypothesis ĥ_k is a random function (because δ_k ∈ D_k is random), so convergence is needed in a probabilistic sense:
  lim_{k → ∞} P( d_P(ĥ_k, χ_L) > ɛ ) = 0
- this means weak convergence of the random variables, ĥ_k = A(δ_k) →_w χ_L:
  E_P( |ĥ_k(s) − χ_L(s)| ) = Σ_{s ∈ A} |ĥ_k(s) − χ_L(s)| P(s) → 0
- Note the double role of the probability P: it defines the L¹(P)-distance used for convergence, and it also governs the drawing of the random data δ_k ∈ D_k

6 target function: the minimizer of the expected risk, χ_L = arg min_h E_P( |h(t) − χ_L(t)| ), with t drawn according to P
- a set of elements S = {s_1, ..., s_n} in A is shattered by the set of functions H if, for every binary vector b = (b_1, ..., b_n), there is a function h_b ∈ H such that h_b(s_i) = 1 iff b_i = 1
- this means that for every way of partitioning the set S into two parts, there is a function in H that implements the partition (so H must have at least 2^n elements)
- the Vapnik-Chervonenkis dimension of H is D if there is at least one set of D elements that is shattered by H and no set of D + 1 elements is (if no such D exists, then dim_VC H = ∞)
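The shattering condition can be checked by brute force on small examples. The sketch below is illustrative only: the threshold-language class over a tiny numeric domain is a made-up toy, not one of the classes discussed in the slides.

```python
from itertools import combinations

def shatters(H, S):
    """True if every binary labeling of S is realized by some h in H."""
    realized = {tuple(h(x) for x in S) for h in H}
    return len(realized) == 2 ** len(S)

def vc_dimension(H, domain):
    """Largest D such that some D-element subset of the domain is shattered by H."""
    D = 0
    for d in range(1, len(domain) + 1):
        if any(shatters(H, S) for S in combinations(domain, d)):
            D = d
    return D

# Toy class: threshold indicators h_t(x) = 1 iff x >= t, on the domain {0, ..., 4}.
domain = list(range(5))
H = [(lambda t: (lambda x: 1 if x >= t else 0))(t) for t in range(6)]
print(vc_dimension(H, domain))  # 1: single points are shattered, but no pair is
```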

7 Learnability Fact
- for a set H of languages (identified with their indicator functions χ_L): the languages L of H are learnable iff the Vapnik-Chervonenkis dimension satisfies dim_VC H < ∞
- here learnability means weak convergence ĥ_k = A(δ_k) →_w χ_L
- with ĥ_k = χ_{L̂_k} given by empirical risk minimization:
  L̂_k = arg min_L (1/k) Σ_{j=1}^{k} |y_j − χ_L(x_j)|
- why is a finite Vapnik-Chervonenkis dimension needed? Lower bound on learnability...

8 Lower bound for learning
- suppose the Vapnik-Chervonenkis dimension is dim_VC H = D
- construct a probability distribution P on A with respect to which the learner needs to draw at least
  m ≥ (D/4) log₂(3/2) + log₂(1/(8δ))
  examples in order to have P( d(ĥ_m, h) > ɛ ) < δ
- so in particular, if D = ∞, one does not have learnability (for this P)
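A quick numerical illustration of how this lower bound scales; the values of D and δ below are arbitrary:

```python
from math import log2

def sample_lower_bound(D, delta):
    """m must be at least this large to have P(d(h_m, h) > eps) < delta (for eps < 1/8)."""
    return (D / 4) * log2(3 / 2) + log2(1 / (8 * delta))

for D in (10, 100, 1000):
    for delta in (0.05, 0.01):
        print(f"D = {D:4d}, delta = {delta}: need m >= {sample_lower_bound(D, delta):.1f}")
# The bound grows linearly in D, so if dim_VC H is infinite no finite sample suffices.
```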

9 Construction of P
- since dim_VC H = D, there is a set x_1, ..., x_D that is shattered by H: assign measure P(x_i) = 1/D to these points and measure zero to all other points in A
- in this measure two functions h_1, h_2 ∈ H have distance d_P(h_1, h_2) = 0 iff they agree on all the points x_i
- mod out H by the equivalence relation h_1 ∼ h_2 iff h_1(x_i) = h_2(x_i) for all 1 ≤ i ≤ D: the set of equivalence classes H̄ = H/∼ has 2^D points

10 Partitioning of H̄
- draw a sequence z = (z_1, ..., z_m) of random data according to the probability distribution P
- suppose z contains l distinct elements among X = {x_1, ..., x_D} (the remaining D − l do not occur in z)
- there are then 2^l possible ways in which a potential candidate target function h ∈ H̄ = H/∼ can label z = z_h
- these choices determine a partitioning of H̄ into disjoint subsets H̄ = H̄_1 ∪ ... ∪ H̄_{2^l}
- each H̄_i in this partition contains exactly 2^{D−l} different functions, all agreeing on the l distinct elements in z

11 Estimate of the sum
  Σ_{h ∈ H̄} d(A(z_h), h) = Σ_{i=1}^{2^l} Σ_{h ∈ H̄_i} d(A(z_h), h)
- the 2^{D−l} functions in H̄_i all agree on the data set z_h, while on the remaining D − l elements of X the functions h and A(z_h) in H̄_i may disagree somewhere
- if A(z_h) and h disagree in j places then d(A(z_h), h) ≥ j/D, and this can happen in C(D−l, j) possible ways (binomial coefficient), so
  Σ_{h ∈ H̄_i} d(A(z_h), h) ≥ Σ_{j=0}^{D−l} C(D−l, j) (j/D) = 2^{D−l} (D−l)/(2D)
  Σ_{h ∈ H̄} d(A(z_h), h) ≥ 2^D (D−l)/(2D)
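The middle equality uses the binomial identity Σ_j C(n, j)·j = n·2^{n−1} with n = D − l; a short numerical check of that identity:

```python
from math import comb

# sum_{j=0}^{n} C(n, j) * j equals n * 2^(n-1); dividing by D and setting n = D - l
# gives the 2^(D-l)(D-l)/(2D) bound stated above.
for n in range(1, 9):
    assert sum(comb(n, j) * j for j in range(n + 1)) == n * 2 ** (n - 1)
print("identity sum_j C(n,j)*j = n*2^(n-1) verified for n = 1,...,8")
```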

12 Candidate target function
- set S_l = {z : z has l distinct elements}; then
  (1/2^D) Σ_{z ∈ S_l} P(z) Σ_{h ∈ H̄} d(A(z), h) ≥ (D−l)/(2D) P(S_l)
- change the order of summation: (1/2^D) Σ_h Σ_z P(z) d(A(z), h); for the inequality to hold there must be at least one h = h* with
  Σ_{z ∈ S_l} P(z) d(A(z), h*) ≥ (D−l)/(2D) P(S_l)
- this h = h* is a candidate target function, with a certain estimate of the inaccuracy of the learning hypothesis

13 Inaccuracy estimate
- set of draws of m data on which the learner's hypothesis A(z) differs from the candidate target h* by more than a given size β:
  S_β = {z ∈ S_l : d(A(z), h*) > β}
- lower bound on P(S_β):
  (D−l)/(2D) P(S_l) ≤ Σ_{z ∈ S_β} P(z) d(A(z), h*) + Σ_{z ∈ S_l ∖ S_β} P(z) d(A(z), h*) ≤ P(S_β) + β (P(S_l) − P(S_β))
- this gives P(S_β) ≥ (1 − β) P(S_β) ≥ ( (D−l)/(2D) − β ) P(S_l)

14 Arrange for P(S_ɛ) > δ
- if the target is h*, then with probability P(S_ɛ) the learner's hypothesis is more than ɛ away from the target
- take the arbitrary l to be l = D/2 and ɛ < 1/8; then ( (D−l)/(2D) − ɛ ) P(S_l) > (1/8) P(S_l)
- so if P(S_{D/2}) > 8δ one also gets P(S_ɛ) > δ
- thus one can arrange that the probability of the learner's hypothesis differing from the target by more than ɛ is greater than δ
- it remains to find conditions for P(S_{D/2}) > 8δ

15 Arrange for P(S_{D/2}) > 8δ
- P(S_l) = probability of drawing l distinct elements of X = {x_1, ..., x_D} in m identically distributed trials
- there are C(D, l) ways of choosing the l elements; for each choice there are l! ways in which the items can appear in the first l positions
- S_l^(i) ⊂ S_l: set of all z = (z_1, ..., z_m) with the i-th choice of placement of the l distinct elements in the first l positions (the remaining m − l positions contain the same l elements, disposed in any way)
- P(S_l^(i)) = (1/D)^l (l/D)^{m−l}
- the S_l^(i) are disjoint, so P(S_l) ≥ C(D, l) l! P(S_l^(i)) = C(D, l) l! (1/D)^l (l/D)^{m−l}
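As a sanity check (purely illustrative, with arbitrary small values of D, l, m), one can compare a Monte Carlo estimate of P(S_l) with this combinatorial lower bound:

```python
import random
from math import comb, factorial

# Arbitrary small values, chosen only for the demonstration.
D, l, m, trials = 10, 5, 8, 200_000

# Count draws of m uniform samples from {0, ..., D-1} that hit exactly l distinct values.
hits = sum(
    1 for _ in range(trials)
    if len({random.randrange(D) for _ in range(m)}) == l
)
monte_carlo = hits / trials
bound = comb(D, l) * factorial(l) * (1 / D) ** l * (l / D) ** (m - l)
print(f"P(S_l) ~ {monte_carlo:.3f}  >=  combinatorial lower bound {bound:.3f}")
```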

16 for D = 2l one has C(D, l) l! (1/D)^l (l/D)^{m−l} = (2l)! / (l! l^l) · 2^{−m}
- also (2l)! / (l! l^l) = Π_{j=1}^{l} (1 + j/l) ≥ (3/2)^{l/2}
- then P(S_{D/2}) ≥ 2^{−m} (3/2)^{D/4}, so P(S_{D/2}) > 8δ holds whenever
  m < (D/4) log₂(3/2) + log₂(1/(8δ))
- conclusion: in the constructed probability P the learner needs at least m larger than the above bound to achieve P( d(A(z), h*) > ɛ ) < δ
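A quick numerical check of the product inequality (2l)!/(l! l^l) = Π_{j=1}^{l} (1 + j/l) ≥ (3/2)^{l/2} used in this step:

```python
from math import factorial, prod

for l in (2, 4, 8, 16):
    lhs = factorial(2 * l) / (factorial(l) * l ** l)
    via_product = prod(1 + j / l for j in range(1, l + 1))
    rhs = (3 / 2) ** (l / 2)
    assert abs(lhs - via_product) < 1e-9 * lhs and lhs >= rhs
    print(f"l = {l:2d}: (2l)!/(l! l^l) = {lhs:8.2f} >= (3/2)^(l/2) = {rhs:6.2f}")
```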

17 Unlearnability
- the problem remains!
- the set of all finite languages is unlearnable
- the set of all regular languages is unlearnable
- the set of all context-free languages is unlearnable
- so impose further constraints on learning: limit the size of grammars... a constraint on the number of production rules

18 Example
- H_{n,k} = class of regular grammars recognized by deterministic finite state automata with at most n states, over a fixed number of letters #A = k
- size of this family: #H_{n,k} ≤ n^{nk}
- then the Vapnik-Chervonenkis dimension satisfies dim_VC H_{n,k} ≤ log₂(n^{nk}) = nk log₂(n)
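A rough sketch of what this finite bound gives numerically; the counting here is the crude n^{nk} transition-table bound from the slide (ignoring accepting-state and start-state choices), and the (n, k) values are arbitrary:

```python
from math import log2

def vc_upper_bound(n, k):
    """dim_VC H_{n,k} <= log2(#H_{n,k}) <= n * k * log2(n)."""
    return n * k * log2(n)

for n, k in [(2, 2), (5, 2), (10, 26)]:
    print(f"n = {n:2d} states, k = {k:2d} letters: dim_VC <= {vc_upper_bound(n, k):.1f}")
# Finite for every fixed n and k, so bounded-size automata escape the unlearnability problem.
```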
