Learning probabilistic finite automata
Colin de la Higuera
University of Nantes
Zadar, August 2010
Acknowledgements
Laurent Miclet, Jose Oncina, Tim Oates, Rafael Carrasco, Paco Casacuberta, Rémi Eyraud, Philippe Ezequel, Henning Fernau, Thierry Murgue, Franck Thollard, Enrique Vidal, Frédéric Tantini, ... The list is necessarily incomplete. Excuses to those that have been forgotten.
http://pagesperso.lina.univ-nantes.fr/~cdlh/slides/
Chapters 5 and 16
Outline
1. PFA
2. Distances between distributions
3. FFA
4. Basic elements for learning PFA
5. ALERGIA
6. MDI and DSAI
7. Open questions
1 PFA
Probabilistic finite (state) automata
Practical motivations
(Computational biology, speech recognition, web services, automatic translation, image processing...)
A lot of positive data
Not necessarily any negative data
No ideal target
Noise
The grammar induction problem, revisited
The data consists of positive strings, «generated» following an unknown distribution.
The goal is now to find (learn) this distribution, or the grammar/automaton that is used to generate the strings.
Success of the probabilistic models
n-grams
Hidden Markov Models
Probabilistic grammars
[diagram: a DPFA over {a, b} with probabilities on transitions and states]
DPFA: Deterministic Probabilistic Finite Automaton
[diagram: the same DPFA]
Worked example: the probability Pr_A of a string is the product of the transition probabilities along its unique path, times the final probability.
[diagram: a DPFA over {a, b} with probabilities 0.7, 0.9, 0.35, 0.65, 0.3, ...]
[diagram: a PFA over {a, b}]
PFA: Probabilistic Finite (state) Automaton
[diagram: a PFA with ε-transitions]
ε-PFA: Probabilistic Finite (state) Automaton with ε-transitions
How useful are these automata?
They can define a distribution over Σ*
They do not tell us if a string belongs to a language
They are good candidates for grammar induction
There is (was?) not that much written theory
Basic references
The HMM literature
Azaria Paz 1973: Introduction to probabilistic automata
Chapter 5 of my book
Probabilistic Finite-State Machines, Vidal, Thollard, cdlh, Casacuberta & Carrasco
Grammatical Inference papers
Automata, definitions
Let D be a distribution over Σ*:
0 ≤ Pr_D(w) ≤ 1
Σ_{w∈Σ*} Pr_D(w) = 1
A Probabilistic Finite (state) Automaton is ⟨Q, Σ, I_P, F_P, δ_P⟩
Q: set of states
I_P : Q → [0;1]
F_P : Q → [0;1]
δ_P : Q × Σ × Q → [0;1]
What does a PFA do?
It defines the probability of each string w as the sum (over all paths reading w) of the products of the probabilities:
Pr_A(w) = Σ_{π_i ∈ paths(w)} Pr(π_i)
where π_i = q_{i_0} a_{i_1} q_{i_1} ... a_{i_n} q_{i_n} and
Pr(π_i) = I_P(q_{i_0}) · F_P(q_{i_n}) · Π_j δ_P(q_{i_{j-1}}, a_{i_j}, q_{i_j})
Note that if λ-transitions are allowed the sum may be infinite
[diagram: a non-deterministic PFA]
Pr(ab) = 0.7·0.4·0.1·1 + 0.7·0.4·0.45·0.2 = 0.028 + 0.0252 = 0.0532
Non-deterministic PFA: many initial states, or only one initial state
λ-PFA: a PFA with λ-transitions and perhaps many initial states
DPFA: deterministic PFA
Consistency
A PFA is consistent if
Pr_A(Σ*) = 1
∀x ∈ Σ* : 0 ≤ Pr_A(x) ≤ 1
Consistency theorem
A is consistent if every state is useful (accessible and co-accessible) and
∀q ∈ Q : F_P(q) + Σ_{q'∈Q, a∈Σ} δ_P(q, a, q') = 1
Equivalence between models
Equivalence between PFA and HMM
But the HMMs usually define distributions over each Σ^n
A football HMM
[diagram: an HMM whose states emit win / draw / lose with various probabilities]
Equivalence between PFA with λ-transitions and PFA without λ-transitions
cdlh 2003, Hanneforth & cdlh 2009
Many initial states can be transformed into one initial state with λ-transitions;
λ-transitions can be removed in polynomial time.
Strategy: number the states, eliminate first the λ-loops, then the transitions with the highest ranking arrival state
PFA are strictly more powerful than DPFA
Folk theorem (and true!)
You can't even tell in advance if you are in a good case or not (see: Denis & Esposito 2004)
Example
[diagram: a PFA]
This distribution cannot be modelled by a DPFA
What does a DPFA over Σ = {a} look like?
[diagram]
And with this architecture you cannot generate the previous one
Parsing issues
Computation of the probability of a string or of a set of strings
Deterministic case: simple, apply the definitions
Technically, rather sum up logs: this is easier, safer and cheaper
[diagram: the DPFA from before]
Pr(aba) = 0.7·0.9·0.35·0 = 0
Pr(abb) = 0.7·0.9·0.65·0.3 = 0.12285
Non-deterministic case
[diagram: the PFA from before]
Pr(ab) = 0.7·0.4·0.1·1 + 0.7·0.4·0.45·0.2 = 0.028 + 0.0252 = 0.0532
In the literature
The computation of the probability of a string is done by dynamic programming, in O(n·m²) time (m states, string of length n): the Backward and Forward algorithms
If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm
Forward algorithm
A[i, j] = Pr(q_i, a_1...a_j)
(the probability of being in state q_i after having read a_1...a_j)
A[i, 0] = I_P(q_i)
A[i, j+1] = Σ_{k∈Q} A[k, j] · δ_P(q_k, a_{j+1}, q_i)
Pr(a_1...a_n) = Σ_{k∈Q} A[k, n] · F_P(q_k)
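The recurrence above can be sketched directly in code. The dict-based PFA encoding is mine, not the slides': delta maps (state, symbol, state) to a probability.

```python
# Sketch of the Forward algorithm for a PFA without lambda-transitions.

def forward(word, states, init, delta, final):
    """Pr_A(word) by dynamic programming: A[q] holds the probability of
    being in state q after reading the prefix consumed so far."""
    A = {q: init.get(q, 0.0) for q in states}
    for symbol in word:
        B = {q: 0.0 for q in states}
        for (p, a, r), w in delta.items():
            if a == symbol:
                B[r] += A[p] * w  # sum over all paths, as in the definition
        A = B
    return sum(A[q] * final.get(q, 0.0) for q in states)

# Toy DPFA: q0 loops on 'a' (0.5), moves to q1 on 'b' (0.3), halts (0.2).
states = ["q0", "q1"]
init = {"q0": 1.0}
delta = {("q0", "a", "q0"): 0.5, ("q0", "b", "q1"): 0.3}
final = {"q0": 0.2, "q1": 1.0}
print(forward("ab", states, init, delta, final))  # 0.5 * 0.3 * 1.0 = 0.15
```

On a non-deterministic PFA the inner loop simply accumulates over every matching transition, which is exactly the sum over paths in the definition.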
2 Distances
What for?
Estimate the quality of a language model
Have an indicator of the convergence of learning algorithms
Construct kernels
2.1 Entropy
How many bits do we need to correct our model?
Two distributions over Σ*: D and D′
Kullback-Leibler divergence (or relative entropy) between D and D′:
Σ_{w∈Σ*} Pr_D(w) · (log Pr_D(w) − log Pr_D′(w))
2.2 Perplexity
The idea is to allow the computation of the divergence, but relatively to a test set (S)
An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set
(Π_{w∈S} Pr_D(w))^(−1/|S|) = 1 / (Π_{w∈S} Pr_D(w))^(1/|S|)
Problem if some probability is null...
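A minimal sketch of this computation (the function name is mine); the logs avoid numerical underflow on long test sets, and the null-probability problem shows up as an infinite perplexity:

```python
import math

# Perplexity of a test set: the inverse geometric mean of the model
# probabilities assigned to its elements.

def perplexity(probs):
    """probs: the values Pr_D(w) for each w in the test set S.
    Returns +inf when some probability is null, as the slide warns."""
    if any(p == 0.0 for p in probs):
        return math.inf
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

print(perplexity([0.25, 0.25]))  # inverse of the geometric mean 0.25: ~4.0
print(perplexity([0.5, 0.0]))    # inf: one unseen string ruins the score
```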
Why multiply? (1)
We are trying to compute the probability of independently drawing the different strings in the set S
Why multiply? (2)
Suppose we have two predictors for a coin toss
Predictor 1: heads 60%, tails 40%
Predictor 2: heads 100%
The tests are H: 6, T: 4
Arithmetic mean: P1: 36% + 16% = 0.52; P2: 0.6
Predictor 2 is the better predictor ;-)
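The point of the slide is that the arithmetic mean is the wrong score: the geometric mean (which is what perplexity inverts) gives predictor 2 a score of 0, since it assigned the observed tails probability 0. A sketch of the same example (function names are mine):

```python
import math

# The coin example: 10 tosses, 6 heads and 4 tails, scored two ways.

outcomes = ["H"] * 6 + ["T"] * 4

def arithmetic_mean(pr_h, pr_t):
    return sum(pr_h if o == "H" else pr_t for o in outcomes) / len(outcomes)

def geometric_mean(pr_h, pr_t):
    product = math.prod(pr_h if o == "H" else pr_t for o in outcomes)
    return product ** (1 / len(outcomes))

# Predictor 1: heads 60%, tails 40%; Predictor 2: heads 100%.
print(arithmetic_mean(0.6, 0.4))  # ~0.52: loses to predictor 2's 0.6
print(geometric_mean(0.6, 0.4))   # ~0.51
print(geometric_mean(1.0, 0.0))   # 0.0: predictor 2 gave the sample probability 0
```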
2.3 Distance d2
d2(D, D′) = √( Σ_{w∈Σ*} (Pr_D(w) − Pr_D′(w))² )
Can be computed in polynomial time if D and D′ are given by PFA (Carrasco & cdlh 2002)
This also means that the equivalence of PFA is in P
3 FFA
Frequency Finite (state) Automata
A learning sample is a multiset
Strings appear with a frequency (or multiplicity):
S = {λ(3), ...}
DFFA
A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition, and for entering the initial state, such that the sum of what enters a state is equal to the sum of what exits it, and the sum of what halts is equal to what starts
Example
[diagram: a DFFA with frequencies on its states and transitions]
From DFFA to DPFA
Frequencies become relative frequencies by dividing by the sum of the exiting frequencies
[diagram: the same automaton with each frequency divided by its state total]
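The normalisation is a one-liner per state. A sketch, with a dict-based encoding of my choosing:

```python
# Normalising a DFFA into a DPFA: every frequency is divided by the total
# frequency leaving (or halting in) its source state.

def dffa_to_dpfa(trans_freq, halt_freq):
    """trans_freq: (state, symbol) -> count; halt_freq: state -> count.
    Returns transition and halting probabilities as relative frequencies."""
    total = dict(halt_freq)
    for (q, _a), c in trans_freq.items():
        total[q] = total.get(q, 0) + c
    trans_prob = {(q, a): c / total[q] for (q, a), c in trans_freq.items()}
    halt_prob = {q: c / total[q] for q, c in halt_freq.items()}
    return trans_prob, halt_prob

# Toy DFFA: out of 20 passages through q0, 10 follow 'a', 6 follow 'b', 4 halt.
tp, hp = dffa_to_dpfa({("q0", "a"): 10, ("q0", "b"): 6}, {"q0": 4})
print(tp[("q0", "a")], hp["q0"])  # 0.5 0.2
```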
From a DFA and a sample to a DFFA
S = {λ, ...}
[diagram: the DFA with the counts obtained by parsing the sample]
Note
Another sample may lead to the same DFFA
Doing the same with an NFA is a much harder problem
Typically what the Baum-Welch algorithm (EM) has been invented for
The frequency prefix tree acceptor
The data is a multiset
The FTA is the smallest tree-like FFA consistent with the data
It can be transformed into a PFA if needed
From the sample to the FTA
[diagram: FTA(S), the prefix tree with frequencies on states and transitions]
Red, Blue and White states
Red states are the confirmed states
Blue states are the (non-Red) successors of the Red states
White states are the others
[diagram]
Same as with DFA and what RPNI does
Merge and fold
Suppose we decide to merge state a with state λ
[diagram]
Merge and fold
First disconnect and reconnect to λ
[diagram]
Merge and fold
Then fold
[diagram]
Merge and fold
After folding
[diagram]
State merging algorithm
A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
While Blue ≠ ∅ do
  choose q from Blue such that Freq(q) ≥ t0
  if ∃p ∈ Red : d(A_p, A_q) is small
    then A = merge_and_fold(A, p, q)
    else Red = Red ∪ {q}
  Blue = {δ(q, a) : q ∈ Red, a ∈ Σ} \ Red
The real question
How do we decide if d(A_p, A_q) is small?
Use a distance
Be able to compute this distance
If possible, update the computation easily
Have properties related to this distance
Deciding if two distributions are similar
If the two distributions are known, equality can be tested
The distance (L2 norm) between distributions can be exactly computed
But what if the two distributions are unknown?
Taking decisions
Suppose we want to merge state a with state λ
[diagram]
Taking decisions
Yes, if the two distributions induced are similar
[diagram: the residual distributions rooted at the two candidate states]
5 ALERGIA
Alergia's test
D1 ≡ D2 if ∀x : Pr_D1(x) = Pr_D2(x)
Easier to test:
Pr_D1(λ) = Pr_D2(λ)
∀a ∈ Σ : Pr_D1(aΣ*) = Pr_D2(aΣ*)
And do this recursively!
Of course, do it on frequencies
Hoeffding bounds
|f1/n1 − f2/n2| < √(½ · ln(2/α)) · (1/√n1 + 1/√n2)
indicates if the relative frequencies f1/n1 and f2/n2 are sufficiently close
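The test above is cheap to implement. A sketch (function name mine): two observed frequencies are judged compatible for parameter α when their gap stays below the Hoeffding-style threshold.

```python
import math

# Alergia-style compatibility test between two relative frequencies.

def different(f1, n1, f2, n2, alpha):
    """True when f1/n1 and f2/n2 are provably too far apart at level alpha."""
    gamma = abs(f1 / n1 - f2 / n2)
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) \
        * (1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2))
    return gamma > bound

# Two samples of a coin: 490 heads out of 2000 vs 30 (then 80) out of 100.
print(different(490, 2000, 30, 100, alpha=0.05))  # False: compatible
print(different(490, 2000, 80, 100, alpha=0.05))  # True: incompatible
```

Note how weakly the bound depends on α (only through a square root of a logarithm), while small sample sizes n widen it quickly.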
A run of Alergia
Our learning multisample: S = {λ(490), ...}
The parameter α is arbitrarily set to 0.05. We choose 30 as the value for the threshold t0.
Note that for the Blue states that have a frequency less than the threshold, a special merging operation takes place
[diagram: the FTA built from S, with frequencies on states and transitions]
Cn we merge λ nd? Compre λ nd, Σ* nd Σ*, bσ* nd bσ* 490/000 with 8/57, 57/000 with 64/57, 53/000 with 65/57,.... All tests return true Zdr, August 00 65
Merge
[diagram: a is merged into λ]
And fold
[diagram]
Next merge? λ with b?
[diagram]
Can we merge λ and b?
Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*
The two relative frequencies are different: the observed gap γ is larger than the Hoeffding bound √(½ · ln(2/α)) · (1/√n1 + 1/√n2)
Promotion
[diagram: b becomes Red]
Merge
[diagram]
And fold
[diagram]
Merge
[diagram]
And fold
[diagram: the final DFFA]
As a PFA:
[diagram: the same automaton with relative frequencies as probabilities]
Conclusion and logic
Alergia builds a DFFA in polynomial time
Alergia can identify DPFA in the limit with probability 1
No good definition of Alergia's properties
6 DSAI and MDI
Why not change the criterion?
Criterion for DSAI
Using distinguishable strings
Use the norm L∞
Two distributions are different if there is a string with a very different probability
Such a string is called μ-distinguishable
The question becomes: is there a string x such that |Pr_{A,q1}(x) − Pr_{A,q2}(x)| > μ?
(much more to DSAI)
D. Ron, Y. Singer, and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31-40, 1995.
PAC learnability results, in the case where the targets are acyclic graphs
Criterion for MDI
MDL-inspired heuristic
The criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?
F. Thollard, P. Dupont, and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975-982. Morgan Kaufmann, San Francisco, CA, 2000
7 Conclusion and open questions
A good candidate to learn NFA is DEES
There never has been a challenge, so the state of the art is still unclear
Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars
Appendix: Stern-Brocot trees
Identification of probabilities
If we were able to discover the structure, how do we identify the probabilities?
By estimation: the edge is used c(x) times out of 3000 passages through the state, giving the estimate c(x)/3000
[diagram]
Stern-Brocot trees (Stern 1858, Brocot 1860)
Can be constructed from two simple adjacent fractions by the «mean» (mediant) operation:
a/b m c/d = (a+c)/(b+d)
[diagram: the first levels of the Stern-Brocot tree, with fractions of denominators up to 5]
Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
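This search can be sketched as a walk down the tree by mediants. The tolerance and names below are my choices, not the slides'; probabilities lie in [0, 1], so the walk starts between 0/1 and 1/1:

```python
from fractions import Fraction

# Walk down the Stern-Brocot tree until a fraction close enough to the
# empirical value c(x)/n is found.

def stern_brocot_approx(target, tol=1e-3):
    """Return a simple fraction within tol of target (0 <= target <= 1)."""
    lo, hi = Fraction(0, 1), Fraction(1, 1)
    while True:
        med = Fraction(lo.numerator + hi.numerator,
                       lo.denominator + hi.denominator)  # the «mean» operation
        if abs(float(med) - target) <= tol:
            return med
        if float(med) < target:
            lo = med
        else:
            hi = med

# 1500 uses out of 3000 passages: the search returns 1/2, not 1500/3000.
print(stern_brocot_approx(1500 / 3000))  # 1/2
```

Because mediants stay in lowest terms and converge towards the target, the first acceptable fraction found is a simple one, which is exactly what identification of the probabilities requires.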
Iterated logarithm
With probability 1, for a co-finite number of values of n we have:
|c(x)/n − p| < √(2λ · log log n / n), with λ > 1