Logic-based probabilistic modeling language. Syntax: Prolog + msw/2 (random choice). Pragmatics: (very) high-level modeling language



2 Logic-based probabilistic modeling language
Turing machine with statistically learnable state transitions
Syntax: Prolog + msw/2 (random choice)
  Variables, terms, predicates, etc. are available for probabilistic modeling
Semantics: distribution semantics
  A program DB defines a probability measure $P_{DB}(\cdot)$ on least Herbrand models
Pragmatics: (very) high-level modeling language
  Just describe probabilistic models declaratively
Implementation: currently on top of B-Prolog (tabled search)
  Single data structure: explanation graphs, dynamic programming

3 [Figure: timeline of PRISM development. Labels: formal semantics, distribution semantics, EM learning, tabled search, linear tabling, negation (negative goals), open source, modeling environment, variational Bayes, belief propagation, MCMC, Viterbi learning, nonparametric Bayes. Themes: ease of modeling, Bayesian approach, BN subsumed. Version and year labels: Prism1.8, Prism1.9, 2008, Prism2.0, 2010.]
Download the PRISM package at

4 Declarative modeling: just specify your model
Inventing or implementing new algorithms is not required
Statistical relational learning: eliminates the pain of building and testing similar models
Many types of inference and learning at no extra cost:
  exact probability computation, MLE, MAP, VB, MCMC, Viterbi, BIC, Cheeseman-Stutz, VFE
[Figure: without PRISM, Model 1 .. Model n each need their own algorithm Alg 1 .. Alg n; with PRISM, Model 1 .. Model n share Alg 1 .. Alg m, with m << n]

5 Bayes inference
MCMC for PCFGs
Generalization for PRISM
MH sampling
MCMC examples
Marginal likelihood
Model selection

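6 Introduce a prior: $P(x,y,\theta) = P(x,y \mid \theta)\,p(\theta)$
  x: hidden (parse tree), y: data (sentence), θ: parameters, p(θ): prior distribution
Motivation: data sparseness, model selection
Basic tasks:
  marginal likelihood (used for model selection): $P(y) = \sum_x \int P(x,y \mid \theta)\,p(\theta)\,d\theta$
  a posteriori distribution: $P(\theta \mid y) = \sum_x P(x,y \mid \theta)\,p(\theta) / p(y)$
  prediction: $P(x \mid y) = \int P(x \mid \theta)\,p(\theta \mid y)\,d\theta$
  Viterbi: $x^* = \arg\max_x P(x \mid y)$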

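7 Rule probabilities have prior distributions
Dirichlet prior, for $R(A) = \{A \to \beta_1, \ldots, A \to \beta_k\}$:
  $P_D(\theta_A \mid \alpha_A) \propto \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}$
  $P_D(\theta \mid \alpha) = \prod_{A \in N} P_D(\theta_A \mid \alpha_A)$, where $\theta = \{\theta_A\}_{A \in N}$
Tree, sentence and corpus probabilities:
  $P_G(t \mid \theta) = \prod_{r \in R} \theta_r^{f(r,t)}$, where f(r,t) = frequency of r in a parse tree t
  $P_G(w \mid \theta) = \sum_{t:\,\mathrm{yield}(t)=w} P_G(t \mid \theta)$
  $P_G(\mathbf{w} \mid \theta) = \prod_{i=1}^{n} P_G(w_i \mid \theta)$, where the corpus is $\mathbf{w} = (w_1, \ldots, w_n)$
Posterior:
  $P(\mathbf{t},\theta \mid \mathbf{w},\alpha) \propto P_D(\theta \mid \alpha)\,P_G(\mathbf{t} \mid \theta)\,P_G(\mathbf{w} \mid \mathbf{t})$, where the parses are $\mathbf{t} = (t_1, \ldots, t_n)$
Computing the posterior: $P(\mathbf{t},\theta \mid \mathbf{w},\alpha) = P(\mathbf{t} \mid \mathbf{w},\alpha)\,P(\theta \mid \mathbf{t},\mathbf{w},\alpha)$
  $\mathbf{t} \sim P(\mathbf{t} \mid \mathbf{w},\alpha)$ can be sampled by Metropolis-Hastings sampling
  $\theta \sim P(\theta \mid \mathbf{t},\mathbf{w},\alpha) = P(\theta \mid \mathbf{t},\alpha)$ is easy (a draw from the posterior Dirichlet)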
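The two easy pieces of this slide can be sketched in Python: the tree probability as a product of rule probabilities raised to their frequencies, and a draw of the rule probabilities θ from the posterior Dirichlet with hyperparameters α plus rule counts. The toy grammar S → S S | a, the rule names and the function names are illustrative assumptions, not part of the talk.

```python
import random

# Hypothetical toy grammar (not from the slides): S -> S S | a
RULES_BY_LHS = {"S": ["S->S S", "S->a"]}

def tree_prob(rule_counts, theta):
    """P_G(t | theta): product over rules of theta_r ** f(r, t)."""
    p = 1.0
    for rule, freq in rule_counts.items():
        p *= theta[rule] ** freq
    return p

def sample_posterior_theta(rule_counts, alpha, rules_by_lhs, rng):
    """Draw theta ~ Dirichlet(alpha + counts), independently per nonterminal.

    A Dirichlet draw is obtained by normalizing independent Gamma draws.
    """
    theta = {}
    for rules in rules_by_lhs.values():
        draws = [rng.gammavariate(alpha[r] + rule_counts.get(r, 0), 1.0)
                 for r in rules]
        total = sum(draws)
        for rule, g in zip(rules, draws):
            theta[rule] = g / total
    return theta

# Any parse of "a a a" uses S -> S S twice and S -> a three times.
counts = {"S->S S": 2, "S->a": 3}
alpha = {"S->S S": 1.0, "S->a": 1.0}
print(tree_prob(counts, {"S->S S": 0.4, "S->a": 0.6}))
print(sample_posterior_theta(counts, alpha, RULES_BY_LHS, random.Random(0)))
```

Sampling one Dirichlet per nonterminal mirrors the product form of the prior over nonterminals.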

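8 M. Johnson et al. proposed an MCMC algorithm for $P(\mathbf{t} \mid \mathbf{w},\alpha)$ [Johnson et al. 07a]
The basic idea: the corpus $\mathbf{w} = (w_1, \ldots, w_n)$ has state $\mathbf{t} = (t_1, \ldots, t_n)$
Repeat:
  Choose the i-th sentence $w_i$ randomly from $\{w_1, \ldots, w_n\}$
  Sample a new parse tree $t_i' \sim P(t_i \mid w_i, \theta^*)$, where for each production rule $r = A \to \ldots$ in G
    $\theta^*_r = \dfrac{f_r(\mathbf{t}_{-i}) + \alpha_r}{\sum_{r' \in R(A)} \left(f_{r'}(\mathbf{t}_{-i}) + \alpha_{r'}\right)} = E[\theta_r \mid \mathbf{t}_{-i}, \alpha]$   ($\mathbf{t}_{-i}$: trees except $t_i$)
  The state change $t_i \to t_i'$ occurs with acceptance probability
    $A(t_i, t_i') = \min\left\{1,\ \dfrac{P(t_i' \mid w_i, \mathbf{t}_{-i}, \alpha)\,P(t_i \mid w_i, \theta^*)}{P(t_i \mid w_i, \mathbf{t}_{-i}, \alpha)\,P(t_i' \mid w_i, \theta^*)}\right\}$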
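As a rough Python sketch of the two formulas on this slide: the proposal parameters are the posterior-mean rule probabilities given the other trees, and the acceptance probability is the usual Metropolis-Hastings ratio for an independence proposal, evaluated here in log space. The function names and the toy rule names are hypothetical, and the log-probabilities are assumed to be supplied by the caller.

```python
import math

def expected_theta(counts, alpha, rules_by_lhs):
    """theta*_r = (f_r + alpha_r) / sum over rules r' with the same left-hand side."""
    theta = {}
    for rules in rules_by_lhs.values():
        total = sum(counts.get(r, 0) + alpha[r] for r in rules)
        for r in rules:
            theta[r] = (counts.get(r, 0) + alpha[r]) / total
    return theta

def acceptance(log_target_new, log_target_old, log_proposal_new, log_proposal_old):
    """min{1, target(new) * proposal(old) / (target(old) * proposal(new))}."""
    log_ratio = (log_target_new - log_target_old) - (log_proposal_new - log_proposal_old)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

# Hypothetical rule counts from all trees except the i-th one.
rules_by_lhs = {"S": ["S->S S", "S->a"]}
theta_star = expected_theta({"S->S S": 2, "S->a": 5},
                            {"S->S S": 1.0, "S->a": 1.0}, rules_by_lhs)
print(theta_star)
print(acceptance(math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.5)))
```

Because the same sentence conditions both trees, normalizing constants cancel in the ratio, so unnormalized log-probabilities are sufficient inputs.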

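9 Sample a new parse tree $t_i' \sim P(t_i \mid w_i, \theta^*)$
Compute inside probabilities for $w_i$ using $\theta^*$
Start with S(0,n), where $w_i = u_1 \ldots u_n$, and recursively, given A(i,k), choose $A(i,k) \to B(i,j)\,C(j,k)$ with probability
  $\dfrac{\theta^*_{A \to BC}\,P_G(B \Rightarrow^* w_{i,j})\,P_G(C \Rightarrow^* w_{j,k})}{P_G(A \Rightarrow^* w_{i,k})}$
The probability of sampling $t_i'$ equals
  $\prod_{r \in R} (\theta^*_r)^{f(r,t_i')} / P_G(S \Rightarrow^* w_i \mid \theta^*) = P(t_i' \mid \theta^*) / P(w_i \mid \theta^*) = P(t_i' \mid w_i, \theta^*)$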
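A minimal Python sketch of this procedure for a grammar in Chomsky normal form: inside probabilities are filled bottom-up, then a tree is drawn top-down by picking each split in proportion to the product of the rule probability and the two inside probabilities. The toy grammar S → S S (0.4) | a (0.6) and all names are assumptions for illustration.

```python
import random
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (not from the slides).
BINARY = {("S", "S", "S"): 0.4}   # (A, B, C) -> probability of A -> B C
LEXICAL = {("S", "a"): 0.6}       # (A, word) -> probability of A -> word
WORDS = ["a", "a", "a"]

def inside(words, binary, lexical):
    """beta[A, i, k] = probability that A derives words[i:k]."""
    n = len(words)
    beta = defaultdict(float)
    for i, word in enumerate(words):
        for (a, w), p in lexical.items():
            if w == word:
                beta[a, i, i + 1] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            for (a, b, c), p in binary.items():
                for j in range(i + 1, k):
                    beta[a, i, k] += p * beta[b, i, j] * beta[c, j, k]
    return beta

def sample_tree(a, i, k, words, beta, binary, rng):
    """Draw a subtree rooted at A(i,k) with probability proportional to its weight."""
    if k == i + 1:
        return (a, words[i])
    options, weights = [], []
    for (lhs, b, c), p in binary.items():
        if lhs != a:
            continue
        for j in range(i + 1, k):
            w = p * beta[b, i, j] * beta[c, j, k]
            if w > 0.0:
                options.append((b, c, j))
                weights.append(w)
    b, c, j = rng.choices(options, weights=weights)[0]
    return (a, sample_tree(b, i, j, words, beta, binary, rng),
            sample_tree(c, j, k, words, beta, binary, rng))

def leaves(tree):
    """Yield of a tree built from (A, word) and (A, left, right) tuples."""
    if len(tree) == 2:
        return [tree[1]]
    return leaves(tree[1]) + leaves(tree[2])

beta = inside(WORDS, BINARY, LEXICAL)
tree = sample_tree("S", 0, len(WORDS), WORDS, beta, BINARY, random.Random(0))
print(beta["S", 0, 3], tree)
```

The division by the inside probability of the parent is implicit: `random.choices` normalizes the weights.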


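10

    btype(X):- gtype(Gf,Gm), pg_table(X,[Gf,Gm]).
    pg_table(X,GT):-
        ( (X=a;X=b),(GT=[X,o];GT=[o,X];GT=[X,X])
        ; X=o,GT=[o,o]
        ; X=ab,(GT=[a,b];GT=[b,a]) ).
    gtype(Gf,Gm):- msw(abo,Gf),msw(abo,Gm).

[Figure: family tree with labels father (genes a, b; type AB), mother (genes a, o; type A), child (genes b, o; type B)]
msw(abo,Gf) is a probabilistic switch; its parameter is $P_{msw}(\mathrm{msw(abo,a)}=1) = \theta_{(abo,a)} = 0.3$, ...
The program defines the joint distribution $P_{DB}(\mathrm{msw(abo,a)}=x_1, \mathrm{msw(abo,b)}=x_2, \mathrm{msw(abo,o)}=x_3, \mathrm{btype(a)}=y_1, \mathrm{btype(b)}=y_2, \mathrm{btype(ab)}=y_3, \mathrm{btype(o)}=y_4)$
$P_{DB}(\mathrm{btype(a)}=1) = 0.4$ (parameter learning goes in the inverse direction)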
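To make the forward direction concrete, here is a small Python sketch (not PRISM code) that enumerates the two gene choices and sums their probabilities per blood type. The gene frequencies 0.5, 0.2, 0.3 are the ones used in the probfi example later in the slides; the function names are illustrative.

```python
from itertools import product

def phenotype(gene_f, gene_m):
    """Blood type determined by a pair of genes, as in pg_table/2."""
    genes = {gene_f, gene_m}
    if genes == {"a", "b"}:
        return "ab"
    if "a" in genes:
        return "a"
    if "b" in genes:
        return "b"
    return "o"

def btype_distribution(theta):
    """Sum theta[gf] * theta[gm] over all gene pairs giving each blood type."""
    dist = {"a": 0.0, "b": 0.0, "o": 0.0, "ab": 0.0}
    for gene_f, gene_m in product(theta, repeat=2):
        dist[phenotype(gene_f, gene_m)] += theta[gene_f] * theta[gene_m]
    return dist

dist = btype_distribution({"a": 0.5, "b": 0.2, "o": 0.3})
print(dist)
```

Parameter learning runs the other way: given observed blood types, it estimates the gene frequencies that make those observations most probable.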

11 PCFG                                PRISM
   Sentence w                          Goal G
   Parse tree t                        Explanation $e = msw_1 \wedge \cdots \wedge msw_m$
   A(i,k) → B(i,j) C(j,k),             A(i,k) ⇔ (msw(a,[b,c]) ∧ B(i,j) ∧ C(j,k))
   A(i,k) → C(i,m) D(m,k)                     ∨ (msw(a,[c,d]) ∧ C(i,m) ∧ D(m,k))
   Inside algorithm                    Generalized inside algorithm
Generalization is straightforward
We sample $(e_1, \ldots, e_K) \sim P(e_1, \ldots, e_K \mid G_1, \ldots, G_K)$
Correctness is provable w.r.t. the distribution semantics of PRISM
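The generalized inside computation can be sketched in Python as a memoized sum-product over an explanation graph: each goal is a disjunction of conjunctions of subgoals and switch instances. The graph below is transcribed from the probfi(btype(a)) output shown later in the slides; the dictionary encoding and function name are assumptions, and the sketch relies on the disjuncts being mutually exclusive and the conjuncts independent.

```python
# Explanation graph for btype(a): goal -> list of disjuncts, each a list of conjuncts.
GRAPH = {
    "btype(a)": [["gtype(a,a)"], ["gtype(a,o)"], ["gtype(o,a)"]],
    "gtype(a,a)": [["msw(abo,a)", "msw(abo,a)"]],
    "gtype(a,o)": [["msw(abo,a)", "msw(abo,o)"]],
    "gtype(o,a)": [["msw(abo,o)", "msw(abo,a)"]],
}
SWITCH = {"msw(abo,a)": 0.5, "msw(abo,b)": 0.2, "msw(abo,o)": 0.3}

def inside(goal, graph, switch, memo=None):
    """Probability of a goal: sum over disjuncts of the product of conjunct probabilities."""
    if memo is None:
        memo = {}
    if goal in switch:
        return switch[goal]
    if goal not in memo:
        total = 0.0
        for conjunction in graph[goal]:
            p = 1.0
            for sub in conjunction:
                p *= inside(sub, graph, switch, memo)
            total += p
        memo[goal] = total
    return memo[goal]

print(inside("btype(a)", GRAPH, SWITCH))
```

The memo table plays the role of tabling: each subgoal is evaluated once even if it is shared by several explanations.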

12 Given iid goals $G = G_1, \ldots, G_K$
Choose i randomly from {1,..,K}
Compute $\theta^* \approx E[\theta \mid G_{-i}]$ using pseudo counts obtained from PRISM's VB learning routine
Sample $e_i' \sim P(e_i \mid G_i, \theta^*)$
  Compute $G_i \Leftrightarrow (msw_1 \wedge H_1) \vee \cdots \vee (msw_m \wedge H_m)$ using PRISM's probfi
  Recursively choose a disjunct with probability $P_{DB}(msw_k \wedge H_k) / P_{DB}(G)$
State change $e_i \to e_i'$ with acceptance probability $A(e_i, e_i')$

    % values(abo,[a,b,o],[0.5,0.2,0.3]).
    ?- probfi(btype(a))
    btype(a) [0.55]
      <=> gtype(a,a) [0.25] {0.25}
        v gtype(a,o) [0.15] {0.15}
        v gtype(o,a) [0.15] {0.15}
    gtype(a,a) [0.25]
      <=> msw(abo,a) [0.5] & msw(abo,a) [0.5] {0.25}
    gtype(a,o) [0.15]
      <=> msw(abo,a) [0.5] & msw(abo,o) [0.3] {0.15}
    gtype(o,a) [0.15]
      <=> msw(abo,o) [0.3] & msw(abo,a) [0.5] {0.15}
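The recursive disjunct choice can be illustrated with a short Python sketch that samples an explanation for btype(a) from the annotated graph printed by probfi above. The disjunct probabilities are copied from that output; the data layout and function name are illustrative assumptions rather than PRISM internals.

```python
import random

# goal -> list of (disjunct probability, conjuncts), copied from the probfi output.
GRAPH = {
    "btype(a)": [(0.25, ["gtype(a,a)"]), (0.15, ["gtype(a,o)"]), (0.15, ["gtype(o,a)"])],
    "gtype(a,a)": [(0.25, ["msw(abo,a)", "msw(abo,a)"])],
    "gtype(a,o)": [(0.15, ["msw(abo,a)", "msw(abo,o)"])],
    "gtype(o,a)": [(0.15, ["msw(abo,o)", "msw(abo,a)"])],
}

def sample_explanation(goal, graph, rng):
    """Return the list of switch instances of one sampled explanation."""
    if goal not in graph:              # a switch instance: part of the explanation
        return [goal]
    weights = [p for p, _ in graph[goal]]
    _, conjuncts = rng.choices(graph[goal], weights=weights)[0]
    explanation = []
    for sub in conjuncts:
        explanation.extend(sample_explanation(sub, graph, rng))
    return explanation

rng = random.Random(0)
print(sample_explanation("btype(a)", GRAPH, rng))
```

Each disjunct of btype(a) is chosen with its probability divided by 0.55, so the explanation via gtype(a,a) is drawn about 45% of the time.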


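14 Estimate $P(G \mid \alpha) = P(G_1, \ldots, G_K \mid \alpha)$ by PRISM MH sampling
Recall, with $E = (e_1, \ldots, e_K)$:
  $P(G) = \dfrac{P(\theta^*, G)}{P(\theta^* \mid G)} = \dfrac{P(\theta^*)\,P(G \mid \theta^*)}{\sum_E P(\theta^* \mid E)\,P(E \mid G)}$
Sample $E_1, \ldots, E_T \sim P(E \mid G)$ by MCMC and approximate
  $\sum_E P(\theta^* \mid E)\,P(E \mid G) \approx \dfrac{P(\theta^* \mid E_1) + \cdots + P(\theta^* \mid E_T)}{T}$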
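The identity can be checked on a tiny instance. The Python sketch below uses the blood type model with a Dirichlet prior on the gene frequencies and, instead of MCMC samples, weights every explanation tuple by its exact posterior probability, so the estimate must coincide with the marginal likelihood obtained by brute-force enumeration. The goals, the prior and all function names are assumptions for illustration; with real MCMC output each sample would simply get weight 1/T.

```python
import math
from itertools import product

GENES = ("a", "b", "o")
# Explanations (ordered gene pairs) of each observable blood type.
EXPLANATIONS = {
    "a": [("a", "a"), ("a", "o"), ("o", "a")],
    "b": [("b", "b"), ("b", "o"), ("o", "b")],
    "o": [("o", "o")],
    "ab": [("a", "b"), ("b", "a")],
}

def gene_counts(expls):
    """How often each gene is chosen across a tuple of explanations."""
    return [sum(pair.count(g) for pair in expls) for g in GENES]

def log_dirichlet_pdf(theta, alpha):
    norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))

def joint_prob(expls, alpha):
    """P(E | alpha): probability of the switch outcomes with theta integrated out."""
    c = gene_counts(expls)
    log_p = math.lgamma(sum(alpha)) - math.lgamma(sum(alpha) + sum(c))
    log_p += sum(math.lgamma(a + k) - math.lgamma(a) for a, k in zip(alpha, c))
    return math.exp(log_p)

def exact_marginal(goals, alpha):
    """P(G | alpha) by enumerating every combination of explanations."""
    return sum(joint_prob(e, alpha)
               for e in product(*(EXPLANATIONS[g] for g in goals)))

def estimate_marginal(goals, alpha, theta_star, weighted_samples):
    """P(theta*) P(G | theta*) / sum_t w_t P(theta* | E_t)."""
    p = dict(zip(GENES, theta_star))
    prior = math.exp(log_dirichlet_pdf(theta_star, alpha))
    likelihood = math.prod(sum(p[x] * p[y] for x, y in EXPLANATIONS[g])
                           for g in goals)
    posterior = sum(
        w * math.exp(log_dirichlet_pdf(
            theta_star, [a + k for a, k in zip(alpha, gene_counts(e))]))
        for e, w in weighted_samples)
    return prior * likelihood / posterior

goals = ["a", "ab", "o"]
alpha = [1.0, 1.0, 1.0]
p_goals = exact_marginal(goals, alpha)
weights = [(e, joint_prob(e, alpha) / p_goals)
           for e in product(*(EXPLANATIONS[g] for g in goals))]
estimate = estimate_marginal(goals, alpha, (0.5, 0.2, 0.3), weights)
print(p_goals, estimate)
```

Any θ* with positive density works in principle; a point of high posterior density is the usual choice because it keeps the variance of the sampled average low.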

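15 Note: put $q(x,\theta) = q(x)\,q(\theta)$ and vary $q(x)$ and $q(\theta)$ to make the right-hand side (VFE, variational free energy) as close as possible to the left-hand side. Then $q(x) \approx p(x \mid y)$ and $q(\theta) \approx p(\theta \mid y)$. This procedure is VB-EM.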
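The inequality this note refers to did not survive in the text of the slide. Assuming it is the standard variational lower bound on the log marginal likelihood (an assumption, not confirmed by the source), it would read as follows, with the sign convention in which the right-hand side is the quantity being maximized:

```latex
\log p(y) \;\ge\; \sum_x \int q(x,\theta)\,
  \log \frac{p(x,y,\theta)}{q(x,\theta)}\, d\theta ,
\qquad
\log p(y) - \sum_x \int q(x,\theta)\,
  \log \frac{p(x,y,\theta)}{q(x,\theta)}\, d\theta
  \;=\; \mathrm{KL}\!\left( q(x,\theta) \,\|\, p(x,\theta \mid y) \right) \;\ge\; 0 .
```

Under this reading, tightening the bound over factorized $q(x)\,q(\theta)$ is the same as minimizing the KL divergence to the true posterior, which is why the optimal factors approximate $p(x \mid y)$ and $p(\theta \mid y)$.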


17 Two states, two outputs

18 ATR corpus (10,000), 860 rule clauses

19 ATR corpus (10,000), 2,500 rule clauses

20 $M^* = \arg\max_M P(G_1, \ldots, G_K \mid M)$, where M = M(parameters, structure)
Two examples:
  Number of clusters
  Length of profile HMM

21 Predict the class from 16 attributes (vote record)

    republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
    republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
    democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
    democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
    democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
    democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
    democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
    republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
    republican,n,y,n,y,y,y,n,n,n,y,n,y,y,y,?,n

D: 435 data, 90.8% accuracy (5-fold CV)
Naïve Bayes: class node C with children $A_1, \ldots, A_{16}$
  $P(C, A_1, \ldots, A_{16}) = P(C)\,P(A_1 \mid C) \cdots P(A_{16} \mid C)$
  C = republican, democrat; $A_i$ = y, n
  Learn $P(A_1 \mid C), \ldots, P(A_{16} \mid C)$ from D
  Predict C for unknown $A_1, \ldots, A_{16}$ by $C^* = \arg\max_C P(C \mid A_1, \ldots, A_{16})$
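For comparison with the PRISM program, here is a minimal Python sketch of the same naive Bayes classifier. Missing votes ("?") are skipped, which corresponds to summing the attribute out, and probabilities are estimated with add-one pseudo counts. The smoothing, the three-attribute toy rows and the function names are assumptions; the rows are not taken from the data set.

```python
import math

VALUES = ("y", "n")

def train(rows):
    """Count classes and (attribute index, class, value) triples, ignoring '?'."""
    class_counts, attr_counts = {}, {}
    for label, votes in rows:
        class_counts[label] = class_counts.get(label, 0) + 1
        for j, v in enumerate(votes):
            if v != "?":
                key = (j, label, v)
                attr_counts[key] = attr_counts.get(key, 0) + 1
    return class_counts, attr_counts

def predict(model, votes, pseudo=1.0):
    """argmax_C P(C) * prod_j P(A_j | C), computed in log space."""
    class_counts, attr_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        log_p = math.log(n / total)
        for j, v in enumerate(votes):
            if v == "?":
                continue               # unknown vote: marginalized out
            seen = sum(attr_counts.get((j, label, u), 0) for u in VALUES)
            hit = attr_counts.get((j, label, v), 0)
            log_p += math.log((hit + pseudo) / (seen + pseudo * len(VALUES)))
        scores[label] = log_p
    return max(scores, key=scores.get)

# Hypothetical toy rows with three attributes.
rows = [
    ("republican", ["n", "y", "n"]),
    ("republican", ["n", "y", "?"]),
    ("democrat", ["y", "n", "y"]),
    ("democrat", ["y", "?", "y"]),
]
model = train(rows)
print(predict(model, ["n", "y", "n"]), predict(model, ["y", "?", "?"]))
```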


22 Modeling part

    values(class,[democrat,republican]).   % class labels
    values(attr(_,_),[y,n]).               % all attributes have two values: y or n

    nbayes(C,Vals):- msw(class,C),nbayes(1,C,Vals).
    nbayes(_,_,[]):-!.
    nbayes(J,C,[V|Vals]):-
        choose(J,C,V),
        J1 is J+1,!,                       % cut is ok
        nbayes(J1,C,Vals).

    choose(J,C,V):-
        ( nonvar(V) -> msw(attr(J,C),V)
        ; msw(attr(J,C),_)
        ).

Utility part

    %%%% Utilities
    vote_learn:- load_data_file(Gs), learn(Gs).

    %% Batch routine for N-fold cross validation
    vote_cv(N):-
        random_set_seed(81729),
        load_data_file(Gs0),               % Load the entire data
        random_shuffle(Gs0,Gs),            % Randomly reorder the data
        numlist(1,N,Js),                   % Get Js = [1,...,N] (B-Prolog built-in)
        maplist(J,Rate,vote_cv(Gs,J,N,Rate),Js,Rates),
        avglist(Rates,AvgRate),            % Get the avg. of the precisions
        maplist(J,Rate,format("test #~d: ~2f%~n",[J,Rate*100]),Js,Rates),
        format("average: ~2f%~n",[AvgRate*100]).

    %% Subroutine for learning and testing for J-th split data (J = 1...N)
    vote_cv(Gs,J,N,Rate):-
        format("<<<< Test #~d >>>>~n",[J]),
        separate_data(Gs,J,N,Gs0,Gs1),
        learn(Gs0),
        maplist(nbayes(C,Vs),R,(viterbig(nbayes(C0,Vs)),(C0==C->R=1;R=0)),Gs1,Rs),
        avglist(Rs,Rate),
        format("done (~2f%).~n~n",[Rate*100]).

    separate_data(Data,J,N,Learn,Test):-
        length(Data,L),
        L0 is L*(J-1)//N,                  % L0: offset of the test data (// - integer division)
        L1 is L*(J-0)//N-L0,               % L1: size of the test data
        splitlist(Learn0,Rest,Data,L0),    % Length of Learn0 = L0
        splitlist(Test,Learn1,Rest,L1),    % Length of Test = L1
        append(Learn0,Learn1,Learn).

    load_data_file(Gs):-
        load_csv('uci/house-votes-84.data',Gs0,[missing('?')]),
            % '?' in the data will be converted into an anonymous variable (_)
        maplist(csvrow([C|Vs]),nbayes(C,Vs),true,Gs0,Gs).
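The fold arithmetic in separate_data can be hard to read in Prolog, so here is the same computation as a small Python sketch: the J-th test block starts at offset L*(J-1)//N and has length L*J//N minus that offset, and everything else is training data. The Python function is an illustration, not part of the PRISM program.

```python
def separate_data(data, j, n):
    """Split data into (learn, test) for the j-th of n folds, with j in 1..n."""
    size = len(data)
    offset = size * (j - 1) // n           # L0: offset of the test data
    length = size * j // n - offset        # L1: size of the test data
    test = data[offset:offset + length]
    learn = data[:offset] + data[offset + length:]
    return learn, test

data = list(range(10))
for j in (1, 2, 3):
    learn, test = separate_data(data, j, 3)
    print(j, test, learn)
```

Using integer division on the cumulative boundary means the fold sizes differ by at most one and every item lands in exactly one test block.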

23 Hidden node HC added
  $P(C, A_1, \ldots, A_{16}) = \sum_{HC} P(C, A_1, \ldots, A_{16}, HC)$
[Figure: network over the nodes C, HC, $A_1$, ..., $A_{16}$]
  C = republican, democrat; $A_i$ = yes, no (vote record)

24 MCMC is now available for any PRISM program
  It is applicable to model selection (structure learning) and Viterbi inference
But it takes time and memory (because of the Prolog implementation)
Oftentimes VB seems a reasonable choice
