Computational Intelligence Seminar (8 November, 2005, Waseda University, Tokyo, Japan)

Statistical Physics Approaches to Genetic Algorithms

Joe Suzuki^1 (Department of Mathematics, Osaka University, 1-1 Machikaneyama, Toyonaka, Osaka 560-0043)

Abstract: We address genetic algorithms (GAs) in a different way. The GA selects individuals with high fitness values, which corresponds to decreasing the temperature of a Boltzmann distribution at each generation; eventually only the individuals with the highest fitness values survive. However, this cannot be computed directly because the length L of each individual is assumed to be large (otherwise exhaustive search could be executed). We consider several strategies for estimating the Boltzmann distribution without computation exponential in L. How to apply mean-field/Bethe/Kikuchi approximations and factorizations of the distribution is discussed. The same problem can be solved by differentiating the Kullback-Leibler information under several constraints, as is done in probabilistic inference based on generalized belief propagation.

1 Introduction

The genetic algorithm (GA) can be viewed as an Estimation of Distribution Algorithm (EDA): from the M individuals of the current generation, N (\le M) are selected, a distribution (for example a Bayesian network, BN) is estimated from them, and M new individuals are sampled from the estimate. Selection by the fitness f(i) amounts to multiplying the population distribution by e^{\beta_t \log f(i)} = f(i)^{\beta_t}, i.e. to a Boltzmann distribution whose inverse temperature \beta_t grows with the generation t. Normalizing that distribution requires the sum \sum_{i \in \{0,1\}^L} e^{\beta_t \log f(i)} over 2^L terms, which is infeasible for the large L that the GA is intended for. Following Heinz Mühlenbein, "The Estimation of Distributions and the Minimum Relative Entropy Principle" (to appear in Evolutionary Computation, 2005), this note discusses how the GA/EDA can estimate the Boltzmann distribution without computation exponential in L.

^1 E-mail: suzuki@math.sci.osaka-u.ac.jp
In particular, for large L even a BN must have its structure restricted (otherwise the number of parameters grows exponentially in L), and pairwise (Bethe-type) approximations appear. The rest of this note is organized as follows:

2. the GA: definition, one generation, and its Markov chain;
3. the GA viewed as lowering the temperature of a Boltzmann distribution, and Estimation of Distribution Algorithms (EDA);
4. Bayesian networks (BN);
5. the EDA based on the simplest BN (independent bits: UMDA);
6. the EDA based on a tree-structured BN (Chow-Liu);
7. Heinz Mühlenbein's Factorized Distribution Algorithm (FDA);
8. estimating the distribution by minimizing D(q||p) (mean-field and Bethe approximations).

In Sections 2-6 the approximation q is fitted by minimizing the Kullback-Leibler divergence D(p||q); in Sections 7-8 it is fitted by minimizing D(q||p). Here p is the target (Boltzmann) distribution and q its approximation; D(p||q) and D(q||p) are different quantities in general.

2 Genetic algorithms

An individual is a binary string of length L: the k-th bit (k = 1, 2, \ldots, L) takes one of the 2 values 0, 1, so there are 2^L possible individuals,

J := \{ \underbrace{0\cdots00}_{L},\ \underbrace{0\cdots01}_{L},\ \ldots,\ \underbrace{1\cdots11}_{L} \} = \{0,1\}^L .

A population consists of M individuals (repetitions allowed). The fitness is a function f : J \to \mathbb{R}_+, and f(i) is the fitness of individual i \in J.

2.1 One generation of the GA

Let c[i] denote the number of copies of i \in J in the current population. One generation applies three operations, selection, crossover and mutation; for i \in J, |i| denotes the number of ones in i. The operations are defined as follows.
Selection: individual i \in J is chosen with probability

\frac{c[i]\, f(i)}{\sum_{j \in J} c[j]\, f(j)} .   (1)

Crossover: given two individuals i, j \in J, a crossover point k \in \{1, \ldots, L-1\} is chosen at random; the first k bits of one parent are concatenated with the last L-k bits of the other, and each of the two possible offspring is kept with probability 1/2.

Mutation: given k \in J, each of its L bits is flipped independently with probability \mu > 0; equivalently, for each i \in J the flip pattern i occurs with probability \mu^{|i|}(1-\mu)^{L-|i|}, where |i| is the number of ones in i.

Algorithm 1 (one generation)
1. Choose two individuals i, j \in J by selection (1).
2. Apply crossover to i and j to obtain k \in J.
3. Apply mutation to k to obtain k' \in J.
4. Repeat 1.-3. M times; the M individuals obtained form the next population.

The procedure that iterates Algorithm 1 is the (simple) genetic algorithm (GA).

2.2 The GA as a Markov chain

The state of the population is the vector (c[i])_{i \in J} with \sum_{i \in J} c[i] = M and c[i] \ge 0, so the number of states is \binom{2^L + M - 1}{M}. For example, with L = 2, J = \{00, 01, 10, 11\} and M = 3, the 20 states (c[00], c[01], c[10], c[11]) are

(0,0,0,3), (0,0,1,2), (0,0,2,1), (0,0,3,0), (0,1,0,2), (0,1,1,1), (0,1,2,0), (0,2,0,1), (0,2,1,0), (0,3,0,0),
(1,0,0,2), (1,0,1,1), (1,0,2,0), (1,1,0,1), (1,1,1,0), (1,2,0,0), (2,0,0,1), (2,0,1,0), (2,1,0,0), (3,0,0,0).

For instance, the population {10, 11, 10} (M = 3) corresponds to the state (0,0,2,1); the order of the individuals is irrelevant, so {10, 10, 11} gives the same state.
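The following is a minimal Python sketch of Algorithm 1, operating on the population as a list of bit tuples (equivalent to the counts c[i]); the onemax-type fitness and the parameter values in the demo are assumptions made only for illustration.

import random

def select(pop, f):
    # fitness-proportionate selection, eq. (1): i is chosen with
    # probability c[i] f(i) / sum_j c[j] f(j)
    weights = [f(ind) for ind in pop]
    return random.choices(pop, weights=weights, k=1)[0]

def crossover(x, y):
    # one-point crossover: keep one of the two possible offspring with prob. 1/2
    if random.random() < 0.5:
        x, y = y, x
    k = random.randint(1, len(x) - 1)
    return x[:k] + y[k:]

def mutate(x, mu):
    # flip each bit independently with probability mu
    return tuple(b ^ 1 if random.random() < mu else b for b in x)

def one_generation(pop, f, mu):
    # Algorithm 1: repeat (selection, crossover, mutation) M times
    return [mutate(crossover(select(pop, f), select(pop, f)), mu)
            for _ in range(len(pop))]

# toy run; the onemax-type fitness and the parameters are assumptions
if __name__ == "__main__":
    L, M, mu = 5, 8, 0.02
    f = lambda ind: 1 + sum(ind)        # f : J -> R_+ (assumed for illustration)
    pop = [tuple(random.randint(0, 1) for _ in range(L)) for _ in range(M)]
    for _ in range(20):
        pop = one_generation(pop, f, mu)
    print(pop)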
Let S be the set of such states. Algorithm 1 moves a state s \in S to a state t \in S with a probability Q(t|s), so the GA is a Markov chain with transition matrix Q = (Q(t|s))_{t,s \in S}; its stationary distribution q = (q(s))_{s \in S} satisfies qQ = q. Let U \subseteq S be the set of states whose individuals all maximize f(i), i \in J. Generalize selection as follows: let g_p : \mathbb{R}_+ \to \mathbb{R}_+, p \in \mathbb{R}_+, be such that for any 0 < a < b,
1. g_p(a) < g_p(b) for every p > 0, and
2. g_p(b)/g_p(a) \to \infty as p \to \infty
(for example g_p(x) = x^p), and replace the selection probability (1) by

\frac{c[i]\, g_p(f(i))}{\sum_{j \in J} c[j]\, g_p(f(j))} .   (2)

Then

\lim_{p \to \infty} \lim_{\mu \to 0} q(s) > 0 \ \text{for } s \in U, \qquad = 0 \ \text{for } s \in S \setminus U   (3)

(De Silva-Suzuki, 2005): if selection pressure is made strong (p \to \infty in (2)) and mutation rare (\mu \to 0), the stationary distribution of the GA concentrates on populations consisting only of optimal individuals.

3 The GA as lowering the temperature

3.1 Boltzmann selection

For \beta > 0 put

g(i) := \frac{1}{\beta} \log f(i), \quad \text{i.e.}\ f(i) = e^{\beta g(i)} .   (4)

If the proportion of individual i in generation t is p_t(i), proportionate selection (1) gives (in expectation, or exactly in the infinite-population limit)

p_{t+1}(i) = \frac{p_t(i)\, e^{\beta g(i)}}{\sum_{j \in J} p_t(j)\, e^{\beta g(j)}} .   (5)

Starting from p_0(i) \propto e^{\beta_0 g(i)} with \beta_0 > 0 and putting \beta_t := \beta_0 + \beta t,

p_t(i) = \frac{e^{\beta_t g(i)}}{\sum_{j \in J} e^{\beta_t g(j)}} .   (6)

As t \to \infty, the temperature 1/\beta_t tends to 0 and p_t(i) concentrates on the maximizers of g (equivalently, of f). However, the denominator of (6) is a sum over J = \{0,1\}^L, i.e. over 2^L terms, so for the large L the GA is meant for, (5)-(6) cannot be evaluated directly.
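For small L the distribution (6) can be tabulated exhaustively, which makes the concentration effect visible; the sketch below does this for an assumed onemax-type fitness (the fitness and the values of \beta_t are illustrative assumptions, and the 2^L enumeration is exactly what becomes impossible for realistic L).

import itertools

def boltzmann(f, L, beta):
    # eq. (6) with beta = 1 (g = log f): p_t(i) proportional to f(i)**beta_t;
    # the whole of J = {0,1}^L is enumerated, which costs 2^L terms
    J = list(itertools.product([0, 1], repeat=L))
    w = [f(i) ** beta for i in J]
    Z = sum(w)
    return {i: wi / Z for i, wi in zip(J, w)}

# assumed onemax-type fitness: as beta_t grows, the maximizer 1111
# accumulates almost all of the probability
f = lambda i: 1 + sum(i)
for beta in (0.0, 1.0, 5.0, 20.0):
    p = boltzmann(f, L=4, beta=beta)
    print(beta, round(p[(1, 1, 1, 1)], 4))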
3.2 Estimation of Distribution Algorithms (EDA)

Instead of evaluating (6) exactly, the population of M individuals at generation t is used to estimate p_t, and one generation becomes:

Algorithm 2 (EDA(\beta_t))
1. From the M individuals of generation t, select N (\le M) individuals.
2. Estimate p_t(i) from the N selected individuals.
3. Sample M individuals from the estimated distribution; they form generation t+1.

Algorithms that replace the GA's (selection, crossover, mutation) by the explicit estimation in step 2 are called Estimation of Distribution Algorithms (EDAs). The following sections differ only in the class of distributions used in step 2.
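A generic skeleton of Algorithm 2 might look as follows; truncation selection in step 1 and the two hooks fit_model/sample_model are assumptions of this sketch, standing in for whatever selection scheme and distribution class (Sections 5-8) one chooses.

def eda_generation(pop, f, N, fit_model, sample_model):
    # Algorithm 2 (EDA):
    #   1. select N (<= M) of the M individuals (truncation selection is an
    #      assumed choice; any selection scheme can be substituted)
    #   2. estimate a distribution from the selected individuals
    #   3. sample M new individuals from the estimate
    M = len(pop)
    selected = sorted(pop, key=f, reverse=True)[:N]
    model = fit_model(selected)      # hook: e.g. UMDA marginals (Section 5),
                                     # a Chow-Liu tree (Section 6), an FDA model (Section 7)
    return [sample_model(model) for _ in range(M)]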
3.3 The schema theorem and Holland's differential equation

Before Markov-chain analyses such as (De Silva-Suzuki, 2005), the classical analysis of the GA was the schema theorem of John Holland ("Adaptation in Natural and Artificial Systems"). A schema H \subseteq J is the set of strings obtained by fixing some of the L positions. For the population (c_t[j])_{j \in J} at generation t, the mean fitness of H is

m(H, t) := \frac{\sum_{i \in H} c_t[i]\, f(i)}{\sum_{j \in H} c_t[j]} ,   (7)

and m(J, t) is the mean fitness of the whole population. Writing P(H,t) := \sum_{i \in H} c_t[i]/M for the proportion of the population belonging to H, the schema theorem states

E[P(H, t+1)] \ge P(H,t)\, \frac{m(H,t)}{m(J,t)}\, (1 - \varepsilon(H))\, (1 - \mu)^{d(H)} ,   (8)

where E is the expectation with respect to the next population (c_{t+1}[j])_{j \in J}, \varepsilon(H) is the probability that crossover destroys membership in H, and d(H) is the number of positions fixed by H. The statement concerns only the expectation; without E the inequality need not hold.

Neglecting crossover and mutation and passing to an infinite population (P(H,t) = \sum_{i \in H} p_t(i)), Holland's differential equation reads

\frac{dP(H,t)}{dt} = P(H,t)\, [\, m(H,t) - m(J,t)\, ] .   (9)

The annealed distribution (6) satisfies (9): take \beta_0 = 0 and \beta = 1, so that \beta_t = t and g(i) = \log f(i), and in (7) replace c_t[i]/M by p_t(i) and f by g. Indeed, from (6),

\frac{dp_t(i)}{dt} = p_t(i)\Big[ g(i) - \sum_{j \in J} g(j)\, p_t(j) \Big],

hence

\frac{dP(H,t)}{dt} = \sum_{i \in H} \frac{dp_t(i)}{dt} = \sum_{i \in H} p_t(i)\Big[ g(i) - \sum_{j \in J} g(j)\, p_t(j) \Big] = P(H,t)\,(m(H,t) - m(J,t)).

4 Bayesian networks (BN)

Let X^{(1)}, \ldots, X^{(L)} be binary random variables with values (x^{(1)}, \ldots, x^{(L)}) \in \{0,1\}^L and joint distribution P(X^{(1)} = x^{(1)}, \ldots, X^{(L)} = x^{(L)}). Fix the ordering X^{(1)} < X^{(2)} < \cdots < X^{(L)} and suppose that for each i there is a parent set \pi(i) \subseteq \{1, 2, \ldots, i-1\} (\pi(1) = \{\}) such that, given \{X^{(1)} = x^{(1)}, \ldots, X^{(i-1)} = x^{(i-1)}\}, the variable X^{(i)} depends only on \{X^{(k)}\}_{k \in \pi(i)}; that is, X^{(i)} is conditionally independent of \{X^{(k)}\}_{k \in \{1,\ldots,i-1\} \setminus \pi(i)} given \{X^{(k)}\}_{k \in \pi(i)}. Then

P(X^{(1)} = x^{(1)}, \ldots, X^{(L)} = x^{(L)}) = \prod_{i=1}^{L} P(X^{(i)} = x^{(i)} \mid X^{(1)} = x^{(1)}, \ldots, X^{(i-1)} = x^{(i-1)}) = \prod_{i=1}^{L} P(X^{(i)} = x^{(i)} \mid (X^{(k)} = x^{(k)})_{k \in \pi(i)}).

The vector \pi = (\pi(1), \pi(2), \ldots, \pi(L)) is the structure of the Bayesian network (BN). For each i = 1, \ldots, L and each assignment (x^{(k)})_{k \in \pi(i)} of the parents, \sum_{x^{(i)} \in \{0,1\}} P(X^{(i)} = x^{(i)} \mid (X^{(k)} = x^{(k)})_{k \in \pi(i)}) = 1.
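The factorization can be evaluated mechanically once the parent sets and conditional probability tables are given; the small sketch below does this for a 3-variable chain whose structure and numbers are assumptions chosen only to illustrate the product formula.

def bn_joint(x, parents, cpt):
    # P(x) = prod_i P(x^(i) | x^(pi(i)))  -- the factorization along the ordering
    # parents[i]: the parent set pi(i); cpt[i]: maps the parents' values
    # (as a tuple) to P(X^(i) = 1 | parents)
    p = 1.0
    for i in range(len(x)):
        cond = tuple(x[j] for j in parents[i])
        p1 = cpt[i][cond]
        p *= p1 if x[i] == 1 else 1.0 - p1
    return p

# a 3-variable chain with assumed numbers: pi(1) = {}, pi(2) = {1}, pi(3) = {2}
parents = {0: [], 1: [0], 2: [1]}
cpt = {0: {(): 0.6},
       1: {(0,): 0.2, (1,): 0.9},
       2: {(0,): 0.5, (1,): 0.3}}
print(bn_joint((1, 1, 0), parents, cpt))   # 0.6 * 0.9 * (1 - 0.3) = 0.378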
The conditional probabilities

\theta_i = \big( P(X^{(i)} = x^{(i)} \mid (X^{(k)} = x^{(k)})_{k \in \pi(i)}) \big)_{x^{(i)} \in \{0,1\},\ (x^{(k)})_{k \in \pi(i)}}

are the parameters, \theta = (\theta_1, \ldots, \theta_L), and a BN is the pair (structure \pi, parameters \theta). Given M samples x \in \{0,1\}^{LM},

X^{(1)} = x^{(1)}_1, \; X^{(2)} = x^{(2)}_1, \; \ldots, \; X^{(L)} = x^{(L)}_1
X^{(1)} = x^{(1)}_2, \; X^{(2)} = x^{(2)}_2, \; \ldots, \; X^{(L)} = x^{(L)}_2
\cdots
X^{(1)} = x^{(1)}_M, \; X^{(2)} = x^{(2)}_M, \; \ldots, \; X^{(L)} = x^{(L)}_M,

the parameters \hat\theta are estimated from the data. If \pi(j) = \{1, 2, \ldots, j-1\} for every j (the complete structure), variable j carries 2^{j-1} parameters, so the total is \sum_{j=1}^{L} 2^{j-1} = 2^L - 1: estimating the full BN is as hard as estimating the joint distribution itself. For the EDA to be feasible for large L, the structure must be restricted so that the number of parameters stays small while the dependencies induced by the fitness f(i), i \in J, are still captured; the next sections consider such restricted BNs.

5 The EDA with independent bits (UMDA)

For i = (x^{(1)}, \ldots, x^{(L)}) \in \{0,1\}^L, let

q_t^{(k)}(x^{(k)}) = \sum_{(x^{(j)})_{j \ne k}} p_t(x^{(1)}, \ldots, x^{(L)})

be the k-th marginal of p_t, for k = 1, \ldots, L and x^{(k)} = 0, 1. It is estimated by

\hat q_t^{(k)}(1) := \frac{c_1 + a_1}{M + a_1 + a_0} ,   (10)

where c_1 is the number of individuals at generation t with x^{(k)} = 1 and a_0, a_1 > 0 are constants (for example a_0 = a_1 = 1/2; the choice matters little for large M), and \hat q_t^{(k)}(0) = 1 - \hat q_t^{(k)}(1). In Algorithm 2, step 2 becomes the computation of (10) for k = 1, \ldots, L, and step 3 samples each bit x^{(k)} independently. This is the Univariate Marginal Distribution Algorithm (UMDA): p_t(i) is approximated by the product q_t(i) = \prod_{k=1}^{L} q_t^{(k)}(x^{(k)}), estimated by \hat q_t(i) = \prod_{k=1}^{L} \hat q_t^{(k)}(x^{(k)}). Among all product distributions, q_t is the one minimizing the Kullback-Leibler divergence

D(p_t \| q_t) = \sum_{i \in J} p_t(i) \log \frac{p_t(i)}{q_t(i)} ,

and computing the \hat q_t^{(k)}(x^{(k)}) requires only O(L) work per individual.
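A minimal sketch of the UMDA estimate (10) and of the corresponding sampling step follows; the denominator uses the number of individuals the estimate is computed from (M in (10)), and the tiny data set is an assumption for illustration.

import random

def umda_fit(individuals, a0=0.5, a1=0.5):
    # eq. (10): qhat_k(1) = (c_1 + a_1) / (n + a_1 + a_0), where c_1 is the
    # number of individuals with x^(k) = 1 and n is the number of individuals
    # the estimate is computed from (M in the text)
    n, L = len(individuals), len(individuals[0])
    return [(sum(ind[k] for ind in individuals) + a1) / (n + a0 + a1)
            for k in range(L)]

def umda_sample(q):
    # step 3 of Algorithm 2 for UMDA: each bit sampled independently
    return tuple(1 if random.random() < qk else 0 for qk in q)

q = umda_fit([(1, 1, 0), (1, 0, 0), (1, 1, 1)])   # assumed toy data
print(q)                                          # [0.875, 0.625, 0.375]
print(umda_sample(q))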
Let W(t) = \sum_{i \in J} q_t(i)\, g(i) be the mean of g under q_t and write q_t^{(k)} := q_t^{(k)}(x^{(k)} = 1). Then UMDA follows Wright's equation

q_{t+1}^{(k)} = q_t^{(k)} + q_t^{(k)} (1 - q_t^{(k)})\, \frac{\partial W(t) / \partial q_t^{(k)}}{W(t)} ,   (11)

i.e. the marginals perform a gradient ascent of the average fitness W(t).

6 The EDA with a tree-structured BN (Chow-Liu)

Let V = \{1, 2, \ldots, L\} and let E \subseteq \{\{j,k\} \mid j, k \in V,\ j \ne k\} be an edge set containing no cycle (a tree or forest on V). Approximate p_t by

q_t(i) = \frac{\prod_{\{k,l\} \in E} q_t^{(k,l)}(x^{(k)}, x^{(l)})}{\prod_{k \in V} [\,q_t^{(k)}(x^{(k)})\,]^{N(k)-1}} ,   (12)

where N(k) is the number of edges e \in E containing k, and

q_t^{(k,l)}(x^{(k)}, x^{(l)}) = \sum_{(x^{(j)})_{j \ne k,l}} p_t(x^{(1)}, \ldots, x^{(L)})

is the pairwise marginal of p_t (this factorization has the same form as the Bethe approximation of Section 8.2). As in (10), it is estimated by

\hat q_t^{(k,l)}(j,h) := \frac{c_{j,h} + a_{j,h}}{M + a_{0,0} + a_{0,1} + a_{1,0} + a_{1,1}} ,   (13)

where c_{j,h} is the number of individuals at generation t with x^{(k)} = j and x^{(l)} = h, and a_{j,h} > 0 (for example a_{j,h} = 1/4).

To choose the edge set E minimizing D(p_t \| q_t), compute for every pair k, l \in V, k \ne l, the mutual information

I(k,l) := \sum_{x^{(k)}, x^{(l)}} q_t^{(k,l)}(x^{(k)}, x^{(l)}) \log \frac{q_t^{(k,l)}(x^{(k)}, x^{(l)})}{q_t^{(k)}(x^{(k)})\, q_t^{(l)}(x^{(l)})} .   (14)

Algorithm 3 (Chow-Liu)
1. Compute I(k,l) for all k, l \in V, k \ne l.
2. Among the pairs e^* = \{k,l\} \notin E whose addition to E creates no cycle, add to E the one with the largest I(k,l).
3. Repeat 2. as long as some pair \{k,l\} \notin E can still be added without creating a cycle.

This is optimal because

D(p_t \| q_t) = -H(p_t) + \sum_{k \in V} H(k) - \sum_{\{k,l\} \in E} I(k,l), \qquad H(k) := -\sum_{x^{(k)}} q_t^{(k)}(x^{(k)}) \log q_t^{(k)}(x^{(k)}) ,   (15)

and only the last sum depends on E, so minimizing D(p_t \| q_t) amounts to maximizing \sum_{\{k,l\} \in E} I(k,l).
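The following sketch estimates (14) from a sample by relative frequencies (as in (16) below) and runs the greedy edge selection of Algorithm 3 with a union-find cycle test; the data layout and tie handling are assumptions of this sketch rather than part of the algorithm's specification.

import math
from itertools import combinations

def mutual_information(data, k, l):
    # eq. (14), with the marginals estimated by relative frequencies
    M = len(data)
    pk = [sum(1 for x in data if x[k] == v) / M for v in (0, 1)]
    pl = [sum(1 for x in data if x[l] == v) / M for v in (0, 1)]
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pkl = sum(1 for x in data if x[k] == a and x[l] == b) / M
            if pkl > 0:
                mi += pkl * math.log(pkl / (pk[a] * pl[b]))
    return mi

def chow_liu(data, L):
    # Algorithm 3: add edges in decreasing order of I(k,l), skipping any edge
    # that would close a cycle (a Kruskal-style greedy procedure)
    edges = sorted(combinations(range(L), 2),
                   key=lambda e: mutual_information(data, *e), reverse=True)
    parent = list(range(L))                 # union-find for the cycle test
    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v
    tree = []
    for k, l in edges:
        rk, rl = find(k), find(l)
        if rk != rl:
            parent[rk] = rl
            tree.append((k, l))
    return tree

# assumed toy data over L = 3 bits
print(chow_liu([(0, 0, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 0, 1)], L=3))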
Algorithm 3 is a greedy procedure of the same kind as Kruskal's algorithm for maximum-weight spanning trees, and the q_t of the form (12) minimizing D(p_t \| q_t) is exactly the one it produces. For example, for V = \{1,2,3\} with I(1,2) > I(2,3) > I(3,1) > 0 it yields E = \{\{1,2\}, \{2,3\}\}; for V = \{1,2,3,4\} with I(1,2) > I(1,3) > I(2,3) > I(1,4) > I(2,4) > I(3,4) it yields E = \{\{1,2\}, \{1,3\}, \{1,4\}\}.

If the pairwise marginals are estimated without smoothing,

\hat q_t^{(k,l)}(j,h) := \frac{c_{j,h}}{M} ,   (16)

the structure can instead be selected by minimizing the information criterion

\sum_{i \in J} -c[i] \log \hat q_t(i) + \frac{k(E)}{2}\, d_n ,   (17)

where k(E) is the number of free parameters of q_t and \{d_n\} is a weight sequence; d_n = \log n gives the MDL criterion (here the sample size is n = M). Each x^{(k)} takes \gamma_k = 2 values (0 and 1). Minimizing (17) is achieved by running Algorithm 3 with I(k,l) replaced by (Suzuki, 1993)

I'(k,l) := I(k,l) - (\gamma_k - 1)(\gamma_l - 1)\, \frac{d_n}{2M} ,   (18)

adding an edge only while I'(k,l) > 0; the same criterion applies when \gamma_k \ne 2. With d_n = 0 the result G = (V,E) is the Chow-Liu tree; with d_n \ne 0 edges with small mutual information are excluded, so the result is in general a forest.
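The adjustment (18) is a one-line computation; in the sketch below the placement of the sample size M in the penalty and the MDL choice d_n = log M follow the reconstruction of (17)-(18) adopted above and are assumptions of this sketch.

import math

def adjusted_mi(mi, M, gamma_k=2, gamma_l=2):
    # eq. (18): I'(k,l) = I(k,l) - (gamma_k - 1)(gamma_l - 1) d_n / (2M),
    # with the MDL choice d_n = log M; an edge is added only while I'(k,l) > 0,
    # so the resulting graph is in general a forest rather than a tree
    d_n = math.log(M)
    return mi - (gamma_k - 1) * (gamma_l - 1) * d_n / (2 * M)

print(adjusted_mi(0.05, M=100))   # positive: the edge would be kept
print(adjusted_mi(0.01, M=100))   # negative: the edge would be dropped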
7 Factorized Distribution Algorithm (FDA)

With no restriction at all, the full factorization

p_t(x^{(1)}, \ldots, x^{(L)}) = \prod_{j=1}^{L} p_t(x^{(j)} \mid x^{(1)}, \ldots, x^{(j-1)})

has a number of parameters exponential in L, so the GA cannot estimate it. The Factorized Distribution Algorithm (FDA) instead exploits an additive decomposition of the fitness.

7.1 Additively decomposed functions and the running intersection property

Let g : \{0,1\}^L \to \mathbb{R}_+, I := \{1, 2, \ldots, L\}, and let s_1, s_2, \ldots, s_m \subseteq I be subsets such that

g(x) = \sum_{k} g_k(x_{s_k}),   (19)

where x_{s_k} := (x^{(j)})_{j \in s_k}. (In (19) the decomposition \{g_k\}, \{s_k\} is assumed to be known.) For the sequence s_1, s_2, \ldots, s_m define the histories \{d_k\}, residuals \{b_k\} and separators \{c_k\}:

d_0 = \{\}, \quad d_k := \bigcup_{j=1}^{k} s_j, \quad b_k := s_k \setminus d_{k-1}, \quad c_k := s_k \cap d_{k-1} .

The sequence s_1, s_2, \ldots, s_m satisfies the running intersection property (RIP) if
1. b_k \ne \{\} for k = 1, 2, \ldots, m;
2. d_m = I;
3. for every k = 2, 3, \ldots, m there exists j < k with c_k \subseteq s_j.

If RIP fails for the given order s_1, s_2, \ldots, s_m, one may reorder the sets or merge some of them, s_{k_1}, \ldots, s_{k_l} \to s_{k_1} \cup \cdots \cup s_{k_l}; merging always succeeds eventually, since for m = 1 (s_1 = I) RIP holds trivially. For the Boltzmann distribution p_t(x) := e^{\beta_t g(x)} / \sum_{x'} e^{\beta_t g(x')}, if RIP holds then

p_t(x) = \prod_{k=1}^{m} p_t(x_{b_k} \mid x_{c_k}) = \prod_{k=1}^{m} \frac{p_t(x_{b_k}, x_{c_k})}{p_t(x_{c_k})} .   (20)

Example 1. For J_{23}, J_{31}, J_{12} \in \mathbb{R}_+ let g(x^{(1)}, x^{(2)}, x^{(3)}) = J_{23}\, x^{(2)} x^{(3)} + J_{31}\, x^{(3)} x^{(1)} + J_{12}\, x^{(1)} x^{(2)}, so I = \{1,2,3\}, s_1 = \{2,3\}, s_2 = \{3,1\}, s_3 = \{1,2\}. Then d_0 = \{\}, d_1 = \{2,3\}, d_2 = \{1,2,3\}, d_3 = \{1,2,3\}; b_1 = \{2,3\}, b_2 = \{1\}, b_3 = \{\}; c_1 = \{\}, c_2 = \{3\} \subseteq s_1, and c_3 = \{1,2\} is contained in neither s_1 nor s_2. Since b_3 = \{\}, RIP fails, and the only remedy is to merge the three sets s_1, s_2, s_3 into the single set \{1,2,3\}.
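Computing the residuals and separators and testing RIP is mechanical; a sketch (checking conditions 1 and 3, with condition 2 left to the caller) is given below, run on the two examples of this section.

def rip_decomposition(subsets):
    # residuals b_k and separators c_k for s_1, ..., s_m, and a check of the
    # running intersection property (conditions 1 and 3; condition 2, d_m = I,
    # is left to the caller)
    d = set()                                    # history d_{k-1}
    b_list, c_list = [], []
    for s in subsets:
        s = set(s)
        b_list.append(s - d)                     # b_k = s_k \ d_{k-1}
        c_list.append(s & d)                     # c_k = s_k intersect d_{k-1}
        d |= s                                   # d_k = d_{k-1} union s_k
    rip = all(b_list) and all(
        not c_list[k] or any(c_list[k] <= set(subsets[j]) for j in range(k))
        for k in range(1, len(subsets)))
    return b_list, c_list, rip

# the two examples above
print(rip_decomposition([{2, 3}, {3, 1}, {1, 2}]))   # b_3 = {}  ->  RIP fails
print(rip_decomposition([{2, 3}, {3, 1}, {4}]))      # RIP holds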
Example 2. Let g(x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}) = J_{23}\, x^{(2)} x^{(3)} + J_{31}\, x^{(3)} x^{(1)} + J_4\, x^{(4)}, so I = \{1,2,3,4\}, s_1 = \{2,3\}, s_2 = \{3,1\}, s_3 = \{4\}. Then d_0 = \{\}, d_1 = \{2,3\}, d_2 = \{1,2,3\}, d_3 = \{1,2,3,4\}; b_1 = \{2,3\}, b_2 = \{1\}, b_3 = \{4\}; c_1 = \{\}, c_2 = \{3\} \subseteq s_1, c_3 = \{\}. RIP holds, and (20) gives

p_t(x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}) = p_t(x^{(2)}, x^{(3)})\, p_t(x^{(1)} \mid x^{(3)})\, p_t(x^{(4)}) = \frac{p_t(x^{(2)}, x^{(3)})\, p_t(x^{(3)}, x^{(1)})\, p_t(x^{(4)})}{p_t(x^{(3)})} .

7.2 Junction trees

Let G = (V, E) be an undirected graph and let X, Y, Z \subseteq V be disjoint subsets with X \cup Y \cup Z = V. Write \langle X | Z | Y \rangle_G if Z separates X from Y in G, i.e. every path from a vertex in X to a vertex in Y passes through Z. Write I(X, Z, Y) for the conditional independence of X and Y given Z: p(x \mid y, z) = p(x \mid z) for all values x of X and all values y of Y, z of Z with p(y, z) > 0. The graph G is an I-map of the distribution if

\langle X | Z | Y \rangle_G \implies I(X, Z, Y)   (21)

for all such X, Y, Z; the smallest edge set E for which (21) holds gives the minimal I-map (the Markov network) of the distribution (Pearl, 1988).

Now let V = \{x^{(1)}, \ldots, x^{(L)}\} and let G = (V, E) be an I-map of the distribution. A junction tree T for G is defined by:
1. T is a tree whose vertices are cliques of G;
2. for any two cliques C_1, C_2 of T, every clique C on the path connecting C_1 and C_2 in T contains C_1 \cap C_2.

A junction tree exists if and only if G is chordal, i.e. every cycle of length at least 4 has a chord. It is constructed as follows:
1. triangulate G, i.e. add edges until every cycle of length at least 4 has a chord;
2. enumerate the maximal cliques C_1, \ldots, C_m of the triangulated graph;
3. let \mathcal{V} := \{C_1, \ldots, C_m\}, give each pair C, C' \in \mathcal{V} the weight c(C, C') := |C \cap C'|, and construct a maximum-weight spanning tree T on \mathcal{V}, e.g. by the greedy algorithm of Kruskal (1956), which repeatedly adds the heaviest remaining pair that closes no cycle.

The resulting T is a junction tree, and the distribution factorizes as

p_t(V) = \frac{\prod_{v \in \mathcal{V}} p_t(v)}{\prod_{e \in \mathcal{E}} p_t(e)} ,

the product of the clique marginals divided by the product of the separator marginals (\mathcal{E} being the separators attached to the edges of T).

[Figure 1: the graph on x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}; triangulation adds the edge \{x^{(1)}, x^{(3)}\}.]

For V = \{x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}\} and G = (V, E) as in Figure 1, adding the edge \{x^{(1)}, x^{(3)}\} makes G chordal; the maximal cliques are C_1 = \{x^{(1)}, x^{(2)}, x^{(3)}\} and C_2 = \{x^{(1)}, x^{(3)}, x^{(4)}\} with separator C_1 \cap C_2 = \{x^{(1)}, x^{(3)}\} (junction tree \mathcal{V} = \{C_1, C_2\}, \mathcal{E} = \{C_1 \cap C_2\}), so

p_t(x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}) = \frac{p_t(x^{(1)}, x^{(2)}, x^{(3)})\, p_t(x^{(1)}, x^{(3)}, x^{(4)})}{p_t(x^{(1)}, x^{(3)})} .

Each factor involves only the variables of one clique C \in \mathcal{V}, so the number of parameters is governed by the clique sizes; finding a triangulation that keeps the largest clique small is NP-hard.
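The factorization over cliques and separators can be checked numerically on a toy example; the sketch below builds a joint distribution from assumed positive potentials on the two cliques of Figure 1 and verifies p = p_{123} p_{134} / p_{13} state by state (the potentials are illustrative assumptions only).

import itertools

# a joint distribution over (x1, x2, x3, x4) built from assumed positive
# potentials on the cliques C1 = {x1,x2,x3} and C2 = {x1,x3,x4} of Figure 1
def psi1(x1, x2, x3): return 1.0 + x1 + 2.0 * x2 * x3
def psi2(x1, x3, x4): return 1.0 + x3 * x4 + 0.5 * x1

states = list(itertools.product((0, 1), repeat=4))
Z = sum(psi1(a, b, c) * psi2(a, c, d) for a, b, c, d in states)
p = {s: psi1(s[0], s[1], s[2]) * psi2(s[0], s[2], s[3]) / Z for s in states}

def marg(idx):
    # marginal of p over the variables at the positions listed in idx
    out = {}
    for s, v in p.items():
        key = tuple(s[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

p123, p134, p13 = marg((0, 1, 2)), marg((0, 2, 3)), marg((0, 2))
for s in states:
    rhs = p123[s[0], s[1], s[2]] * p134[s[0], s[2], s[3]] / p13[s[0], s[2]]
    assert abs(p[s] - rhs) < 1e-12
print("p = p_123 * p_134 / p_13 holds for every state")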
7.3 Consistent local distributions and iterative proportional fitting

Let s_1, \ldots, s_m \subseteq I and suppose that for each k a local distribution \tilde q_k(x_{s_k}) = \tilde q_k(x_{b_k}, x_{c_k}) is given, normalized by

\sum_{x_{b_k}, x_{c_k}} \tilde q_k(x_{b_k}, x_{c_k}) = 1 ,   (22)

with marginal

\sum_{x_{b_k}} \tilde q_k(x_{b_k}, x_{c_k}) = \tilde q_k(x_{c_k}) .   (23)

The local distributions \tilde q_1, \ldots, \tilde q_m are called consistent if any two of them induce the same marginal on the intersection of their index sets. If RIP holds and b_k \ne \{\} for k = 1, \ldots, m, then q(x) := \prod_{k=1}^{m} \tilde q_k(x_{b_k} \mid x_{c_k}) satisfies \sum_x q(x) = 1, and its conditionals reproduce the given ones:

q_k(x_{b_k} \mid x_{c_k}) = \tilde q_k(x_{b_k} \mid x_{c_k}) .   (24)

If RIP does not hold, then in general

q_k(x_{b_k}, x_{c_k}) \ne \tilde q_k(x_{b_k}, x_{c_k}) .   (25)

For example, with s_1 = \{2,3\}, s_2 = \{3,1\}, s_3 = \{1,2\} we have b_3 = \{\} and q(x) = \tilde q_1(x_{b_1})\, \tilde q_2(x_{b_2} \mid x_{c_2}); summing q over x^{(3)} gives a marginal q(x^{(1)}, x^{(2)}) that in general differs from \tilde q_3(x^{(1)}, x^{(2)}).

Given \{b_k\}, \{c_k\} and consistent \tilde q_1, \ldots, \tilde q_m, consider the distributions q whose marginals match the given ones,

\tilde q_k(x_{s_k}) = \sum_{(x^{(j)})_{j \in I \setminus s_k}} q(x) ,   (26)

and among them take the q maximizing the entropy

H(q) := -\sum_x q(x) \log q(x) .   (27)

This q is computed by Iterative Proportional Fitting (IPF) (Csiszár, 1975): starting from some q^{(0)} (for example the uniform distribution 2^{-L}), for \nu = 0, 1, 2, \ldots and x \in \{0,1\}^L set

q^{(\nu+1)}(x) \leftarrow q^{(\nu)}(x)\, \frac{\tilde q_k(x_{s_k})}{\sum_{y} q^{(\nu)}(x_{s_k}, y)} , \qquad k = (\nu \bmod m) + 1,

where y runs over the values of x_{I \setminus s_k}, so the denominator is the current marginal q^{(\nu)}(x_{s_k}).

Example: I = \{1,2,3\}, s_1 = \{2,3\}, s_2 = \{3,1\}, s_3 = \{1,2\}, m = 3. Then q^{(0)}(x^{(1)}, x^{(2)}, x^{(3)}) = 2^{-3}, q^{(1)}(x^{(1)}, x^{(2)}, x^{(3)}) = \tilde q_1(x^{(2)}, x^{(3)}) \cdot 2^{-1},

q^{(2)}(x^{(1)}, x^{(2)}, x^{(3)}) = \frac{\tilde q_1(x^{(2)}, x^{(3)})\, \tilde q_2(x^{(3)}, x^{(1)})}{\sum_{x^{(2)}} \tilde q_1(x^{(2)}, x^{(3)})} ,

the next step rescales by \tilde q_3(x^{(1)}, x^{(2)}) divided by the current marginal q^{(2)}(x^{(1)}, x^{(2)}), and so on; in this case the iteration converges only in the limit.

Example: I = \{1,2,3\}, s_1 = \{2,3\}, s_2 = \{3,1\}, m = 2. Then q^{(0)} = 2^{-3}, q^{(1)} = \tilde q_1(x^{(2)}, x^{(3)}) \cdot 2^{-1}, and q^{(2)} = q^{(3)} = \cdots: the iteration stops changing after the first sweep. In general, when RIP holds, IPF started from the uniform distribution converges after the first sweep \nu = 0, 1, \ldots, m, and the limit is the factorization of the form (20),

q(x) = \prod_{k=1}^{m} \tilde q_k(x_{b_k} \mid x_{c_k}) .   (28)
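A direct implementation of IPF over the full table on \{0,1\}^L (feasible only for small L) might read as follows; the target pairwise marginals in the demo are assumed numbers chosen to be consistent on the shared variable.

import itertools

def ipf(targets, L, sweeps=50):
    # Iterative Proportional Fitting (Csiszar, 1975): start from the uniform
    # q^(0)(x) = 2^(-L) and repeatedly rescale so that the marginal on s_k
    # matches the given local distribution q~_k
    states = list(itertools.product((0, 1), repeat=L))
    q = {x: 2.0 ** (-L) for x in states}
    for _ in range(sweeps):
        for s, target in targets.items():
            marg = {}
            for x, v in q.items():               # current marginal on s_k
                key = tuple(x[i] for i in s)
                marg[key] = marg.get(key, 0.0) + v
            for x in states:
                key = tuple(x[i] for i in s)
                q[x] = q[x] * target[key] / marg[key] if marg[key] > 0 else 0.0
    return q

# the second example above: s_1 = {2,3}, s_2 = {3,1} (0-based positions (1,2), (2,0));
# the target marginals are assumed numbers, consistent on the shared variable x^(3)
targets = {
    (1, 2): {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3},
    (2, 0): {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.2},
}
q = ipf(targets, L=3, sweeps=2)   # RIP holds, so one sweep already suffices
print(sum(q.values()))            # 1.0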
8 Minimizing D(q||p)

In Sections 5-7 the approximation was built from marginals \{\tilde q_k\} of the target and fitted in the sense of D(p \| q). In this section q is instead chosen to minimize the Kullback-Leibler divergence D(q \| p), where p is the Boltzmann distribution the GA aims at. Put g(x) = \log f(x) (so that -g(x) plays the role of an energy) and define

U(q) := -\sum_x q(x)\, g(x), \qquad G(q) := U(q) - H(q)

(the average energy and the variational free energy). With p(x) = e^{g(x)}/Z and Z = \sum_x e^{g(x)},

D(q \| p) = U(q) - H(q) + \ln Z ,   (29)

so minimizing D(q \| p) over q is the same as minimizing G(q); for q(x) = p(x) = e^{g(x)}/Z we get D(q \| p) = 0 and G(q) = -\ln Z.

The EDA restricts q to a family with parameters q_k, k = 1, 2, \ldots, m, and minimizes D(q \| p) by setting the derivatives with respect to the q_k to zero; the resulting equations are solved iteratively. For simplicity fix \beta_t = 1 and write p(x), q(x) for p_t(x), q_t(x); \ln Z is a constant that does not affect the minimization. If g is additively decomposed as in (19) and written as a polynomial

g(x) = \sum_{s} a_s \prod_{i \in s} x^{(i)} ,   (30)

then

U(q) = -\sum_x q(x)\, g(x) = -\sum_{s} a_s\, q(x^{(i)} = 1,\ i \in s) ,   (31)

so U(q) is determined by low-order marginals of q and, unlike \ln Z, never requires a sum over all 2^L configurations.

8.1 Mean-field approximation

Take m = L and write q_k := q_k(x^{(k)} = 1), with

q(x) = \prod_{k=1}^{L} q_k(x^{(k)}) ,   (32)

H(q) = -\sum_{k=1}^{L} \sum_{x^{(k)}} q_k(x^{(k)}) \log q_k(x^{(k)}) .   (33)

Setting the derivative of (29) to zero,

\frac{\partial D(q \| p)}{\partial q_k} = \log \frac{q_k}{1 - q_k} + \frac{\partial U}{\partial q_k} = 0,

hence

q_k = \frac{1}{1 + \exp[\partial U / \partial q_k]} .   (34)

The coupled equations (34) for \{q_k\} are solved by fixed-point iteration (the mean-field equations).
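For a pairwise g of the form (30), \partial U/\partial q_k is an affine function of the other marginals and (34) becomes a logistic fixed-point iteration; the sketch below iterates it for an assumed small instance (couplings, fields and the iteration count are illustrative assumptions, and convergence to the global minimum of G(q) is not guaranteed).

import math

def mean_field(J, h, iters=200):
    # fixed-point iteration of eq. (34), q_k = 1 / (1 + exp(dU/dq_k)), for
    # g(x) = sum_{j<k} J[j][k] x^(j) x^(k) + sum_k h[k] x^(k) and the product
    # ansatz (32); then dU/dq_k = -( h[k] + sum_{j != k} J_{jk} q_j )
    L = len(h)
    q = [0.5] * L
    for _ in range(iters):
        for k in range(L):
            field = h[k] + sum(J[min(j, k)][max(j, k)] * q[j]
                               for j in range(L) if j != k)
            q[k] = 1.0 / (1.0 + math.exp(-field))    # = 1/(1 + exp(dU/dq_k))
    return q

# an assumed instance with three coupled bits (upper-triangular couplings J_{jk})
J = [[0.0, 1.0, 0.5],
     [0.0, 0.0, 1.5],
     [0.0, 0.0, 0.0]]
h = [0.2, -0.1, 0.3]
print(mean_field(J, h))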
8.2 Bethe approximation

For a given family \{s_k\}, take

q(x) = \prod_{k=1}^{m} \tilde q_k(x_{b_k} \mid x_{c_k}),

whose entropy is

H(q) = -\sum_{k=1}^{m} \sum_{x_{b_k}, x_{c_k}} q_k(x_{b_k}, x_{c_k}) \log \tilde q_k(x_{b_k} \mid x_{c_k}) ,   (35)

where q_k denotes the true marginal of q. If RIP does not hold, q_k(x_{b_k}, x_{c_k}) \ne \tilde q_k(x_{b_k}, x_{c_k}) in general, and H(q) is approximated by

H(q) \approx -\sum_{k=1}^{m} \sum_{x_{b_k}, x_{c_k}} \tilde q_k(x_{b_k}, x_{c_k}) \log \tilde q_k(x_{b_k} \mid x_{c_k}) .   (36)

Minimizing U(q) - H(q) subject to normalization and marginal consistency gives the Lagrangian

L = U(q) - H(q)
  - \sum_{k=1}^{m} \sum_{x_{c_k}} \lambda_k(x_{c_k}) \Big[ \sum_{x_{b_k}} \tilde q_k(x_{b_k}, x_{c_k}) - \tilde q_k(x_{c_k}) \Big]
  - \sum_{k=1}^{m} \sum_{x_{b_k}} \kappa_k(x_{b_k}) \Big[ \sum_{x_{c_k}} \tilde q_k(x_{b_k}, x_{c_k}) - \tilde q_k(x_{b_k}) \Big]
  - \sum_{k=1}^{m} \gamma_k \Big[ \sum_{x_{b_k}, x_{c_k}} \tilde q_k(x_{b_k}, x_{c_k}) - 1 \Big]
  - \sum_{k=1}^{m} \delta_k \Big[ \sum_{x_{c_k}} \tilde q_k(x_{c_k}) - 1 \Big] ,   (37)

whose stationary points with respect to the \tilde q_k(x_{b_k}, x_{c_k}) and \tilde q_k(x_{c_k}) determine the approximation.

The same machinery applies to probabilistic inference in a BN: given an observation y of some of the L binary variables, put g(x) = \log P(x \mid y), so that U(q) = -\sum_x q(x) \log P(x \mid y), and the above yields an approximation of the posterior. If I = \{1, \ldots, L\} and every s_k has size 2 (pairs of variables), this is the Bethe approximation. When the pairwise graph is singly connected (so that RIP can be satisfied),

q(x) = \frac{\prod_{(i,j)} \tilde q_{ij}(x^{(i)}, x^{(j)})}{\prod_{k} [\,\tilde q_k(x^{(k)})\,]^{N(k)-1}} ,   (38)

where N(k) is the number of pairs containing variable k. Solving the stationarity conditions of (37) by iterating the resulting message equations is belief propagation (BP): on singly connected graphs BP is exact, while on graphs with cycles its fixed points coincide with the stationary points of the Bethe free energy, and larger regions lead to the Kikuchi approximation and generalized belief propagation (Yedidia-Freeman-Weiss, 2000). Thus the Bethe approximation does not require RIP: it is used precisely when the \{s_k\} are not singly connected and RIP cannot be satisfied; when they are singly connected, the exact treatment of Section 7 applies.
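On a singly connected pairwise model the beliefs produced by the sum-product messages are exact marginals; the short sketch below demonstrates this on a three-variable chain with assumed potentials by comparing the beliefs with brute-force marginals.

import itertools

# sum-product messages on the chain x^(1) - x^(2) - x^(3) with assumed
# pairwise potentials; on a singly connected graph the beliefs are the
# exact marginals, so they must agree with brute-force enumeration
psi12 = [[2.0, 1.0], [1.0, 3.0]]
psi23 = [[1.0, 2.0], [2.0, 1.0]]

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

m12 = [sum(psi12[a][b] for a in (0, 1)) for b in (0, 1)]           # message 1 -> 2
m32 = [sum(psi23[b][c] for c in (0, 1)) for b in (0, 1)]           # message 3 -> 2
m21 = [sum(psi12[a][b] * m32[b] for b in (0, 1)) for a in (0, 1)]  # message 2 -> 1

belief2 = normalize([m12[b] * m32[b] for b in (0, 1)])
belief1 = normalize(m21)

# brute force: p(x) proportional to psi12(x1,x2) psi23(x2,x3)
w = {s: psi12[s[0]][s[1]] * psi23[s[1]][s[2]]
     for s in itertools.product((0, 1), repeat=3)}
p1 = normalize([sum(v for s, v in w.items() if s[0] == a) for a in (0, 1)])
p2 = normalize([sum(v for s, v in w.items() if s[1] == b) for b in (0, 1)])
print(belief1, p1)   # identical
print(belief2, p2)   # identical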
9 Summary

1. The GA can be viewed as sampling from a Boltzmann distribution whose temperature is lowered at each generation (\beta_t \to \infty); that distribution can be estimated with restricted BNs and with mean-field/Bethe approximations instead of the infeasible sum over 2^L terms.
2. The GA and probabilistic inference in BNs both concern functions \{0,1\}^L \to \mathbb{R}_+; techniques developed for BNs (where the exact problems are NP-hard) carry over to the GA, and vice versa.

Acknowledgment: SMAMIP.

References

1. Csiszár, I. (1975) I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3(1):146-158.
2. Chandi De Silva and Suzuki, J. (2005) On the Stationary Distribution of GAs with Positive Crossover Probability. GECCO 2005, Washington DC, June 2005.
3. Chow, C. K. and Liu, C. N. (1968) Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3):462-467.
4. Heinz Mühlenbein and Thilo Mahnig. (1999) FDA - A scalable evolutionary algorithm for the optimization of additively decomposed functions. Evolutionary Computation, 7(4):353-376.
5. Heinz Mühlenbein and Robin Höns. (2005) The Estimation of Distributions and the Minimum Relative Entropy Principle. To appear in Evolutionary Computation.
6. John Holland. (1975) Adaptation in Natural and Artificial Systems. University of Michigan Press.
7. Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, Palo Alto, CA.
8. Suzuki, J. (1993) A Construction of Bayesian Networks from Databases Based on the MDL Principle. Uncertainty in Artificial Intelligence 1993: 266-273. See also: Learning Bayesian Belief Networks Based on the Minimum Description Length Principle: Basic Properties. IEICE Transactions on Fundamentals, E82-A: 2237-2245 (1999).
9. Suzuki, J. (1995) A Markov Chain Analysis on Simple Genetic Algorithms. IEEE Transactions on Systems, Man and Cybernetics, 25(2):655-659, April 1995.
10. Suzuki, J. (1998) A Further Result on the Markov Chain Model of Genetic Algorithms and Its Application to a Simulated Annealing-like Strategy. IEEE Transactions on Systems, Man and Cybernetics, 28(1).
11. Yedidia, J. S., Freeman, W. T. and Weiss, Y. (2000) Generalized Belief Propagation. NIPS 2000.