Entropy for complex systems and its use in path-dependent non-ergodic processes
1 Entropy for complex systems and its use in path-dependent non-ergodic processes (Stefan Thurner)
2 With Rudolf Hanel and Bernat Corominas-Murtra. BCM, RH, ST, PNAS 112 (2015)
3 I will not talk about...
4 Remember the Shannon-Khinchin axioms:
SK1: S depends continuously on p → g is continuous
SK2: S is maximal for the equi-distribution p_i = 1/W → g is concave
SK3: S(p_1, p_2, ..., p_W) = S(p_1, p_2, ..., p_W, 0) → g(0) = 0
SK4: S(A+B) = S(A) + S(B|A)
Note: write the entropy as S[p] = Σ_{i=1}^W g(p_i). If SK1-SK4 hold → g(x) = −k x ln x
5 Shannon-Khinchin axiom 4 is nonsense for CS: SK4 corresponds to Markovian and ergodic processes; SK4 is violated for non-ergodic systems → nuke SK4
6 The Complex Systems axioms: SK1 holds, SK2 holds, SK3 holds, S_g = Σ_{i=1}^W g(p_i), W ≫ 1.
Theorem: All systems for which these axioms hold (1) can be uniquely classified by 2 numbers, c and d, and (2) have the entropy
S_{c,d} = e/(1 − c + cd) [ Σ_{i=1}^W Γ(1 + d, 1 − c ln p_i) − c/e ]
(e is Euler's constant, Γ the upper incomplete gamma function)
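A quick numerical sketch (my code, not from the talk): S_{c,d} can be evaluated with SciPy via Γ(a, x) = gammaincc(a, x)·Γ(a); at (c, d) = (1, 1) it reproduces the Shannon entropy up to an additive constant of 1.

```python
import numpy as np
from scipy.special import gamma, gammaincc

def S_cd(p, c, d):
    """Hanel-Thurner entropy S_{c,d} = e/(1-c+cd) * [sum_i Gamma(1+d, 1-c*ln p_i) - c/e].
    Gamma(a, x) is the upper incomplete gamma; gammaincc is its regularized form."""
    p = np.asarray(p, dtype=float)
    upper = gammaincc(1 + d, 1 - c * np.log(p)) * gamma(1 + d)
    return np.e / (1 - c + c * d) * (upper.sum() - c / np.e)

p = np.full(8, 1 / 8)          # uniform distribution, W = 8
print(S_cd(p, c=1, d=1))       # ~ 1 + ln 8
print(1 + np.log(8))           # Shannon entropy plus the additive constant 1
```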
7 Classification of entropies: order in the zoo
- S_BG = Σ_i p_i ln(1/p_i): (c, d) = (1, 1)
- S_{q<1} = (1 − Σ_i p_i^q)/(q − 1), q < 1: (c, d) = (q, 0)
- S_κ = −Σ_i p_i (p_i^κ − p_i^{−κ})/(2κ), 0 < κ ≤ 1: (c, d) = (1 − κ, 0)
- S_{q>1} = (1 − Σ_i p_i^q)/(q − 1), q > 1: (c, d) = (1, 0)
- S_b = Σ_i (1 − e^{−b p_i}) + e^{−b} − 1, b > 0: (c, d) = (1, 0)
- S_E = Σ_i p_i (1 − e^{(p_i − 1)/p_i}): (c, d) = (1, 0)
- S_η = Σ_i [Γ((η+1)/η, −ln p_i) − p_i Γ((η+1)/η)], η > 0: (c, d) = (1, 1/η)
- S_γ = Σ_i p_i ln^{1/γ}(1/p_i): (c, d) = (1, 1/γ)
- S_β = Σ_i p_i^β ln(1/p_i): (c, d) = (β, 1)
- S_{c,d} = Σ_i e r Γ(d + 1, 1 − c ln p_i) − c r: (c, d)
8 Distribution functions of CS, if used in the max-ent principle:
- p^{(1,1)}: exponentials (Boltzmann distribution), p ∝ e^{−ax}
- p^{(q,0)}: power laws (q-exponentials), p ∝ (a + x)^{−b}
- p^{(1,d>0)}: stretched exponentials, p ∝ e^{−a x^b}
- p^{(c,d)}: all others, Lambert-W exponentials, p ∝ e^{−a W(x^b)}
NO OTHER POSSIBILITIES if SK4 is violated
9 [Figure: distribution functions p(x), q-exponentials (power laws) vs Lambert-W exponentials; panels (b) d = 0.025, r = 0.9/(1 − c) and (c) r = exp(−d/2)/(1 − c), curves for several values of c and (c, d).]
10 [Figure: the (c, d) phase diagram of admissible entropies. (1, 1): BG entropy; (c, 0): q-entropy with 0 < q < 1; (c, d) with d > 0: Lambert-W_0 exponentials; (c, d) with d < 0: Lambert-W_{−1} exponentials with compact support of the distribution function; stretched exponentials are asymptotically stable; the boundary regions, including (1, 0) and (0, 0), violate K2 and K3.]
11 I will talk about... the origin of power laws
12 Power laws are pests
13 They are everywhere, it's hard to control them, and you never get rid of them.
14 Classical routes to scaling are limited: statistical mechanics at phase transitions; self-organised criticality; multiplicative processes with constraints; preferential processes; generalized entropies.
15 Routes to scaling I: phase transitions. In statistical physics, power laws emerge right at the phase transition; this happens at the critical point. Power laws appear in various quantities: critical exponents. Various materials have the same critical exponents and behave identically: universality.
16 Routes to scaling II: self-organised criticality. Systems find critical points by themselves, without tuning; since they sit at the critical point: power laws. Examples: sandpiles, cascading failures, earthquakes, systemic risk (Wiesenfeld).
17 Routes to scaling III: multiplicative processes with constraints. Gaussian distribution: add random numbers from the same source (CLT). Power laws: multiply random numbers and impose some constraint, for example that the minimum cannot fall below X. Examples: wealth distribution, city sizes, double Pareto, language, ...
18 Routes to scaling IV: preferential processes. A process repeats in proportion to how many times it occurred before; the linking probability of a new node to the network is proportional to the degree of the existing nodes. Examples: network growth, language formation.
19 Routes to scaling V: other mechanisms. Lévy-stable processes, constraint optimisation, extreme value statistics, return times.
20 A few examples
21 City size (MEJ Newman 2005): multiplicative
22 Rainfall: SOC
23 Landslides: SOC
24 Hurricane damages: secondary (multiplicative)???
25 Financial interbank loans: multiplicative / preferential
26 Forest fires in various regions: SOC?
27 Moon crater diameters (MEJ Newman 2005): ???
28 Gamma rays from solar wind (MEJ Newman 2005)
29 Movie sales: SOC
30 Healthcare costs: multiplicative???
31 Words in books (MEJ Newman 2005): preferential / random / optimization
32 Citations of scientific articles (MEJ Newman 2005): preferential
33 Website hits (MEJ Newman 2005): preferential
34 Book sales (MEJ Newman 2005): preferential
35 Telephone calls (MEJ Newman 2005): preferential
36 Earthquake magnitude (MEJ Newman 2005): SOC
37 Seismic events: SOC
38 War intensity (MEJ Newman 2005): ???
39 Killings in wars: ???
40 Size of wars: ???
41 Wealth distribution (MEJ Newman 2005): multiplicative
42 Family names (MEJ Newman 2005): ???
43 More power laws... networks: literally thousands of scale-free networks; allometric scaling in biology; terrorist attacks; particle physics; dynamics in cities; fragmentation processes; random walks; crackling noise; growth with random times of observation; blackouts; the fossil record; bird sightings; fluvial discharge; contact processes; anomalous diffusion; ...
44 One phenomenon, many explanations: the Zipf law in word frequencies. Simon: preferential attachment. Mandelbrot: constraint optimization. Miller: monkeys produce random texts. (Solé: information-theoretic, sender and receiver.)
45 Did we miss something?
46 Many processes are history-dependent. History-dependence: future actions depend on the history of past actions. Possible outcomes: Ω = {1, 2, 3, ..., N}. Probability for the next outcome: p(x = i, t) = f(histogram(ω) up to time t). Often past actions constrain the possibilities for future ones: the sample space of these processes reduces as they unfold.
47 Example: history-dependent processes. [Figure: six panels 1)-6) illustrating a sample-space-reducing process.]
48 The sample space reduces and has a nested structure: Ω_1 ⊂ Ω_2 ⊂ ... ⊂ Ω_N ⊂ Ω
49 [Figure: staircase picture of the SSR process and the visiting distributions p(i), p(n) over sites i.]
50 This gives exact power laws! The probability to visit site i is p(i) = 1/i
51 Proof by induction. Let N = 2. There are two sequences φ: either φ directly generates a 1 with p = 1/2, or it first generates a 2 with p = 1/2 and then a 1 with certainty. Both sequences visit 1, but only one visits 2. As a consequence, P_2(2) = 1/2 and P_2(1) = 1. Now suppose P_{N−1}(i) = 1/i holds. The process starts with die N, and the probability to hit i in the first step is 1/N; likewise any other j with N ≥ j > i is reached with probability 1/N. If we get j > i, we reach i in a later step with probability P_{j−1}(i), which leads to the recursive scheme, for i < N:
P_N(i) = (1/N) (1 + Σ_{i<j≤N} P_{j−1}(i)).
Since by assumption P_{j−1}(i) = 1/i holds for i < j ≤ N, some algebra yields P_N(i) = 1/i.
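The result is easy to check by simulation; the following sketch (my own code, illustrative parameters) estimates the visit probabilities P_N(i) and compares them to 1/i.

```python
import numpy as np

rng = np.random.default_rng(0)

def ssr_visits(N, rng):
    """One SSR run: throw the N-faced die, then a (j-1)-faced die, ... until 1."""
    visits = []
    i = int(rng.integers(1, N + 1))   # first throw: uniform on {1, ..., N}
    visits.append(i)
    while i > 1:
        i = int(rng.integers(1, i))   # next die: uniform on {1, ..., i-1}
        visits.append(i)
    return visits

N, runs = 1000, 50_000
counts = np.zeros(N + 1)
for _ in range(runs):
    for i in ssr_visits(N, rng):
        counts[i] += 1

p_visit = counts / runs               # empirical P_N(i)
for i in (1, 2, 10, 100):
    print(i, round(p_visit[i], 3), 1 / i)
```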
52 The role of noise. φ: sample-space-reducing process (SSR); φ_R: random walk. Mix both processes: Φ^{(λ)} = λφ + (1 − λ)φ_R, λ ∈ [0, 1]. Adding noise of strength (1 − λ) makes any power possible: p(i) ∝ i^{−λ}. The noise level (1 − λ) is a surprise factor for the SSR process.
53 The role of noise: the result is exact, too. Clearly p^{(λ)}(i) = Σ_{j=1}^N P(i|j) p^{(λ)}(j) holds, with
P(i|j) = λ/(j − 1) + (1 − λ)/N for i < j (SSR), (1 − λ)/N for i ≥ j > 1 (RW), and 1/N for j = 1 (restart).
We get p^{(λ)}(i) = (1 − λ)/N + (λ/N) p^{(λ)}(1) + Σ_{j=i+1}^N [λ/(j − 1)] p^{(λ)}(j),
leading to the recursive relation p^{(λ)}(i + 1) − p^{(λ)}(i) = −(λ/i) p^{(λ)}(i + 1). Therefore
p^{(λ)}(i)/p^{(λ)}(1) = Π_{j=1}^{i−1} (1 + λ/j)^{−1} = exp[ −Σ_{j=1}^{i−1} log(1 + λ/j) ] ≈ exp( −λ Σ_{j=1}^{i−1} 1/j ) ≈ exp(−λ log i) = i^{−λ}
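The mixed process Φ^{(λ)} can be simulated the same way (again my sketch with illustrative parameters): with probability λ jump uniformly below the current state, otherwise jump uniformly anywhere; state 1 triggers a uniform restart. The visit frequencies approach i^{−λ}.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_ssr(N, lam, steps, rng):
    """Stationary state-visit frequencies of the mixed process Phi^(lambda)."""
    counts = np.zeros(N + 1)
    i = N
    for _ in range(steps):
        if i == 1 or rng.random() >= lam:
            i = int(rng.integers(1, N + 1))   # random-walk move / restart
        else:
            i = int(rng.integers(1, i))       # SSR move: uniform below i
        counts[i] += 1
    return counts / steps

N, lam = 1000, 0.5
p = noisy_ssr(N, lam, steps=1_000_000, rng=rng)
print(p[1] / p[100], 100 ** lam)   # ratio approximately (100/1)^lambda
```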
54 History-dependent processes with noise. [Figure: (a) rank distribution of the number of site visits for λ = 0.0, 0.5, 1.0, simulation vs prediction; (b) distance to the limiting distribution vs the number of jumps T.] How fast does the power law converge to its limiting distribution? Answer: with the same convergence speed as the central limit theorem for iid processes (Berry-Esseen theorem).
55 The SSR-based Zipf law is extremely robust. [Figure: panels (a)-(c): staircase pictures, the visiting distributions p(i) and p(n), and a continuous SSR cascade x → x′ → x″ on (0, 1], respectively on (0, N].]
56 The Zipf law is remarkably robust: it survives even for accelerated SSR processes.
57 The SSR-based Zipf law is extremely robust:
- if priors are power laws q(x) ∝ x^α with α > −1 → ZIPF
- if priors are polynomials → ZIPF
- if priors are exponentials q(x) ∝ e^{βx} with β > 0 → UNIFORM
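This robustness can be checked numerically; the sketch below (my code, with an illustrative prior exponent α = 2) runs SSR with jump weights given by a power-law prior and recovers p(i) ≈ 1/i.

```python
import numpy as np

rng = np.random.default_rng(2)

def ssr_with_prior(N, q, runs, rng):
    """SSR where a jump from state i lands on j < i with probability q_j / sum_{k<i} q_k."""
    counts = np.zeros(N + 1)
    for _ in range(runs):
        i = N
        while i > 1:
            w = q[1:i]                                      # weights of states 1..i-1
            i = int(rng.choice(np.arange(1, i), p=w / w.sum()))
            counts[i] += 1
    return counts / runs

N = 200
q = np.arange(N + 1, dtype=float) ** 2.0    # power-law prior, alpha = 2 > -1
p = ssr_with_prior(N, q, runs=20_000, rng=rng)
print(p[1], p[2], p[10])                    # ~ 1, 1/2, 1/10: Zipf
```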
58 Conjecture: SSR has 2 attractors for limit distributions, Zipf and uniform. Any others? With noise, Zipf turns into a power law p(i) ∝ i^{−λ}.
59 Language and word frequencies: the rank-ordered distribution of word frequencies follows an approximate power law f(r) ∝ r^{−α} with α ≈ 1. A closer look: α varies book by book. The quest for understanding spans almost half a century: Zipf, Simon, Mandelbrot, Miller, Solé, ... and none of them can explain the variation in α.
60 The origin of species
61 How is a sentence formed? It's history-dependent! Start with any word in the dictionary; the second word is constrained by grammar; the second word is constrained by context. To convey meaning, a sentence must reduce its sample space: come to the point! If a sentence is not SSR, it sounds crazy!
62 [Figure: panels (a)-(d): a toy vocabulary (wolf, howls, bites, runs, eats, moon, night, ...) in which each word used successively shrinks the set of grammatically and contextually admissible next words.]
63 Which word follows which? [Figure: (a) word-transition matrix: an entry marks that a given word follows word i somewhere in the text; (b) the sample-space volume |Ω_i| of words that can follow word i.]
64 Model for sentence formation: random books. Given: the word-transition matrix M of a real book, a vocabulary of N words, and sentence length L:
- pick one of the N words, say i; write i into the word list W = {i}
- jump to line i of M and pick a word from Ω_i, say k → W = {i, k}
- jump to line k and pick a word from Ω_k, say j → W = {i, k, j}
- repeat L times → a random sentence is formed
- repeat the process to produce N_sent sentences
The random book has the same word-transition matrix as the actual book.
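A sketch of this generator (my code; the random toy matrix stands in for a transition matrix M extracted from a real book):

```python
import numpy as np

rng = np.random.default_rng(3)

def random_book(M, n_sentences, L, rng):
    """Generate word indices with the same transition structure as M:
    M[i, j] = True iff word j may follow word i (Omega_i = row i of M)."""
    N = M.shape[0]
    words = []
    for _ in range(n_sentences):
        i = int(rng.integers(N))              # first word: uniform over the vocabulary
        words.append(i)
        for _ in range(L - 1):
            followers = np.flatnonzero(M[i])  # the sample space Omega_i
            if followers.size == 0:           # dead end: sentence stops early
                break
            i = int(rng.choice(followers))
            words.append(i)
    return np.array(words)

M = rng.random((500, 500)) < 0.05             # toy M; a real one comes from a book
book = random_book(M, n_sentences=2000, L=10, rng=rng)
freq = np.sort(np.bincount(book))[::-1]       # rank-frequency list, used to fit alpha
```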
65 Zipf exponent for random books. [Figure: model exponent α_model vs measured exponent α for ten books: Hamlet, Henry V, Romeo and Juliet, Different Forms of Flowers, Origin of Species, Descent of Man, A Tale of Two Cities, David Copperfield, An American Tragedy, Ulysses.]
66 Surprise factor: how nested is a book? Strict nestedness is clearly unrealistic; to what extent are there surprises? We want a number that quantifies to what extent sample-space reduction is present in a text: the nestedness
n(M) = ⟨ |Ω_i ∩ Ω_j| / min(|Ω_i|, |Ω_j|) ⟩_{(i,j)}
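A direct implementation of this measure (my sketch; Ω_i is read off row i of the transition matrix M):

```python
import numpy as np

def nestedness(M):
    """n(M): average pairwise overlap |Omega_i & Omega_j| / min(|Omega_i|, |Omega_j|)."""
    sets = [set(np.flatnonzero(row)) for row in M]
    vals = [
        len(sets[a] & sets[b]) / min(len(sets[a]), len(sets[b]))
        for a in range(len(sets))
        for b in range(a + 1, len(sets))
        if sets[a] and sets[b]          # skip words with empty follower sets
    ]
    return float(np.mean(vals))
```

Applied to the word-transition matrices of real books, this is the quantity plotted against α on the next slide.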
67 The surprise factor explains the Zipf exponent of individual books. [Figure: nestedness n(M) vs Zipf exponent α for Hamlet, Henry V, Romeo and Juliet, Different Forms of Flowers, Origin of Species, Descent of Man, A Tale of Two Cities, David Copperfield, An American Tragedy, Ulysses.]
68 Zipf exponent and sample space. [Figure: (c) model exponent α_model vs nestedness κ for vocabulary sizes N = 1000, 10000, ...]
69 SSR random walks on networks: imagine a random network (directed, ordered, random); pick a start node and an end node; perform a random walk from the start node to the end node. Prediction: the visiting frequency of nodes follows a Zipf law.
70 A random walk on a directed, ordered network is SSR. [Figure: (a) a small fully connected, ordered network with jump probabilities 1/5, 1/4, 1/3, 1/2, an exit probability p_exit, and a Stop node; (b) node occupation probabilities (1/2, 1/4, 1/8, ... in the acyclic case, p_exit = 0.3) vs node rank, and path probability vs path rank. Note: fully connected.]
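A sketch of the prediction on a sparse random DAG (my construction, not the talk's exact topology: node i links to each lower-indexed node independently with probability p_edge):

```python
import numpy as np

rng = np.random.default_rng(4)

def dag_walk_visits(N, p_edge, walks, rng):
    """Random walks on a random directed acyclic graph, from node N toward node 1."""
    # successors of node i are a random subset of {1, ..., i-1}
    succ = {i: np.flatnonzero(rng.random(i - 1) < p_edge) + 1 for i in range(2, N + 1)}
    counts = np.zeros(N + 1)
    for _ in range(walks):
        i = N
        while i > 1 and succ[i].size > 0:   # stop at node 1 or at a dead end
            i = int(rng.choice(succ[i]))
            counts[i] += 1
    return counts / walks

p = dag_walk_visits(N=500, p_edge=0.1, walks=20_000, rng=rng)
print(p[1], p[2], p[10])    # approximately Zipf: p(i) ~ 1/i
```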
71 Random walks on not fully connected ER graphs are SSR, too.
72 Random walks on various network topologies. [Figure]
73 SSR processes and fragmentation: take a stick of initial length L (a spaghetti) and mark it at some point. Break the stick at a random position; keep the fragment with the mark and record its length. Break this fragment again at a random position, keep the fragment with the mark, and record its length. Repeat until the marked fragment reaches a minimal length, say the diameter of a spaghetti atom. Note: the sizes of all fragments are more complicated than SSR (Krapivsky 1994).
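A sketch of the marked-fragment cascade (my code; the minimal length lmin stands in for the spaghetti-atom diameter):

```python
import numpy as np

rng = np.random.default_rng(5)

def marked_fragment_lengths(L0, lmin, rng):
    """Repeatedly break the current fragment at a uniform point, keep the
    piece that contains the mark, and record the kept lengths until < lmin."""
    left, length = 0.0, L0
    mark = rng.random() * L0                  # position of the mark on [0, L0]
    lengths = []
    while length > lmin:
        cut = left + rng.random() * length    # uniform cut inside the fragment
        if cut > mark:
            length = cut - left               # mark is in the left piece
        else:
            left, length = cut, left + length - cut   # mark is in the right piece
        lengths.append(length)
    return lengths

# pool many cascades; the recorded lengths realize an SSR process on (0, L0]
all_lengths = [x for _ in range(10_000) for x in marked_fragment_lengths(1.0, 1e-6, rng)]
```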
74 What is the entropy of an SSR process? Entropy is used to talk about three very different things:
(1) Entropy has to do with quantifying the information production of a source
(2) Entropy has to do with thermodynamics: for example dU = T dS − p dV
(3) Entropy has to do with statistical inference: maximal for a system in equilibrium
For ergodic systems these topics are related. In all cases entropy looks something like S = −k Σ_{i=1}^W p_i log p_i. This is not necessarily true for complex, non-ergodic systems!
75 Information production rate of SSR: S_IP = (1/2) log N + 1
76 Phase-space growth of SSR: S_thermo = S_{c,d} with (c, d) = (1 − 1/8, 0)
77 Max-ent of SSR:
S_maxent = −Σ_{i=2}^W [ p_i log(p_i/p_1) + (p_1 − p_i) log(1 − p_i/p_1) ]
78 Conclusions: strict SSR processes produce exact Zipf laws; the noise contribution determines the power exponent; applications in language formation, diffusion on networks, fragmentation processes, human behaviour?; SSR processes are a new and independent route to understanding scaling; SSR processes are non-ergodic and have 3 entropies.
79 The MEP works for Pólya urns
80 Pólya urns: a paradigmatic class of stochastic processes with path dependence; self-reinforcement breaks the multinomial structure of random processes.
81 Pólya urns generalize hypergeometric models (sampling without replacement) and Bernoulli models (sampling with replacement); they are self-reinforcing: the rich get richer, or the winner takes all.
82 Pólya urns are related to... the beta-binomial distribution; Dirichlet processes; the Chinese restaurant problem; models in population genetics; generalizations of the central limit theorem; applications in adaptive clinical trials; modeling tissue growth; path-dependent institutional development; computer data structures; reform resistance of sectors of EU politics; aging of alleles and Ewens's sampling formula; image segmentation and labeling; evolutionary dynamics; ...
83 Pólya urn processes: initial condition: a_i balls of color i = 1, ..., W; initial prior probabilities to draw a ball: q_i = a_i/A_0; total number of balls initially in the urn: A_0 = Σ_i a_i. Drawing without replacement → hypergeometric process; drawing with replacement → multinomial process; replace a drawn ball of color i by δ balls of the same color → Pólya urn. Priors... θ = (q_1, ..., q_W; A_0, δ, N); γ = δ/A_0
84 Pólya urn processes: let δ > 0. After N trials, balls of type i: a_i(x) = a_i + δ k_i; total number A(N) = Σ_i a_i(x) = A_0 + δN. Conditional probability of drawing a ball of type i at step N + 1:
p(i|k, θ) = a_i(x)/A(N) = (a_i + δ k_i)/(A_0 + δN).
Start with the empty sequence x(0) = []; compute the probability of a sequence and then the probability of a histogram.
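A minimal simulation of the urn (my sketch, illustrative parameters), following the update a_i → a_i + δ after each draw of colour i:

```python
import numpy as np

rng = np.random.default_rng(6)

def polya_histogram(a0, delta, N, rng):
    """Draw colour i with probability a_i/A, then add delta balls of colour i;
    return the histogram k = (k_1, ..., k_W) after N draws."""
    a = np.asarray(a0, dtype=float).copy()
    k = np.zeros(len(a), dtype=int)
    for _ in range(N):
        i = int(rng.choice(len(a), p=a / a.sum()))
        k[i] += 1
        a[i] += delta                 # self-reinforcement
    return k

W = 10
print(polya_histogram(np.ones(W), delta=2, N=10_000, rng=rng))
# typically far from uniform: a few colours dominate (rich-get-richer)
```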
85 ... some work later: the probability to observe histogram k after N trials is
P(k|θ) = (N choose k) Π_{i=1}^W a_i^{(δ,k_i)} / A_0^{(δ,N)}, with m^{(δ,r)} ≡ m(m + δ)(m + 2δ) ··· (m + (r − 1)δ).
In the limit N → ∞ we get
log P(k|θ) ≃ Σ_{i=1}^W log p_i + (1/γ) Σ_{i=1}^W q_i log p_i
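For numerical work the rising factorial can be rewritten as m^{(δ,r)} = δ^r Γ(m/δ + r)/Γ(m/δ), valid for δ > 0; this identity-based sketch of the exact likelihood is mine, not from the slides:

```python
import numpy as np
from scipy.special import gammaln

def log_polya_likelihood(k, a, delta):
    """log P(k|theta) = log[ multinom(N; k) * prod_i a_i^(delta,k_i) / A_0^(delta,N) ]."""
    k = np.asarray(k, dtype=float)
    a = np.asarray(a, dtype=float)
    N, A0 = k.sum(), a.sum()

    def log_rising(m, r):   # log of m (m+delta) ... (m+(r-1)delta)
        return r * np.log(delta) + gammaln(m / delta + r) - gammaln(m / delta)

    log_multinomial = gammaln(N + 1) - gammaln(k + 1).sum()
    return log_multinomial + sum(log_rising(m, r) for m, r in zip(a, k)) - log_rising(A0, N)

# tiny hand-checkable case: W = 2, a = (1, 1), delta = 1, k = (3, 1) gives P = 0.2
print(log_polya_likelihood(k=[3, 1], a=[1.0, 1.0], delta=1.0), np.log(0.2))
```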
86 Pólya violates practically everything:
S_Polya(p) = −Σ_{i=1}^W log p_i is of class (c, d) = (0, 1)
c = 0 violates SK4; c = 0 violates SK3
S_Polya(p) violates SK2: p_i = 1/W yields a minimum! (−log x is convex, so the uniform distribution minimizes S_Polya)
OK with intuition: self-reinforcing mechanisms drive the system away from the uniform distribution into rich-get-richer scenarios.
87 The maximum entropy principle still works! Violation of axioms SK2, SK3, SK4 does not invalidate the principle. Minimizing the Pólya divergence under the usual normalization constraint means
ψ = S_Polya(p) − α Σ_{i=1}^W p_i − β Σ_{i=1}^W q_i log p_i,
with the negative inverse temperature β = −1/γ < 0. Solving the MEP leads either to the solution p_i = (1/ζ)(q_i − γ) for 0 < p_i < 1 or, if this cannot be satisfied, to p_i = 0.
88 Note! For γ < min(q), the MEP solution p_i = (q_i − γ)/ζ holds for all i. If max(q) > γ > min(q), it holds for those i with q_i > γ; for the others p_i = 0. If γ > max(q), then for one i, p_i = 1 while all others p_j = 0.
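These three cases fall out of a small solver (my sketch; ζ follows from normalization over the active states):

```python
import numpy as np

def polya_mep(q, gamma):
    """MEP solution: p_i = (q_i - gamma)/zeta where positive, else p_i = 0;
    if gamma > max(q), all probability concentrates on the largest prior."""
    q = np.asarray(q, dtype=float)
    p = np.clip(q - gamma, 0.0, None)
    if p.sum() == 0.0:                # gamma above max(q): winner takes all
        p[np.argmax(q)] = 1.0
        return p
    return p / p.sum()                # zeta = sum of the positive parts

q = np.array([0.5, 0.3, 0.2])
print(polya_mep(q, gamma=0.1))        # all states active
print(polya_mep(q, gamma=0.25))       # the state with q_i = 0.2 is frozen out
print(polya_mep(q, gamma=0.6))        # winner takes all
```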
89 Frequency distributions of Pólya processes: assume q_i = 1/W, so the MEP prediction is uniform. How is p_i distributed around 1/W? Look at the frequency and rank distributions of Pólya processes. New variables: n_z(k) = Σ_{i=1}^W χ(k_i = z), where χ is the characteristic function (1 if the argument is true, 0 if false); n_z(k) is the number of states that occur with frequency z (π_z = n_z/W).
90 Frequency distributions of Pólya processes: the probability of observing some n = (n_1, ..., n_N) is P(n|θ) = Σ_{k: n = n(k)} P(k|θ). Maximizing S_Polya(π|θ) gives the frequency distribution for uniform priors:
π_z = n_z/W = (1/ζ) φ^z (z + 1/(γW))^{−(1/(Wγ) + 1)}   (1)
where φ = exp(−β) and ζ = exp(1 + α) N^{1/(Wγ) − 1}.
Rank distribution: for r = 1, ..., W define intervals [t_{r+1}, t_r] with 1/W = ∫_{t_{r+1}}^{t_r} dz π_z; then f(r) = (W/N) ∫_{t_{r+1}}^{t_r} dz π_z z.
91 Frequency distribution of a Pólya urn process: W = 100, δ = 2, uniform initial conditions a_i = 1, N = 10^5 trials. [Figure: simulated frequency distribution vs the MEP prediction.] Note: δ = 1 recovers the Poissonian distribution of the multinomial process.
92 Summary Pólya MEP: the entropy is of the (0, d)-class; it violates SK2, SK3, SK4, but the MEP still works; no power laws, but something really different.
93 Conclusions I: Complex systems are non-ergodic by nature. Interpret CS as those where Shannon-Khinchin axioms 1-3 hold. All statistical systems are uniquely classified by 2 scaling exponents (c, d). A single entropy covers all systems: S_{c,d} = r e Σ_i Γ(1 + d, 1 − c ln p_i) − rc. All known entropies of SK1-SK3 systems are special cases. The distribution functions of all systems are Lambert-W exponentials; there are no other options. Phase-space growth determines the entropy. A maximum entropy principle exists for path-dependent processes.
94 Conclusions II: The information-theoretic approach leads to (c, d)-entropies. If (c, d)-entropies are thermodynamically reasonable, there is a deep connection to phase-space volume (it tells you which systems have which c and d). The extensive and the max-ent entropy are both (c, d)-entropies, but NOT the same. SSR is Markovian yet non-ergodic → exact power laws, whose slope is determined by the noise. SSR is a new route to scaling → huge applicability. Pólya urns are toy systems for self-reinforcing dynamics and history dependence. Pólya urns violate SK2, SK3, SK4, but the MEP holds → no power laws, no stretched exponentials, something very different.
95 A note on Rényi entropy: it is not sooo relevant for CS. Why? Relax Khinchin axiom 4: S(A+B) = S(A) + S(B|A) → S(A+B) = S(A) + S(B). The Rényi entropy S_R = 1/(1 − α) ln Σ_i p_i^α violates our trace form S = Σ_i g(p_i). But: our above argument also holds for Rényi-type entropies S = G(Σ_{i=1}^W g(p_i))! The limit lim_{W→∞} S(λW)/S(W) is then governed by f_G(z) = lim_{R→∞} G(z G^{−1}(R))/R, which for G = ln equals 1.
96 The Lambert-W: a reminder. W solves x = W(x) e^{W(x)}, i.e. it is the inverse of x e^x; correspondingly, the inverse of p ln p is x/W(x) = e^{W(x)}. Delayed differential equations: ẋ(t) = α x(t − τ) is solved by x(t) = e^{(1/τ) W(ατ) t}. Ansatz: x(t) = x_0 exp[(1/τ) f(ατ) t], with f some function.
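Both identities are quick to verify with scipy.special.lambertw (a check I added, not part of the slides):

```python
import numpy as np
from scipy.special import lambertw

x = 2.5
w = float(np.real(lambertw(x)))
print(w * np.exp(w))                        # 2.5: W inverts x * exp(x)

# delay equation x'(t) = alpha * x(t - tau): growth rate is W(alpha*tau)/tau
alpha, tau = 0.8, 1.0
lam = float(np.real(lambertw(alpha * tau))) / tau
print(lam - alpha * np.exp(-lam * tau))     # ~ 0: lam solves lam = alpha * e^(-lam*tau)
```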
97 Amount of information production in a process: consider a Markov chain with states A_1, A_2, ..., A_n and transition probabilities p_ik; the probability of being in state l is P_l = Σ_k P_k p_kl. If the system is in state A_i we have the scheme (A_1, A_2, ..., A_n; p_i1, p_i2, ..., p_in). Then H_i = −Σ_k p_ik log p_ik is the information obtained when the Markov chain moves from A_i one step to the next. Averaging this over all initial states, H = Σ_i P_i H_i = −Σ_i Σ_k P_i p_ik log p_ik; this is the information production as the Markov chain moves one step ahead.
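A compact implementation of this average (my sketch; the stationary distribution P is computed as the leading left eigenvector of the transition matrix):

```python
import numpy as np

def information_production(P):
    """H = sum_i P_i H_i with H_i = -sum_k p_ik log p_ik, P the stationary distribution."""
    vals, vecs = np.linalg.eig(P.T)
    stat = np.real(vecs[:, np.argmax(np.real(vals))])   # left eigenvector, eigenvalue 1
    stat = stat / stat.sum()
    logP = np.where(P > 0, np.log(np.where(P > 0, P, 1.0)), 0.0)  # 0*log(0) -> 0
    H_rows = -(P * logP).sum(axis=1)                    # H_i for each state
    return float(stat @ H_rows)

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
print(information_production(P))   # nats per step
```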