Lecture IV A. Shannon s theory of noisy channels and molecular codes

Similar documents
In previous lecture. Shannon s information measure x. Intuitive notion: H = number of required yes/no questions.

The Physical Language of Molecules

Objective: You will be able to justify the claim that organisms share many conserved core processes and features.

Aoife McLysaght Dept. of Genetics Trinity College Dublin

Casting Polymer Nets To Optimize Molecular Codes

CODING A LIFE FULL OF ERRORS

Using an Artificial Regulatory Network to Investigate Neural Computation

Genetic Code, Attributive Mappings and Stochastic Matrices

Edinburgh Research Explorer

Reducing Redundancy of Codons through Total Graph

A modular Fibonacci sequence in proteins

UNIVERSITY OF CAMBRIDGE INTERNATIONAL EXAMINATIONS General Certifi cate of Education Advanced Subsidiary Level and Advanced Level

Genetic code on the dyadic plane

Biology 155 Practice FINAL EXAM

Natural Selection. Nothing in Biology makes sense, except in the light of evolution. T. Dobzhansky

A p-adic Model of DNA Sequence and Genetic Code 1

A Minimum Principle in Codon-Anticodon Interaction

CHEMISTRY 9701/42 Paper 4 Structured Questions May/June hours Candidates answer on the Question Paper. Additional Materials: Data Booklet

The degeneracy of the genetic code and Hadamard matrices. Sergey V. Petoukhov

Mathematics of Bioinformatics ---Theory, Practice, and Applications (Part II)

A Mathematical Model of the Genetic Code, the Origin of Protein Coding, and the Ribosome as a Dynamical Molecular Machine

Lect. 19. Natural Selection I. 4 April 2017 EEB 2245, C. Simon

THE GENETIC CODE INVARIANCE: WHEN EULER AND FIBONACCI MEET

ATTRIBUTIVE CONCEPTION OF GENETIC CODE, ITS BI-PERIODIC TABLES AND PROBLEM OF UNIFICATION BASES OF BIOLOGICAL LANGUAGES *

Natural Selection. Nothing in Biology makes sense, except in the light of evolution. T. Dobzhansky

Analysis of Codon Usage Bias of Delta 6 Fatty Acid Elongase Gene in Pyramimonas cordata isolate CS-140

The genetic code, 8-dimensional hypercomplex numbers and dyadic shifts. Sergey V. Petoukhov

Practical Bioinformatics

Three-Dimensional Algebraic Models of the trna Code and 12 Graphs for Representing the Amino Acids

Crystal Basis Model of the Genetic Code: Structure and Consequences

Ribosome kinetics and aa-trna competition determine rate and fidelity of peptide synthesis

Foundations of biomaterials: Models of protein solvation

The Genetic Code Degeneracy and the Amino Acids Chemical Composition are Connected

Abstract Following Petoukhov and his collaborators we use two length n zero-one sequences, α and β,

In Silico Modelling and Analysis of Ribosome Kinetics and aa-trna Competition

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

Get started on your Cornell notes right away

Advanced topics in bioinformatics

High throughput near infrared screening discovers DNA-templated silver clusters with peak fluorescence beyond 950 nm

In Silico Modelling and Analysis of Ribosome Kinetics and aa-trna Competition

SUPPORTING INFORMATION FOR. SEquence-Enabled Reassembly of β-lactamase (SEER-LAC): a Sensitive Method for the Detection of Double-Stranded DNA

PROTEIN SYNTHESIS INTRO

Supplementary Information for

2013 Japan Student Services Origanization

SSR ( ) Vol. 48 No ( Microsatellite marker) ( Simple sequence repeat,ssr),

Clay Carter. Department of Biology. QuickTime and a TIFF (Uncompressed) decompressor are needed to see this picture.

Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis

NSCI Basic Properties of Life and The Biochemistry of Life on Earth

Crick s early Hypothesis Revisited

Nature Structural & Molecular Biology: doi: /nsmb Supplementary Figure 1

RIBOSOME: THE ENGINE OF THE LIVING VON NEUMANN S CONSTRUCTOR

Supplemental data. Pommerrenig et al. (2011). Plant Cell /tpc

Lesson Overview. Ribosomes and Protein Synthesis 13.2

Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics

Slide 1 / 54. Gene Expression in Eukaryotic cells

SUPPLEMENTARY DATA - 1 -

An algebraic hypothesis about the primeval genetic code

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Characterization of Pathogenic Genes through Condensed Matrix Method, Case Study through Bacterial Zeta Toxin

Chemistry Chapter 26

Evolvable Neural Networks for Time Series Prediction with Adaptive Learning Interval

Gene Finding Using Rt-pcr Tests

3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies

DO NOT OPEN THE EXAMINATION PAPER UNTIL YOU ARE TOLD BY THE SUPERVISOR TO BEGIN

Gambling and Information Theory

Building a Multifunctional Aptamer-Based DNA Nanoassembly for Targeted Cancer Therapy

NIH Public Access Author Manuscript J Theor Biol. Author manuscript; available in PMC 2009 April 21.

Genome and language two scripts of heredity

Supporting Information

The Trigram and other Fundamental Philosophies

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Number-controlled spatial arrangement of gold nanoparticles with

Six Fractal Codes of Biological Life: Perspectives in Astrobiology and Emergence of Binary Logics

arxiv: v2 [physics.bio-ph] 8 Mar 2018

Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective

Protein Threading. Combinatorial optimization approach. Stefan Balev.

Supplemental Table 1. Primers used for cloning and PCR amplification in this study

Electronic supplementary material

Timing molecular motion and production with a synthetic transcriptional clock

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation.

Transcription Attenuation

SYMMETRIES, GENERALIZED NUMBERS AND HARMONIC LAWS IN MATRIX GENETICS

TM1 TM2 TM3 TM4 TM5 TM6 TM bp

Journal of Biometrics & Biostatistics

Recent Evidence for Evolution of the Genetic Code

Table S1. Primers and PCR conditions used in this paper Primers Sequence (5 3 ) Thermal conditions Reference Rhizobacteria 27F 1492R

Why do more divergent sequences produce smaller nonsynonymous/synonymous

A Brief History of Life on Earth

Fundamental mathematical structures applied to physics and biology. Peter Rowlands and Vanessa Hill

Supplemental Figure 1.

Codon Distribution in Error-Detecting Circular Codes

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation.

THE MATHEMATICAL STRUCTURE OF THE GENETIC CODE: A TOOL FOR INQUIRING ON THE ORIGIN OF LIFE

Information Theory and Networks

Videos. Bozeman, transcription and translation: Crashcourse: Transcription and Translation -

Molecular Evolution and Phylogenetic Analysis

Week 6: Protein sequence models, likelihood, hidden Markov models

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley

From DNA to protein, i.e. the central dogma

Supporting Information for. Initial Biochemical and Functional Evaluation of Murine Calprotectin Reveals Ca(II)-

Transcription:

Lecture IV A Shannon s theory of noisy channels and molecular codes

Noisy molecular codes: Rate-Distortion theory S Mapping M Channel/Code = mapping between two molecular spaces. Two functionals determine the fitness of the code: Fitness = Rate(map) + Distortion(map). Mapping becomes non-random at a coding transition. Topological aspects.

Geometry of molecular information channels M S Marble packing (A) Max colors. (B) Same\similar color of neighbors.

Genetic Code The genetic code is main info channel of life S DNA 4 letters..acggagguaccc. M Protein 20 letters Thr Glu Val Pro Genetic code: translates 3-letter words in 4-letter DNA language (4 3 = 64 codons) to protein language of 20 amino acids. Proteins are amino acid polymers. Diversity of amino-acids is essential to protein functionality.

The genetic code maps codons to amino-acids Molecular code = map relating two sets of molecules (spaces, languages ) via molecular recognition. Spaces defined by similarity of molecules (size, polarity etc.) amino 20 amino-acids acid trna UUU UUA 64 codons UCU UCA UAU UAA UGU UGA codon UUC UUG UCC UCG UAC UAG UGC UGG CUU CUA CCU CCA CAU CAA CGU CGA Genetic Code CUC AUU CUG AUA CCC ACU CCG ACA CAC AAU CAG AAA CGC AGU CGG AGA AUC AUG ACC ACG AAC AAG AGC AGG GUU GUA GCU GCA GAU GAA GGU GGA GUC GUG GCC GCG GAC GAG GGC GGG

The genetic code is a smooth mapping Amino-acid polarity M S Degenerate (20 out of 64). Compactness of amino-acid regions. Smooth (similar color of neighbors). Generic properties of molecular codes?

Distortion Challenges of molecular codes: rate and distortion Noise, crowded milieu. Competing lookalikes. Weak recognition interactions ~ k B T. Need diverse meanings. Synthesis of reliable organisms from unreliable components (von Neumann, Automata Studies 1956) Rate How to construct the low-rate molecular codes at minimal cost of resources? Rate-distortion theory (Shannon 1956) D Goodsell

Fitter codes have minimal distortion Amino-acids α Encoder E Codons i Distortion C Reader R ω Decoder D j trna Q C Ppath C Tr E R DC paths Distortion of noisy channel, Q = average distortion of AA. R defines topology of codon space. C defines topology of amino-acid space. (TT, J Theo Bio 2007, PRL 2008, PNAS 2008)

amino-acid Smooth codes minimize distortion Noise confuses close codons. Smooth code: close codons = close amino-acids. minimal distortion. Optimal code must balance contradicting needs for smoothness and diversity. Max smoothness Min diversity Min smoothness Max diversity 1 64 20 # amino-acids Marble game (TT, Bio Phys 2008)

Channel rate is code s cost Diverse codes require high specificity = high binding energies ε. Cost ~ average binding energy < ε >. Binding prob. ~ Boltzmann: E ~ e ε/t. I E ln E, i i i i E α Encoder E i Cost I = Channel Rate (bits/message)

Code s fitness combines rate and distortion of map Fitness = Gain x Distortion + Rate H Q I Gain β increases with organism complexity and environment richness. Fitness H is free energy with inverse temperature β. Evolution varies the gain β. Population of self-replicators evolving according to code fitness H: mutation, selection, random drift.

Code emerges at a critical coding transition Low gain β : Cost too high Rate I no specificity no code. Code emerges when β increases: channel starts to convey information (I 0). Coding transition Continuous phase transition. Emergent code is smooth, low mode of R. Distortion Q no-code code codes Rate-distortion theory (Shannon 1956)

Emergent code is a smooth mode of error-laplacian Lowest excited modes of graph-laplacian R. Single maximum for lowest excited modes (Courant). Every mode corresponds to amino-acid : # low modes = # amino-acids. single contiguous domain for each amino-acid. Smoothness.

Probable errors define the graph and the topology of the genetic code Codon graph = codon vertices + 1-letter difference edges (mutations). K 4 X K 4 X K 4 A T G C X A T G C A X T G C AGG AAG CAG TGA AGA CAA CCA Non-planar graph (many crossings). TAA TTA ATA ATC AAC AAA GAC GAA ACA ACT AAT GAT A Genus γ = # holes of embedding manifold. Graph is holey : embedded in γ = 41 (lower limit is γ = 25)

Coloring number limits number of amino-acids Q: Minimal # colors suffices to color a map where neighboring countries have different colors? A: Coloring number, a topological invariant (function of genus): 1 ( ) 7 1 48 chr. 2 (Ringel & Youngs 1968) chr(41) 25 chr(25) 20 max(# amino-acids) chr( ) From Courant s theorem + convexity (tightness). Genetic code: γ = 25-41 coloring number = 20-25 amino-acids (TT J Lin Alg 2008)

The genetic code coevolves with accuracy A path for evolution of codes: from early codes with higher codon degeneracy and fewer amino acids to lower degeneracy codes with more amino acids. 1 st 2 nd 3 rd chr # 1 4 1 0 4 2 4 1 1 7 4 4 1 5 11 4 4 2 13 16 4 4 3 25 20 4 4 4 41 25

Lecture IV B Growth rate as entropy rate: Kelly s horse race

Horse race basics Lucky Star: Odds: 2:1 White light: Odds: 3:1 Sea biscuit Odds: 6:1

Horse race basics (cont.) Problem: You dedicate 100$ for gambling, you intend to reinvest the money over and over again, what is the optimal strategy? Kelly s idea: try to optimize asymptotic growth rate.

Asymptotic growth Let W(n) be your wealth after n bets. Let W(0) be you initial wealth Growth rate 1 W( n) log2 n W(0) Asymptotic growth rate lim n 1 W ( n log ) 2 n W(0)

Constant rebalancing Each race has a random outcome X drawn from the distribution P(X) assumed constant ( t P=0). The percentage of money placed on the i-th horse of the n-th round is W(n-1)*b(i). The amount gained : W(n)=O(x) W(n-1) b(x)

Constant rebalancing After N such trials (with rebalancing) : W ( n) W (0) O( X ) b( X )... O( X ) b( X ) 1 1 N N So Wn ( ) log2 log 2 O( X1) b( X1)... log 2 O( X N) b( X N) W(0) Since X is memoryless and P(X) is constant we obtain for N>>1 O X b X O X N b X N ( )log ( ) ( )... ( )log ( ) ( ) log ( ) ( )... log ( ) ( ) 2 1 1 2 N P X O X b X P X O X b X 1 2 1 1 N 2 N N

Asymptotic growth rate Conclusion so far 1 W( n) log2 n W(0) 1 log 2 O ( X1 ) b ( X1 )... log 2 O ( X n ) b ( X n ) n n i p i log ( O b ) 2 i i

Optimal strategy if P(X) is known Suppose we know the probability of winning for all the horses p i. What is the optimal bet-hedging strategy? i This strategy is termed proportional betting. p log 2 O b b i i i i i p i 0 b b i i b i p i

Example: Two horses Odds: 2:1 (double or nothing), i.e. O 1 = O 2 = 2. Let p 1 be the probability the 1 st horse will win and b 1 the portion of the wealth that placed on this 1 st horse. Let s plot the asymptotic growth rate:

Example : a race with 2 horses double or nothing Min-max saddle point

The saddle point is A zero-sum game against nature The asymptotic growth rate ( b,p) The game is: I choose b / nature choose p ( b,p) What is the minimal growth I can assure if nature is evil? Answer: min-max solution b mnmx p mnmx j 1 i 1 j O O In the example shown b mnmx p mnmx 1 2

Growth rate in horse race ( b, p) D( p p ) D( b p) v mnmx D(p q) relative (KL) entropy 1 st term - pessimists surprise (free lunch). 2 nd term - distance from optimum (note the sign). 3 rd term - game value.

Side information Race at LA, bookie in NY and I have a friend in the telegraph company Perfect side information = exponential growth. What about partial information?

Side information (cont.) Informer says horse j will win. The probability for i-th to win given the side information that the j-th horse will win is p i j The adjusted portfolio is b i j

Optimal betting with side information ( b, p) p p log ( O b ) i, j j i j 2 i i j D( p p ) I( X ; Y ) p D( p b ) v mnmx j j j i, j I( X ; Y ) is the mutual information between the informer and us (X horse, Y side information). Kelly s famous result retrieved at optimality: Optimal gain of capital = Channel Capacity

So why study horse races? Biology Only manifestation of channel capacity without an explicit code. Cells ~ money, Phenotype ~ betting, nature s state ~ winning horse, Portfolio = phenotype distribution side Info. = sensing BUT: (i) Suboptimal phenotype immediate ruin. (ii) P=P(t) (non-stationary).

Generalized Kelly (Main result) ( b, p) p p log ( O b ) i, j j i j 2 ik k j k Given that the sum of O -1 columns is positive, Λ decomposes to a sum of entropies: new game value 1 mnmx j j j i, j D( p p ) I( X ; Y ) p D( p S b ) v Free lunch term Side info. Channel rate Penalty term: distance between actual p( j) to the p( j) you happen to be optimal for.

Generalized Kelly (Main result) cont. Iff b opt (p) > 0 then b opt (p) =S p. S 1 ij 1 ij 1 ij O O is a stochastic matrix. j S -1 b ( j) is the conditional environment probability that b ( j) is optimal for in the adjusted game. So the penalty term : = average loss due to the sub-optimal response to the side information. i, j 1 j j S b j p D( p )

Environment is non-stationary Slow changes: Λ(p(t),b(t)) is meaningful Problem: given K phenotypic switchings allowed within [0,T] find optimum switching strategy (when and to what)

When P=P(t) Our solution: when? : equipartition of loss criterion to what?: adjusted time average

Monte-Carlo (binary symmetric channel) Optimal sol. with P error =1/3 P(t)=cos(2πt/35) 2O 11 =O 22 =1 O 21 =0,O 12 =1

Conclusion: If there is a dilemma an increase of 1 bit in the side information rate can potentially increase the doubling rate by 1 bit/generation.