It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, with birds singing on the bushes, with various insects flitting about, and with worms crawling through the damp earth, and to reflect that these elaborately constructed forms, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us. Charles Darwin
Natural Selection
"Nothing in biology makes sense except in the light of evolution." T. Dobzhansky
Charles Darwin, 1859, The Origin of Species
- 3 key ingredients for adaptation by natural selection:
  - Exponential growth of populations
  - Struggle for existence: limited capacity for any population
  - Variable, heritable survival and reproduction
- The unity of life: all species have descended from other species
- Builds on Malthus, An Essay on the Principle of Population, 1798
- Domestic breeding shows hereditary modification is possible
- Fitness is a characteristic of individuals
- Natural selection operates on populations
- Fitness is defined only for a particular environment; environments always change because species form the selective environments of other species
- Is "survival of the fittest" a circular statement?
- Is natural selection an optimization process?
Natural Selection
Natural selection:
- is often slow, but arms races result in complex, wonderful, bizarre (and stupid) things
- can lead to cooperation
- is (largely) based on the fitness of reproductive individuals
Natural selection is not:
- learned behavior passed on
- group selection (Dawkins: selection acts on genes & on individuals, not groups). Exceptions?
There's a lot we don't know about evolution:
- The role of symbiosis & cooperation
- The right definition of species
Darwin did not have a mechanism that allowed for heritable, variable fitness
- Genes: strings of DNA that get transcribed to RNA, translated to proteins and expressed as phenotype
- A string of molecular symbols, AACCGGTAGTCTATGCTAGTGGGGTTTTAATAAT, is turned into a protein that makes your hair brown, curly or fall out when you're 30
Genetics
Mendel showed that genes exist by breeding pea plants
- Genes exist as recessives and dominants, one copy from each parent
- Given dominant AA mom and recessive aa dad, offspring are all Aa, and look like mom
- Variation comes from combining genes from mom (BbCCddZz) and dad (bbccddzz)
In 1953 Watson, Crick & Rosalind Franklin discover the molecular structure of DNA
- DNA is the molecule that carries the heritable information
- Mutations, sex, crossing over in DNA provide the variation
- Every cell in your body has a genome of about 3 billion bp of DNA (some 20,000-30,000 genes) that is transcribed into RNA and translated into proteins
- Proteins do all the work: make your eyes blue, your hair curly, your muscles strong, your heart pump
- DNA is arranged into genes on chromosomes
- Humans have 23 pairs of chromosomes (46 in total)
- Fits by supercoiling: 2-3 m DNA / cell, your DNA goes to moon and back 70 times!
What mechanisms allow for heritable, variable fitness?
Heritable
- Genes: encoded in DNA, transcribed to RNA, translated to proteins whose expression determines fitness
Variable
- Mutations: copies are not perfect
- Sex: genes are combined from 2 parents
- Crossing over allows for many different possible combinations
DNA
DNA = deoxyribonucleic acid
- Unit: nucleotide. Sugar ring with a base (A, T, G, C) and phosphate group
- Bases: Adenine, Thymine, Cytosine, Guanine
- Base pairing: A-T, G-C
- Every cell in your body has a genome of about 3 billion bp of DNA that is transcribed into RNA and translated into proteins
- DNA is arranged into genes on chromosomes
- Humans have 23 pairs of chromosomes (46 in total)
- Fits by supercoiling: 2-3 m DNA / cell, your DNA goes to moon and back 70 times!
The Central Dogma
RNA codon table
4 bases, 3 per codon = 4^3 = 64 codons
20 amino acids (redundancy is possible)
This table shows the 64 codons and the amino acid each codon codes for. The direction is 5' to 3'.

Ala/A  GCU, GCC, GCA, GCG            Leu/L  UUA, UUG, CUU, CUC, CUA, CUG
Arg/R  CGU, CGC, CGA, CGG, AGA, AGG  Lys/K  AAA, AAG
Asn/N  AAU, AAC                      Met/M  AUG
Asp/D  GAU, GAC                      Phe/F  UUU, UUC
Cys/C  UGU, UGC                      Pro/P  CCU, CCC, CCA, CCG
Gln/Q  CAA, CAG                      Ser/S  UCU, UCC, UCA, UCG, AGU, AGC
Glu/E  GAA, GAG                      Thr/T  ACU, ACC, ACA, ACG
Gly/G  GGU, GGC, GGA, GGG            Trp/W  UGG
His/H  CAU, CAC                      Tyr/Y  UAU, UAC
Ile/I  AUU, AUC, AUA                 Val/V  GUU, GUC, GUA, GUG
START  AUG                           STOP   UAG, UGA, UAA
Strings of amino acids -> Proteins
- Primary, secondary and tertiary structure
- Proteins do all the work, but 99% of human DNA is not translated into protein
- Why carry around all that junk?
  - Some is not expressed in some cells or conditions
  - Some is evolution's playground
Variation in DNA
How can the genetic content of a strand of DNA change?
- Mutagens: many types of direct mutations (UV, particle radiation, oxygen radicals, other chemicals)
- Sex (Mendelian genetics)
- Chromosomal crossing over during meiosis
- Gene exchange via gene transfer in bacteria
- Viral DNA insertion and exchange (viruses do not have cellular machinery to reproduce their genomes, so they use ours, and mistakes happen)
- Many ways we don't understand
Crossing over Each cell has 2 copies of every gene, but sperm and eggs each have 1. The process of creating 1 from 2 is meiosis (with crossing over) In sexual reproduction Mom:AAACATCCGTTAA (tall, blue eyes, no toe hair) ----->AAACATTCCGGA ---> tall, brown eyes, hairy toes Dad: AGGCCTTCCGGAA (short, brown eyes, hairy toes) A new offspring is created by combining 1 chromosome from an egg and 1 from a sperm
Summary: Genetics & Natural Selection 3 key ingredients for adaptation by natural selection Exponential growth of populations Struggle for existence: Limited Capacity for any population Variable, heritable survival and reproduction Genetics: A discrete 4 letter alphabet (AGCT), packaged into genes, that code for proteins Variation and Heredity Letters can change: mutations, insertions, deletions Chromosomes crossover to create sperm & eggs Sperm and eggs combine to make new offspring Each cell has the same DNA In a tremendously complicated process DNA is transcribed into RNA and RNA is translated into proteins that cause phenotype
4 billion years ago A proto-bacteria made a copy of itself A long time (bacteria can reproduce in 20 minutes) A lot of individuals A very good (and inevitable) process Massively parallel search Partial solutions are conserved Arms races Molecules for storing info, processing info, doing work Result: You, me & billions of species Discussion: natural selection as a complex adaptive system
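DNA:     ...ATG GCT GTT CAG TAG CGT...
RNA:     ...AUG GCU GUU CAG UAG CGU...
Protein: Met Ala Val Gln Stop (translation ends at the stop codon, so the trailing CGU, which would code for Arg, is not read)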
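The DNA -> RNA -> protein example above can be traced in a few lines of Python. This is a minimal sketch, not a model of real transcription: it assumes the DNA shown is the coding strand (so transcription is just T -> U), and the names `transcribe` and `translate` are illustrative. The codon table is the one from the RNA codon table slide.

```python
# Codon table from the slide, keyed by amino acid (three-letter code).
CODONS_BY_AMINO_ACID = {
    "Ala": "GCU GCC GCA GCG", "Leu": "UUA UUG CUU CUC CUA CUG",
    "Arg": "CGU CGC CGA CGG AGA AGG", "Lys": "AAA AAG",
    "Asn": "AAU AAC", "Met": "AUG",
    "Asp": "GAU GAC", "Phe": "UUU UUC",
    "Cys": "UGU UGC", "Pro": "CCU CCC CCA CCG",
    "Gln": "CAA CAG", "Ser": "UCU UCC UCA UCG AGU AGC",
    "Glu": "GAA GAG", "Thr": "ACU ACC ACA ACG",
    "Gly": "GGU GGC GGA GGG", "Trp": "UGG",
    "His": "CAU CAC", "Tyr": "UAU UAC",
    "Ile": "AUU AUC AUA", "Val": "GUU GUC GUA GUG",
    "Stop": "UAG UGA UAA",
}
# Invert to codon -> amino acid (64 entries).
CODON_TO_AA = {
    codon: aa
    for aa, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons.split()
}


def transcribe(dna):
    """Simplified transcription: coding-strand DNA to mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Read codons 5' to 3' and stop at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TO_AA[rna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


rna = transcribe("ATGGCTGTTCAGTAGCGT")
print(rna)             # AUGGCUGUUCAGUAGCGU
print(translate(rna))  # ['Met', 'Ala', 'Val', 'Gln']
```

The redundancy mentioned on the codon slide shows up directly: 64 codons map onto 20 amino acids plus Stop.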
Key Concepts Discussion Introduction to The Origin of Species The Central Dogma
Genetic Algorithms
Principles of natural selection applied to computation:
- Variation
- Selection
- Inheritance
Evolution in a computer:
- Individuals (genotypes) stored in computer's memory as bit strings
- Evaluation of individuals (artificial selection)
- Differential reproduction through copying and deletion
- Variation introduced by analogy with mutation and crossover
Genetic Algorithms
Initialize a population, P
Repeat:
    Create an empty population, P'
    Until P' is full:
        Select 2 individuals from P based on fitness
        Apply mutation, mating, crossover
        Add the individuals to P'
    Set P = P'
[Figure: P at T = n -> P' -> P at T = n+1]
Define the individuals (string of bits or letters, representation matters) Define a fitness function that evaluates the string Define the rules for selecting individuals (e.g. roulette or tournament) mutation (e.g., some % of bits flip) mating (usually 2 parents) cross over (e.g., probability of crossing over at each position; 1 point, 2 point, n point)
Fitness functions
- Raw fitness: f_raw = % of correct bits; selective pressure decreases as answer gets closer
- Scaled fitness: 2^(f_raw). One more correct letter is twice as fit (only works in simple cases)
- Normalized fitness: fitness divided by the average fitness in the population
Selection Methods
- Fitness proportionate (roulette wheel): probability of appearing in P' is proportional to normalized fitness
- Tournaments: pick (usually) 2 individuals from P, compare fitness, put more fit individual in P'. Sample with replacement.
- Elitism: guarantee best x solutions will appear in P'
- Implicit fitness in agent based models: ability to reproduce determines fitness
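The fitness and selection rules above are short enough to sketch directly. The following is one possible reading, not a reference implementation: the function names are illustrative, and `scaled_fitness` assumes the exponent counts correct letters (rather than a percentage), which is what makes one more correct letter exactly twice as fit.

```python
import random


def raw_fitness(candidate, target):
    """Fraction of positions that match the target."""
    return sum(a == b for a, b in zip(candidate, target)) / len(target)


def scaled_fitness(candidate, target):
    """Assumed scaling: 2 ** (number of correct letters)."""
    return 2 ** sum(a == b for a, b in zip(candidate, target))


def normalized(fitnesses):
    """Each fitness divided by the population average."""
    mean = sum(fitnesses) / len(fitnesses)
    return [f / mean for f in fitnesses]


def roulette_select(population, fitnesses, rng):
    """Fitness proportionate selection of one individual."""
    return rng.choices(population, weights=fitnesses, k=1)[0]


def tournament_select(population, fitnesses, rng, size=2):
    """Sample `size` individuals with replacement, keep the fittest."""
    contenders = [rng.randrange(len(population)) for _ in range(size)]
    return population[max(contenders, key=lambda i: fitnesses[i])]


def elite(population, fitnesses, x):
    """The best x individuals, copied unchanged into the next population."""
    ranked = sorted(range(len(population)),
                    key=lambda i: fitnesses[i], reverse=True)
    return [population[i] for i in ranked[:x]]


rng = random.Random(0)
pop = ["abxx", "abcx", "xxxx"]
fits = [raw_fitness(s, "abcd") for s in pop]
print(normalized(fits))
print(roulette_select(pop, fits, rng), tournament_select(pop, fits, rng))
print(elite(pop, fits, 1))
```

Roulette selection is sensitive to how fitness is scaled, while a tournament only looks at rank within the sampled pair, which is one reason the two can behave differently late in a run when raw fitness values bunch together.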
Simple example: evolve a string
Find the string "Furious green ideas sweat profusely"
There are 27^35 possible strings
GA: 500 strings, crossover rate 75%, mutation rate 1%

Time  avg f  best f  Best string
0     .035   .20     pjrmrubynrksiidwctxfodkodjjzfunpk
1     .070   .26     pjrmrubynrksxiidnybvswcqo piisyexdt
26    .72    .80     qurmous green idnasvsweqt prifuseky
42    .90    .97     qurious green ideas sweat profusely
46    .94    1       curious green ideas sweat profusely

Massively parallel directed search is effective when there is 1 correct answer.
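A GA of this kind fits in well under a page. The sketch below uses the slide's parameters (500 strings, 75% crossover, 1% mutation per character, 27-symbol alphabet) but everything else is an assumption made for illustration: tournament selection of size 2, one-point crossover, a single elite copy per generation, and the lowercase target string. It is not the code that produced the run shown above, and its generation counts will differ.

```python
import random
import string

ALPHABET = string.ascii_lowercase + " "          # 27 symbols
TARGET = "furious green ideas sweat profusely"   # 35 characters


def fitness(s):
    """Raw fitness: fraction of characters matching TARGET."""
    return sum(a == b for a, b in zip(s, TARGET)) / len(TARGET)


def tournament(pop, rng):
    """Pick 2 individuals with replacement and return the fitter one."""
    a, b = rng.choice(pop), rng.choice(pop)
    return a if fitness(a) >= fitness(b) else b


def crossover(mom, dad, rng, rate=0.75):
    """One-point crossover, applied with probability `rate`."""
    if rng.random() < rate:
        cut = rng.randrange(1, len(mom))
        return mom[:cut] + dad[cut:], dad[:cut] + mom[cut:]
    return mom, dad


def mutate(s, rng, rate=0.01):
    """Replace each character by a random symbol with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else c
                   for c in s)


def evolve(pop_size=500, generations=100, seed=0):
    rng = random.Random(seed)
    pop = ["".join(rng.choice(ALPHABET) for _ in TARGET)
           for _ in range(pop_size)]
    history = []                          # best fitness per generation
    for _ in range(generations):
        best = max(pop, key=fitness)
        history.append(fitness(best))
        if best == TARGET:
            break
        new_pop = [best]                  # elitism: keep the best unchanged
        while len(new_pop) < pop_size:
            kids = crossover(tournament(pop, rng), tournament(pop, rng), rng)
            new_pop.extend(mutate(k, rng) for k in kids)
        pop = new_pop[:pop_size]
    return max(pop, key=fitness), history


if __name__ == "__main__":
    best, history = evolve()
    print(len(history), fitness(best), best)
```

Because the elite individual is copied without mutation, the best fitness in `history` never decreases; average fitness can still fluctuate from one generation to the next.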
How big a search has life conducted on earth?
A combinatorial optimization problem (sort of)
- How many bacteria on earth: 10^x
- How many days would it take to produce that many bacteria from a single cell: 10^y
- How many bacteria could have been produced in 3.5 billion years, in an infinite world: 10^z
x, y & z are integers between 1 and 1 quadrillion
Rules:
- Initial guesses from each group form P (an x,y,z triplet)
- I will eliminate least fit guesses from P
- If you remain in P, your next guess is your last guess with up to 1 mutation & 2 crossings over with other guesses still in P
- If you were eliminated, your guess must be formed from up to 1 mutation and 2 crossings over from 2 members of P
- Valid mutations: add/subtract a 1, or multiply/divide by 10 (90 can go to 90, 9, 900, 89 or 91)
[Table: example populations P at t = 1, 2, 3, as x.y.z triplets, with fitness F where given]
- With fitness: 5K.1tril.100tril, F=2; 10.1mil.1bil, F=7; 1000.1mil.1bil, F=5; 10K.10mil.10bil, F=5; 2K.5K.10, F=-1; 1000.1bil.1quad, F=0
- Later guesses: 30.500.600000; 300.1000.100000; 251.50000.600000; 2510.5000.60000; 251.5000.60000; 50.500.60000; 250.500.60,000; 300.1000.60000; 30.1000.60,000
Whitman et al., PNAS 1998, estimate 5 x 10^30 bacteria on the planet
"Events that would occur once in 10 billion years in the laboratory would occur every second in nature."
- 1 x 10^3 bacterial generations per year (1 every 3 days)
- 3.5 x 10^9 years of evolution
- ~10^43 bacteria have lived on earth
How long to produce 10^30 bacteria: 150 days ~= 10^2 days
How many bacteria could have been produced in 3.5 billion years?
- 3 trillion generations
- 2^3,000,000,000,000 bacteria = 10^(900 billion) bacteria
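The order-of-magnitude figures above follow from a few logarithms. The sketch below redoes the arithmetic with the slide's own inputs (5 x 10^30 cells, 10^3 generations per year, 3.5 x 10^9 years, 3 trillion generations); the variable names are illustrative. It deliberately stops at the number of doublings needed to reach 10^30 cells, since turning that into days depends on the doubling time one assumes.

```python
import math

# Doublings needed to grow from 1 cell to 10^30 cells: log2(10^30).
doublings_to_1e30 = math.log2(1e30)

# Cells that have ever lived, as the slide estimates it:
# standing population x generations per year x years.
total_lived = 5e30 * 1e3 * 3.5e9

# Unlimited doubling for 3 trillion generations gives 2^(3e12) cells.
# That number is far too large for a float, so work with its base-10 exponent:
# 2^g = 10^(g * log10(2)).
generations = 3e12
z = generations * math.log10(2)

print(f"doublings to reach 1e30 cells: {doublings_to_1e30:.1f}")
print(f"bacteria that have lived: about 10^{math.log10(total_lived):.0f}")
print(f"2^(3 trillion) = 10^({z:.3g})")
```

About 100 doublings are enough to match today's standing population, and the exponent `z` comes out near 9 x 10^11, that is, the slide's 10^(900 billion).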
Reading for Monday
Genetic Algorithms: Principles of Natural Selection Applied to Computation. Stephanie Forrest, Science, Vol. 261, No. 5123 (Aug. 13, 1993), pp. 872-878.
Gray codes, schema, function & combinatorial optimization, selecting good parameters/rules for your GA
TALK ON SATURDAY, 5:30 pm, Hibben 105 (just south of the Anthropology Department)
Professor Steve Lansing, "Perfect Order: Recognizing Complexity in Bali"
Guidelines for implementing GAs Define the individuals (string of bits or letters, representation matters) Define a fitness function that evaluates the string (explicitly or implicitly, e.g. in an ABM) Define the rules for selecting individuals (e.g. roulette or tournament, often with elitism) mutation (e.g., some % of bits flip) mating (usually 2 parents) cross over (e.g., probability of crossing over at each position; 1 point, 2 point, n point) Parameters (rough guidelines from DeJong 1975 GA experiments on a particular suite of problems): Bitstring length: 32-10,000 Population size: 100-1000 Length of run: 50-10,000 Single point crossover rate: 0.6 per pair of parents Mutation rate: 0.005 per bit
GA evolved GA parameters
Grefenstette 1986 found smaller populations with more crossover & mutation maximized average fitness
- Population: 30
- Crossover: 0.95
- Mutation: 0.01
- Elitism
Schaffer, Caruana, Eshelman & Das 1989: bigger test set of numerical optimization problems, Gray coding
- Population: 20-30
- Crossover: 0.75-0.95
- Mutation: 0.005-0.01
Evolving parameters over time Davis 1989, 1991: Let mutation & crossover rates evolve Allows values of operators that improve fitness to become more common in the population Operators have fitness based on fitness of individuals containing that operator or descended from such individuals An operator, O, is chosen proportional to its fitness to create a new individual, i, that replaces an unfit individual If F(i) > current F max, the fitness of O i is incremented. Fitness of O p(i), O p(p(i)) (parents & grandparents, etc. of i) are also incremented. Improves GA performance on some problems, including evolving weights on neural nets How well does the population of operators represent current usefulness? Depends on the parameters
Goldberg et al 1989, 1990, 1993: Messy GAs
Improve function optimization by building long strings from well-tested building blocks
Example: simple function optimization GA
- F(x,y) = y*x^2 - x^4
- Represent x and y as 3 bit Gray coded strings
- E.g., F(001111) = F(1,5) = 5*1^2 - 1^4 = 4
Problem: can't increase string length to keep building blocks
Solution: encode position in the representation as (position, value)
- Individual above: {(1,0),(2,0),(3,1),(4,1),(5,1),(6,1)}
- Another individual: {(3,1),(3,0),(3,1),(4,0),(4,1),(3,1),(2,0),(1,1)}
How do you evaluate the fitness of this individual?
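The worked value F(001111) = 4 can be checked by decoding the two 3-bit Gray-coded halves and applying the formula. The sketch below assumes the standard reflected binary Gray code (each decoded bit is the XOR of the previous decoded bit and the current Gray bit); the function names are illustrative.

```python
from itertools import product


def gray_to_int(bits):
    """Decode a reflected-binary Gray code string such as '111' to an int."""
    value, prev = 0, 0
    for ch in bits:
        prev ^= int(ch)               # binary bit = previous binary bit XOR Gray bit
        value = (value << 1) | prev
    return value


def F(individual):
    """Fitness of a 6-bit individual: first 3 bits encode x, last 3 encode y."""
    x = gray_to_int(individual[:3])
    y = gray_to_int(individual[3:])
    return y * x**2 - x**4


print(gray_to_int("001"), gray_to_int("111"))   # 1 5
print(F("001111"))                              # 4

# The search space has only 64 strings, so the optimum can be found by brute force.
best = max(("".join(bits) for bits in product("01", repeat=6)), key=F)
print(best, F(best))                            # 011100 12
```

The brute-force optimum is x = 2, y = 7, which gives a sense of what the GA is searching for in this toy problem.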
Messy GAs cont. Another individual: {(3,1),(3,0),(3,1),(4,0),(4,1),(3,1),(2,0),(1,1)} 1. Use leftmost assignment for each position 2. Fill in unspecified bits with * Evaluated as {(1,1),(2,0),(3,1),(4,0),(5,*),(6,*)} 1010** (a schema) Now you have to evaluate the fitness of a schema F(1010**) Use Competitive Templates First use the GA to find a string S, that is a local optimum (for which n bit flips do not improve fitness) Second, use bits from S to fill in unspecified bits to evaluate: E.g. if S = 110010 evaluate F(101010)
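The two evaluation rules and the competitive-template step above can be written as two small functions. This is a sketch with illustrative names (`express`, `fill_from_template`); positions are 1-indexed as on the slide.

```python
def express(genes, length):
    """Turn (position, value) genes into a schema string.

    The leftmost assignment for each position wins; positions that are
    never assigned are left as '*'.
    """
    schema = ["*"] * length
    for position, value in genes:
        if schema[position - 1] == "*":
            schema[position - 1] = str(value)
    return "".join(schema)


def fill_from_template(schema, template):
    """Replace each '*' with the bit at the same position in the template."""
    return "".join(t if s == "*" else s for s, t in zip(schema, template))


individual = [(3, 1), (3, 0), (3, 1), (4, 0), (4, 1), (3, 1), (2, 0), (1, 1)]
schema = express(individual, 6)
print(schema)                                  # 1010**
print(fill_from_template(schema, "110010"))    # 101010
```

The filled-in string can then be scored by the ordinary fitness function, so an underspecified individual is always judged against the same locally optimal background.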
Messy GAs continued: 2 Phases
Primordial: create a population of small, fit simple strings
- List all schema with k specified bits, for a string length l. For k = 3, l = 6:
  {(1,0),(2,0),(3,0)}
  {(1,0),(2,0),(3,1)}
  {(1,1),(2,1),(3,1)}
  {(1,0),(2,0),(4,0)}
  {(4,1),(5,1),(6,1)}
- Use selection to evaluate fitness (w/o mutation or crossover) to cull population
Juxtapositional: cut and splice the small strings in P
- E.g. splice the 4th and 5th strings above: {(1,0),(2,0),(4,0),(4,1),(5,1),(6,1)}
- E.g. cut to create {(1,0),(2,0)} and {(4,0),(4,1),(5,1),(6,1)}
- Evaluate fitness of these new strings
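With individuals stored as lists of (position, value) pairs, the juxtapositional operators reduce to list concatenation and slicing. The sketch below reproduces the spliced and cut strings from the example above; the names `splice`, `cut`, `a` and `b` are illustrative, and a real messy GA would choose the cut point at random.

```python
def splice(left, right):
    """Join two messy strings end to end."""
    return left + right


def cut(genes, point):
    """Split a messy string into two pieces at index `point`."""
    return genes[:point], genes[point:]


a = [(1, 0), (2, 0), (4, 0)]
b = [(4, 1), (5, 1), (6, 1)]

joined = splice(a, b)
print(joined)   # [(1, 0), (2, 0), (4, 0), (4, 1), (5, 1), (6, 1)]

head, tail = cut(joined, 2)
print(head)     # [(1, 0), (2, 0)]
print(tail)     # [(4, 0), (4, 1), (5, 1), (6, 1)]
```

Note that `joined` assigns position 4 twice and never assigns position 3, so it is both overspecified and underspecified until the leftmost-wins and template rules are applied.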
Problems with messy GAs
- The biggest problems are specifying k & evaluating all schemas with k specified bits: n = 2^k * C(l, k), where C(l, k) = l! / (k! (l-k)!)
- The initialization bottleneck is evaluating all those schemas
- There are some methods to reduce the necessary evaluations, but this technique only works well when k is small, e.g. when low order schema are good enough to solve the posed problem
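The size of the initialization bottleneck is easy to tabulate: choose which k of the l positions are specified, then choose a bit for each. The sketch below counts the schemas with that formula and, for the small k = 3, l = 6 case, confirms the count by listing them; the function names are illustrative.

```python
from itertools import combinations, product
from math import comb


def count_schemas(k, l):
    """Number of schemas with exactly k specified bits in a string of length l."""
    return 2**k * comb(l, k)


def enumerate_schemas(k, l):
    """Yield every such schema as a list of (position, value) pairs."""
    for positions in combinations(range(1, l + 1), k):
        for values in product((0, 1), repeat=k):
            yield list(zip(positions, values))


schemas = list(enumerate_schemas(3, 6))
print(schemas[0], len(schemas))        # [(1, 0), (2, 0), (3, 0)] 160
print(count_schemas(3, 6))             # 160

# Growth with string length for fixed k = 3.
for l in (6, 32, 100, 1000):
    print(l, count_schemas(3, l))
```

For fixed k the count grows roughly like l^k, which is why the approach is only practical when low-order schemas are enough.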
How complicated is the Biological GA?
- Genotype: 4 letter alphabet -> triplet codons -> amino acids (with redundancy) -> folds into protein with unpredictable structure & phenotype varies with cellular environment
- Variation by mutation, insertion, deletion: each copy has potential for small and large effect on phenotype
- Position of genes doesn't matter* but position of codons matters
- Genes interact in gene regulatory networks