Molecular Population Genetics

Similar documents
Nucleotide substitution models

EVOLUTIONARY DISTANCE MODEL BASED ON DIFFERENTIAL EQUATION AND MARKOV PROCESS

Processes of Evolution

Genetical theory of natural selection

CHAPTER 23 THE EVOLUTIONS OF POPULATIONS. Section C: Genetic Variation, the Substrate for Natural Selection

7. Tests for selection

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

(Write your name on every page. One point will be deducted for every page without your name!)

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Febuary 1 st, 2010 Bioe 109 Winter 2010 Lecture 11 Molecular evolution. Classical vs. balanced views of genome structure

The neutral theory of molecular evolution

D. Incorrect! That is what a phylogenetic tree intends to depict.

Understanding relationship between homologous sequences

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Natural Selection results in increase in one (or more) genotypes relative to other genotypes.

Application Evolution: Part 1.1 Basics of Coevolution Dynamics

Genetics and Natural Selection

Problems for 3505 (2011)

Darwinian Selection. Chapter 7 Selection I 12/5/14. v evolution vs. natural selection? v evolution. v natural selection

GENETICS - CLUTCH CH.22 EVOLUTIONARY GENETICS.

Full file at CHAPTER 2 Genetics

Major questions of evolutionary genetics. Experimental tools of evolutionary genetics. Theoretical population genetics.

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Functional divergence 1: FFTNS and Shifting balance theory

Bioinformatics Exercises

Chapter 7: Covalent Structure of Proteins. Voet & Voet: Pages ,

Genomes and Their Evolution

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

Neutral Theory of Molecular Evolution

Lecture Notes: BIOL2007 Molecular Evolution

Theory a well supported testable explanation of phenomenon occurring in the natural world.

BIG IDEA 4: BIOLOGICAL SYSTEMS INTERACT, AND THESE SYSTEMS AND THEIR INTERACTIONS POSSESS COMPLEX PROPERTIES.

Life Cycles, Meiosis and Genetic Variability24/02/2015 2:26 PM

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

- mutations can occur at different levels from single nucleotide positions in DNA to entire genomes.

REVIEW 6: EVOLUTION. 1. Define evolution: Was not the first to think of evolution, but he did figure out how it works (mostly).

AP Biology Review Packet 5- Natural Selection and Evolution & Speciation and Phylogeny

Outline of lectures 3-6

19. Genetic Drift. The biological context. There are four basic consequences of genetic drift:

The origin and maintenance of genetic variation

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Is there any difference between adaptation fueled by standing genetic variation and adaptation fueled by new (de novo) mutations?


Mechanisms of Evolution Microevolution. Key Concepts. Population Genetics

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Enduring Understanding: Change in the genetic makeup of a population over time is evolution Pearson Education, Inc.

Evolution of Populations. Chapter 17

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

SEQUENCE DIVERGENCE,FUNCTIONAL CONSTRAINT, AND SELECTION IN PROTEIN EVOLUTION

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

Q Expected Coverage Achievement Merit Excellence. Punnett square completed with correct gametes and F2.

Big Idea #1: The process of evolution drives the diversity and unity of life

Natural Selection. DNA encodes information that interacts with the environment to influence phenotype

Notes for MCTP Week 2, 2014

Selection and Population Genetics

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

How does natural selection change allele frequencies?

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

There are 3 parts to this exam. Use your time efficiently and be sure to put your name on the top of each page.

ACGTTTGACTGAGGAGTTTACGGGAGCAAAGCGGCGTCATTGCTATTCGTATCTGTTTAG Human Population Genomics

Bioinformatics. Part 8. Sequence Analysis An introduction. Mahdi Vasighi

How robust are the predictions of the W-F Model?

Outline of lectures 3-6

Effective population size and patterns of molecular evolution and variation

mrna Codon Table Mutant Dinosaur Name: Period:

EVOLUTION UNIT. 3. Unlike his predecessors, Darwin proposed a mechanism by which evolution could occur called.

Population Genetics I. Bio

Population Genetics of Selection

Evolutionary Analysis of Viral Genomes

Reproduction- passing genetic information to the next generation

Population Genetics & Evolution

Quantitative Genetics & Evolutionary Genetics

Evolution. Changes over Time

EVOLUTIONARY DISTANCES

Cladistics and Bioinformatics Questions 2013

Mathematical modelling of Population Genetics: Daniel Bichener

Evolution PCB4674 Midterm exam2 Mar

FUNDAMENTALS OF MOLECULAR EVOLUTION

Unit 9: Evolution Guided Reading Questions (80 pts total)

Reproduction and Evolution Practice Exam

Genetic Variation in Finite Populations

Unit 7: Evolution Guided Reading Questions (80 pts total)

Sequence Divergence & The Molecular Clock. Sequence Divergence

Molecular Evolution & the Origin of Variation

Molecular Evolution & the Origin of Variation

Introduction to Natural Selection. Ryan Hernandez Tim O Connor

The Wright-Fisher Model and Genetic Drift

Biology Semester 2 Final Review

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Population Structure

8/23/2014. Phylogeny and the Tree of Life

BIOINFORMATICS: An Introduction

Conservation Genetics. Outline

1. Contains the sugar ribose instead of deoxyribose. 2. Single-stranded instead of double stranded. 3. Contains uracil in place of thymine.

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Variances of the Average Numbers of Nucleotide Substitutions Within and Between Populations

Classical Selection, Balancing Selection, and Neutral Mutations

Introduction to population genetics & evolution

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Transcription:

Molecular Population Genetics The 10 th CJK Bioinformatics Training Course in Jeju, Korea May, 2011 Yoshio Tateno National Institute of Genetics/POSTECH

Top 10 species in INSDC (as of April, 2011)

CONTENTS 1. Evolution of organisms 2. Evolution of genes 3. Genes and alleles 4. Gene (allele) frequency 5. Natural selection 6. Gene frequency change over time 7. Evolution of MHC genes

We conclude that photosynthetic organisms had evolved and were living in a stratified ocean supersaturated in dissolved silica 3,416 million (3.4 billion) years ago. M. M. Tice and D. R. Lowe ( Stanford University, USA) Nature 431: 549-552, 2004

In bioinformatics, we deal with large quantities of DNA, RNA and protein sequences, gene expression, proteomics and pathways data. The scientific background of dealing with such biological data is evolution or molecular evolution. This is because all the biological data are the products of evolution. For example, BLAST is useless without this conception.

Genome 1. is to design and control life activity in individuals, 2. is to be inherited to offspring - evolution. Therefore, it is important to recognize that genes, proteins and organisms are products of evolution, and pay attention to the evolutionary view when studying them.

Usage of Genome Data 1. To deduce the function of a DNA sequence by comparing to others whose functions are known, 2. To know the evolutionary origin and process of a gene or a species by constructing a phylogenetic t related (orthologous) genes, 3. To examine if a particular gene exists in a particular species by constructing a phylogenetic tree of the species and related species, 4. To find a route of virus or bacterial infections, 5. To understand regulation of gene expression (CAGE microrna), and more..

Evolution is quantitative and qualitative changes in genes over time. Evolution is primarily driven by mutation. No mutation, no evolution.

Mutation 1. Point Mutation One nucleotide (base) is replaced by another. 2. Insertion One or more nucleotides are inserted into the extant sequene. 3. Deletion One or more nucleotides are deleted from the extan t sequence. 4. Inversion A sequence o f two or more nucleotides is replaced in the opposite direction to the extant sequence. 5. Recombination A sequence in a chromosome is exchanged with another in the other homologous chromosome. 6. Translocation A sequence in a chromosome is moved and located in another chromosome.

Spontaneous mutations Natural miss-pairings between the two nucleotides

Gene Evolution by Mutation A T G gene 0 mutatio n G G G back mutation A T mutatio n A A A G G mutatio n T T t1 t2 t3 t4 time

Poisson Process The first term on the right of the equation represents the probability that one new event added during t to the state that n -1 events had occurred by time t. The second term means that no event was added during t to the probability that n events had occurred by time t. dp 1 (t) dt This process is characterized by that the chance of a new event in any short interval is independent, not only of the previous states of the system in question, but also of the present state. The process is called a random process. We can thus assume that the chance of a new addition to the total account during a very short interval ( t) is written as t where is a rate of the addition ignoring the chances of two or more new additions. If we define Pn (t t) as the probability that the exactly n events have occurred by time t t, we can derive the following equation. P n (t t) P n 1 (t) t P n (t)(1 t) (1) Formula (1) is rewritten as P n (t t) P n (t) P n 1 (t) P n (t) t This difference equation can be approximated as the following differential equation. dp n (t) dt P n 1 (t) P n (t) (2) For n 0, we have, dp 0 (t) P 0 (t) (3) dt The solution of (3) is given by, ln P 0 (t) t c, or P 0 (t) ce t Since P 0 (0) 1, P 0 (t) e t For n 1, (2) is expressed as (4) P 0 (t) P 1 (t) e t P 1 (t) (5 If we remember t f t g t f t g t f t g t, and dt d ae at, dt eat we can rewrite the right of (5) as t e t Therefore, t e t. P 1 t te t. (6) Repeating this procedure, we can get n P n t e t t, (7) n! which is the general expression of Poisson probability.

Random events in the time axis means that they occur constantly over time, or proportionally to time.

From my thesis, 1978

Population Genetics and Molecular Evolution Study of gene frequency change over time in a population. Evolution is defined as gene frequency change over time.

W. Bateson (1902) Alleles A/A: homozygote A/a: heterozygote

In 1955 Tjio and Levan first reported that humans had 23 pairs of chromosomes. The chromosomes of humans, 2n = 46, duploid < monoploid.

Gene Frequency

Quiz 1 Two equal-sized populations, 1 and 2, have frequencies p 1 and p 2 of the allele A. The populations are fused into a single randomly mating unit. (1) What is the frequency of AA homozygotes in the mixed populations? (2) What is the answer to (1), if population 1 is 4 times as large as population 2?

Four Major Evolutionary Factors Causing Gene Frequency Change 1.Natural Selection 2.Mutation 3.Random Genetic Drift 4.Migration

自然選択と適応度 (Natural Selection and Fitness) Frequency of A gene:p Frequency of a gene:q (p + q = 1) Genotype AA Aa aa Frequency p 2 2pq q 2 Fitness w 11 =1 w 12 =1- s 1 w 22 =1- s 2 Genotype 遺伝子型 s 1, s 2 are selection coefficients and range in the following regions. 0 < s 1 <1, 0 < s 2 <1 AA can leave ralatively w 11 p 2 offspring in the next generation. Aa can leave ralatively 2w 12 pq offspring in the next generation. aa can leave ralatively w 22 q 2 offspring in the next generation. When s 1 s 2 0, the three genotypes take the same fitness, and no selection operates; A and a are neutral genes.

Dominance and Recessiveness of Alleles

Gene frequency Gene frequency at the next generation: p' Gene frequency w 11 p 2 + w 12 pq p' = w 11 p 2 + 2w12 pq + w 22 q 2 Difference in gene frequency change between the two generations: p (1) pq{p(w 11 - w 12 ) + q(w 12 - w 22 )} p = p' - p = w 11 p 2 + 2w 12 pq + w 22 q 2 (2) 1)Codominance Selection or Genic Selection A A A a a a 1 1 - s 1-2s ( 0 < s < 1 ) 1 spq p = > 0 1-2sq 2)Overdominance Selection A A A a a a 1 - s 1 1 - t ( 0 < s, t < 1 ) (3) t - (s + t)p p = 1 - sp 2 - tq 2 (4) ^ t p = 0 p = (5) s + t ^ 1 p :equilibrium gene frequency 遺伝子頻度 遺伝子頻度 ^ p 0 Time 時間 0 Time 時間

Quiz 2 Confirm formulas 3, 4 and 5 in the previous slide.

An Example of Over-dominance Selection Sickle-cell Anemia This is caused by a mutation at the 6th residue of the beta chain GAG (Glu) ー > GTG (Val) Upper: Normal Lower: Sickle-cell

Sickle cell and malaria Erythrocytes of the heterozygote for the sickle cell mutation is a less favorable environment for malaria parasites than those of the normal homozygote. Therefore, 1. The normal homozygote is advantageous over the heterozygotes in no malarial regions. 2. The heterozygote is advantageous over the normal homozygote in malarial regions - over-dominance. This indicates that the advantage or disadvantage of a genotype depends on the environments.

Gene frequency change over time by mutation

Four Major Evolutionary Factors Causing Gene Frequency Change 1.Natural Selection 2.Mutation 3.Random Genetic Drift 4.Migration

Gene Frequency Change by Random Genetic Drift

Fixation

Fixation Probability Approximation by diffusion equation As time gets infinite, the process becomes equilibrium Fixation probability : u (p)

Polymorphism 1. 1) Degree of polymorphism per locus (Heterozygosity per locus) 2) Heterozygosity for multiple loci 2. For DNA sequences Nucleotide diversity (Nei & Li 1979) n h 1 p i 2 i 1 p i : gene frequency of the i - th allele n : number of the alleles in the locus H 1 N i j N i 1 h i h i : heterozygosity of the i- th locus N : number of the loci p i p j s ij p i : frequency of the i- th sequence s ij : difference between the i - th and j- th sequences per nucleotide Quiz 3: What is the nucleotide diversity in the following case where the total number of the nucleotides is 100 for each sequence, and there is no other defference than given here. 1 2 3 A A C T G T

The rate of evolution (k) in case of neutral genes or mutations k = 2Nv x (1/2N) = v 2Nv: number of mutant genes 1/2N: fixation probability k is constant over time if v is.

Motoo Kimura

Neutral mutations vs Selective mutations Neutral mutations occur so that the protein product is not affected by them; ex. the 3rd position of a codon that will not change the corresponding amino acid. Synonymous mutations Selective mutations occur so that the protein product is affected by them; ex. the 1st and 2 nd positions of a codon that will change the corresponding amino acid. Non-synonymous mutations

Synonymous mutation GTT(Val) -> GTC(Val) Non-synonymous mutation GTT (Val)-> GCT(Ala)

Natural Selection vs. Neutral Theory at Molecular Level Natural selection Advantageous genes evolve faster than the other ones. Namely, functionally important regions in a gene evolve faster than the other regions. Neutral theory at molecular level Funcitonally important regions in a gene evolve slower than the other regions due to functional restrictions on the former. In case of pro-insulin pro-insulin B chain C peptide A chain B A s s insulin (functional) C a discarded peptide (non-functional) C peptide prevents diabetic vascular and neural dysfunction. (Ido, Y.et al., Science 277: 563-566, 1997) evolutionary rate 0.13 10-9 / site / year evolutionary rate 0.97 10-9 / site / year

Evolution of DNA sequences Ancestral sequence A D t T C t G T d t Descedant sequences

Jukes-Cantor Method (1969) i : probability that a nucleotide at a site remains unchanged after t years t r : rate of nucleotide subsitution per year Same Same 1. A t A 1 - r 1 - r A t+1 A (1 - r) 2 2. A t A r r t+1 G (T, C) G (T, C) 1 1 1 3 ( ) ( ) r = r 3 3 3 2 2

Different Same 1. A (G) t G (A) t 1 - r r A t+1 A (T, C) 1 2 2 r (1 - r) = r (1 - r) 3 3 2. A t G r r t+1 T (C) T (C) 1 1 2 2 r r = r 2 3 3 9

From the figures we can derive the following difference equation, i t 1 [(1 r) 2 1 3 r2 ]i t [ 2 3 r(1 r) 2 9 r 2 ](1 i t ) (1 2r)i t 2 3 r(1 i t ) Thus, i t 1 i t 8 3 r(1 4 i ) t (1) (2) The difference equation can be approximated as the differential equation such that, di t dt 8 3 r( 1 4 i t ) (3) The solution of (3) is given as, 8 1 (c is a constant.) 4 i t ce Since t 0 i t 1 3 rt (4) c 3 4, and (4) is rewritten as, 8 3 rt 1 4 i t 3 4 e (5) Put d t 1 i t, then (5) is expressed as 8 3 rt ln(1 4 3 d t ) (6) Since 2rt D t, D t 3 4 ln(1 4 3 d t ) (7)

Estimation of Evolutionary Rates of Nucleotide Substitutions M. Kimura (1980) α P yrimid in e s ( U C ) Same UU CC AA GG Total (Frequency) (R 1 ) (R 2 ) (R 3 ) (R 4 ) (R) β β β β Differenet, Type I UC CU AG GA Total (Frequency) (P 1 ) (P 1 ) (P 2 ) (P 2 ) (P) Pu rin e s ( A G ) α Different, UA AU UG GU Type II (Q 1 ) (Q 1 ) (Q 2 ) (Q 2 ) Total CA AC CG GC (Q) (Frequency) (Q 3 ) (Q 3 ) (Q 4 ) (Q 4 ) Types of nucleotide base pairs occupied at homologous sites in two species. Type I difference includes four cases in which both are purines or both are pyrimidines (line 2). Type II difference consists of eight cases in which one of the bases is a purine and the other is a pyrimidine (lines 3 and 4).

P 1 (T T) [1 (2 4 ) T]P 1 (T) T[R 1 (T) R 2 (T)] T Q(T) / 2 P 2 (T T) [1 (2 4 ) T]P 2 (T) T[R 3 (T) R 4 (T)] T Q(T) / 2 Summing these two equations, and noting P(T) = 2P 1 (T) 2P 2 (T) and R(T) R 1 (T) R 2 (T) R 3 (T) R 4 (T) 1 P(T) Q(T), and writing P(T) P(T T) P(T), we get P(T) / T = 2-4( + )P(T) - 2( + )Q(T).- - - - - - - - - - - - - (1) Carrying out a similar series of calculations for base pairs of type II, we obtain Q(T) / T = 4-8 Q(T).- - - - - - - - - - - - - - - - - -(2) From these two finite difference equations (Eqs.1 and 2), we obtain the following set of differential equations dp(t) 2 4( )P(T) 2( )Q(T) dt dq(t) 4 8 Q(T) - - - - - - - - - - - - - - - - - -(3) dt The solution of this set of equations which satisfies the condition P(0) = Q(0) = 0,- - - - - - - - - - - - - - - - - - -(4) i.e., no base differences exist at T = 0, is as follows. P(T) = 1 4 1 2 e 4( )T 1 4 e 8 T - - - - - - - - - - - - - - - - - (5) Q(T) = 1 2 1 2 e 8 T.- - - - - - - - - - - - - - - - - - - - - - - -(6) Writing P T and Q T for P(T) and Q(T), we get, from these two equations, 4( + )T = -log e (1 2P T Q T ) - - - - - - - - - - - - - - - - - - - (7) and 8 T = -log e (1 2Q T ),- - - - - - - - - - - - - - - - - - - - - - -(8) so that 4 T = -log e (1 2P T Q T ) (1/ 2)log e (1 2Q T ).- - - - - - - - - - - - - -(9) Since the rate of evolutionary base substitutions per unit time is k = +2, the total number of substitutions (including revertant and superimposed changes) per site which separate the two species (and therefore involve two branches each with length T) is K = 2Tk = 2 T +4 T, where T and T are given by Eqs.(8) and (9). Then, omitting the subscript T from P T and Q T, we obtain K = - 1 2 log e{(1 2P Q) 1 2Q}.- - - - - - - - - - - - - - - - - - - -(10)

There are three classes of MHC genes Class I: A, B, C, E, F Class II: DP, DQ, DR, Class III: C2, C4, TNF, and Class I related genes: MICA, MICB, Classes I and II were coined by Jan Klein in 1977. (Klein, J. In The Major Histocompatibility System in Man and Animals, D. Gotze ed, Springer - Verlag, Berlin, 1977)

ORF 1: Gag-like protein?, ORF2: EN (integration) and RT (reverse transcriptase)

Our Approach 1. Because MHC genes themselves are known to have been subject to strong positive natural selection, we instead focused on neutral or nearly neutral segments in the MHC class I genome regions. Neutral genes and genome segments are known to evolve constantly over time (Kimura 1968). 2. We also try to carry out genome-oriented analysis. 3. We thus selected incomplete (and thus neutral) LINE sequences that were orthologous among the species studied.

Orthologous vs. Paralogous Hb Human Hemoglobin α Horse Hemoglobin α orthologous Hb Human Hemoglobin β paralogous Horse Hemoglobin β Mb Human Myoglobin generated by speciation generated by gene duplication Horse Myoglobin

Orthologous Repeated Sequences among Spec Generally, it is not easy to select repeated sequences that are orthologous to each other among the species studied. We confirmed they were orthologous among the species by examining the following four aspects. 1) their relative locations to other genes and fragments, 2) 5-3 direction, 3) structures and 4) homology.

Man Chimpanzee

A genomic region near CDSN rhesus monkey chimpanzee human

Number of LINE sites and %identity among human, chimpanzee and rhesus monkey

Evolutionary distance and divergence time among human, chimpanzee and rhesus monkey Rhesus Rhesus - Chimp Chimp 0.0629 ± 0.0015 - Human 0.0624 ± 0.0015 0.0125 ± 0.0006 Evolutionary distances (substitutions per site) were computed by Kimura's two parameter method.

Rate of LINEs in this region r 0.0125 1.04 10 9 6 2 6 10 site / year Divergence time between rhesus and (man, chim) T 0,062 / 2 30.0 million years 9 1.04 10

30.0 MHC-B-like genes (19 repeats)

Thank you very much