The nonsynonymous/synonymous substitution rate ratio versus the radical/conservative replacement rate ratio in the evolution of mammalian genes


MBE Advance Access published July, 00

Kousuke Hanada 1,, Shin-Han Shiu and Wen-Hsiung Li 1 *

1. Department of Ecology and Evolution, University of Chicago, Chicago, IL 0. Department of Plant Biology, Michigan State University, East Lansing, MI

Running head: Ka/Ks ratio vs radical/conservative replacement ratio

Key words: positive selection, radical substitution, conservative substitution, classification of amino acids, development.

*Corresponding author. Wen-Hsiung Li, Department of Ecology and Evolution, University of Chicago 01 East th Street, Chicago, IL, 0, USA. Tel: +1- -0-. Fax: +1- -0-0. E-mail: whli@uchicago.edu

The Author 00. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Abstract

There are two ways to infer selection pressures in the evolution of protein-coding genes: the nonsynonymous and synonymous substitution rate ratio (KA/KS) and the radical and conservative amino acid replacement rate ratio (KR/KC). Since the KR/KC ratio depends on the definition of radical and conservative changes in the classification of amino acids, we develop an amino acid classification that maximizes the correlation between KA/KS and KR/KC. An analysis of, orthologous gene groups among five mammalian species shows that our classification gives a significantly higher correlation coefficient between the two ratios than those of existing classifications. However, there are many orthologous gene groups with a low KA/KS but a high KR/KC ratio. Examining the functions of these genes, we found an overrepresentation of functional categories related to development. To determine if the overrepresentation is stage specific, we examined the expression patterns of these genes at different developmental stages of the mouse. Interestingly, these genes are highly expressed in the early middle stage of development (Blastocyst to Amnion). It is commonly thought that developmental genes tend to be conservative in evolution, but some molecular changes in developmental stages should have contributed to morphological divergence in adult mammals. Therefore, we propose that the relaxed pressures indicated by the KR/KC ratio but not by KA/KS in the early middle stage of development may be important for the morphological divergence of mammals at the adult stage, while purifying selection detected by KA/KS occurs in the early middle developmental stage.

Introduction

Selection pressure on protein-coding sequences is commonly estimated by the ratio of the nonsynonymous substitution rate (KA) to the synonymous substitution rate (KS) (Li and Gojobori 1; Hughes and Nei 1). If the KA/KS ratio is higher than 1, positive selection is assumed to have occurred during the evolution of the sequence. The ratio of the radical replacement rate (KR) to the conservative replacement rate (KC) has also been used to detect positive selection (Hughes, Ota, and Nei ). The KR/KC ratio is useful for examining selection pressure in distantly related protein-coding sequences because the KA/KS ratio cannot be accurately estimated in this case due to saturation of KS (Gojobori 1; Smith and Smith 1). Since there are two ways of inferring selection pressure on a sequence, an open question is whether these two approaches give the same conclusion or not. Zhang (000) and Smith (00) found that KA/KS is correlated with KR/KC based on the amino acid classification that considers polarity and volume, using mammalian and Drosophila genes. However, there are several types of amino acid classifications and it is not known which classification gives a KR/KC measure that best correlates with the KA/KS ratio. Therefore, we do not know the degree of correlation between the two ratios in general. In the present study, we searched for an amino acid classification that gives the best correlation between the two ratios. This amino acid classification is useful because the KR/KC ratio based on it can identify genes undergoing selection pressures similar to those inferred by the KA/KS ratio between distant protein-coding sequences. Another issue is that the two ratios are unlikely to be completely correlated even if the amino acid classification that gives the maximum correlation between them is used.
To address the differences between the selection pressures inferred by KA/KS and KR/KC in the evolution of mammalian genes, we examined functions of genes that showed different selection pressures inferred by the two ratios, using Gene Ontology (GO) categories and expression data of a representative mammal, the mouse.

Materials & Methods

Construction of orthologous groups

cDNA data of five mammalian species were retrieved from the Ensembl database (www.ensembl.org): Homo sapiens (NCBI.may), Pan troglodytes (CHIMP1.may), Mus

musculus (NCBIM.may), Rattus norvegicus (RGSC..may) and Canis familiaris (BROADD1.may). Reciprocal best hits between every combination of two species were identified with Blastp (Altschul et al. 1). Sequences that were reciprocal best hits among all species combinations (Fig. 1A) were considered an orthologous group among the five species., putative orthologous groups were constructed according to this procedure. To further verify the, orthologous groups, phylogenetic trees were constructed using the protein sequence alignments of members in an orthologous group by the neighbor-joining (NJ) method (Saitou and Nei 1; Thompson, Higgins, and Gibson 1). When the topology was different from the species tree, the data set was removed from the orthologous data (Fig. 1B). The total number of orthologous groups was reduced to,. For the numbers of nucleotide sites used in these orthologous groups, the interquartile range (%-%) and the median number of nucleotide sites are .0-.0 and .0, respectively. The orthologous gene groups in the five mammalian species were determined as follows. The orthologous gene data were carefully constructed to reduce errors in estimating nucleotide and amino acid substitutions. Only segments aligned among the five species without any gaps were used for the calculation of the KA/KS and KR/KC ratios.

Estimation of KA/KS and KR/KC in each orthologous gene set

A phylogenetic tree was reconstructed for each orthologous gene group by the NJ method (Saitou and Nei 1). The ancestral sequence was inferred at each node in the phylogenetic tree using the maximum likelihood method (Yang, Kumar, and Nei 1). The transition/transversion ratio was estimated in each orthologous group and the ratio was then used to estimate KA and KS in all branches in the phylogenetic tree by the modified Nei-Gojobori method (Zhang, Rosenberg, and Nei 1).
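The reciprocal-best-hit grouping described under "Construction of orthologous groups" can be illustrated with a minimal sketch. It assumes that the best hit of every gene in every other species has already been extracted from the Blastp output into dictionaries; the function name `orthologous_groups` and the layout of `best_hits` are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def orthologous_groups(species, best_hits):
    """Group genes that are reciprocal best hits between every pair of species.

    species   -- list of species names; the first one is used as the reference
    best_hits -- dict mapping (query_species, target_species) to a dict
                 {query_gene: best_hit_gene} (assumed to be precomputed)
    """
    reference = species[0]
    groups = []
    for gene in best_hits[(reference, species[1])]:
        # Collect the reference gene's best hit in every other species.
        members = {reference: gene}
        for other in species[1:]:
            hit = best_hits[(reference, other)].get(gene)
            if hit is None:
                break
            members[other] = hit
        else:
            # Keep the group only if every pair of members is a reciprocal best hit.
            if all(
                best_hits[(s, t)].get(members[s]) == members[t]
                and best_hits[(t, s)].get(members[t]) == members[s]
                for s, t in combinations(species, 2)
            ):
                groups.append(members)
    return groups
```

Requiring reciprocity for every species pair, rather than only against the reference species, mirrors the "all species combinations" criterion in the text; the subsequent tree-topology filter is not covered by this sketch.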
The sums of KA and KS of all branches were used to determine the KA/KS ratio in each orthologous gene group. Radical and conservative changes were defined by a classification (A) that gave the best correlation between KR/KC and KA/KS and also by three previous classifications with respect to the chemical properties: (B) polarity and volume, (C) charge and aromaticity, and (D) charge and polarity (Zhang 000; Hanada, Gojobori, and Li 00) (Table 1). These so-called physicochemical properties (aromaticity, charge, polarity, and volume) are thought to be relevant for the evolution of proteins (Grantham 1; Miyata, Miyazawa, and Yasunaga 1). Based on the ancestral sequences inferred at all nodes in the phylogenetic tree of each orthologous group,

KR and KC were estimated in all branches in the phylogenetic tree by the Zhang method (Zhang 000). The sums of branch lengths that reflected KR and KC were used to determine the KR/KC ratio in each orthologous group. Average KA, KS, KR and KC in each branch of species tree among, orthologous groups are given in Supplement A.

Construction of a new amino acid classification

To estimate the average KA/KS ratio for each amino acid replacement, we collected from the orthologous gene groups the amino acid replacements that had occurred. The average KA/KS ratio for each type of amino acid replacement is defined to be the average KA/KS ratio in the collected orthologous gene groups. The average KA/KS ratios were estimated for each of the kinds of amino acid replacement occurring by single nucleotide substitution. Since the amino acid replacement having a low (high) KA/KS ratio should tend to be a conservative (radical) change in the highly associated classification, radical and conservative scores were numbered for types of amino acid replacement in descending (ascending) order of KA/KS (Supplement B). Using the radical and conservative scores for the types of amino acid replacement, we calculated the totals of radical and conservative scores for each amino acid classification. To find an amino acid classification that would give the maximum correlation between KR/KC and KA/KS, amino acids were classified into two to five groups in all possible combinations and we identified the classification with the highest score. The new classification is regarded as the amino acid classification that can more adequately characterize the relationship between KA/KS and KR/KC.

Functional categories by Gene Ontology

Orthologous gene groups with the top and bottom % KA/KS or KR/KC values were considered as relaxed selection groups and purifying selection groups, respectively.
Under this classification, there are four possible combinations for the orthologous gene groups: (1) relaxed selection groups inferred by both KA/KS and KR/KC (a high KA/KS and a high KR/KC), (2) purifying selection groups inferred by both KA/KS and KR/KC (a low KA/KS and a low KR/KC), (3) relaxed and purifying selection groups inferred by KA/KS and by KR/KC (a high KA/KS and a low KR/KC), respectively, and (4) purifying selection and relaxed selection groups inferred by KA/KS and by KR/KC (a low KA/KS and a high KR/KC), respectively. Gene Ontology (GO) assignments for the mouse genes were obtained from the mouse genome database (Hill et al. 00). To simplify functional interpretation, we used the GO

categories of biological processes from top to the th depth in the hierarchy. The expected proportion of each GO category assigned by the mouse genes was compared with the observed proportion of each GO category assigned by the mouse genes of orthologous gene groups undergoing different selection pressures by the chi-square test. When the observed proportion is significantly higher than the expected proportion in a given GO category (P<0.0), the hierarchical pathways from the root to the overrepresented GO category were shown by the Graphviz software (www.graphviz.org).

The expression pattern at a developmental stage

The mouse expression dataset covering various stages of mouse development (Ringwald et al. 001) was used to determine the relationships between gene expression and the nature of selection pressure as determined by the KA/KS and KR/KC measures. Among different selection pressures, we compared the expression bias of genes at a developmental stage by the following equation.

$$R = \frac{N_{ob.}}{N_{ex.}} = \frac{N_{ob.}}{P_{all} \times N_{selected}}$$

For a particular developmental stage, Nob. and Nex. are the observed and expected numbers of expressed genes that experienced purifying or relaxed selection pressure at the developmental stage, Pall is the proportion of all mouse genes expressed at a given developmental stage, and Nselected is the total number of genes undergoing each of four types of selection pressures. Nex. was calculated by multiplying Pall by Nselected.

Results

A new classification of amino acids

To find a new classification that yields the maximum correlation between KA/KS and KR/KC, we first constructed all possible combinations in which the 20 amino acids can be classified into two to five groups. Second, a table representing the average KA/KS ratio for each type of amino acid replacement was constructed to see what kinds of amino acid replacements more adequately characterize the KA/KS ratio (Supplement B).
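To give a sense of the size of this search space, the number of ways to split the amino acids into two to five groups can be counted with Stirling numbers of the second kind. The sketch below assumes that groups are unlabeled and non-empty, which the text does not state explicitly; the function names are hypothetical and the paper does not describe how the enumeration was implemented.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n distinct items into k non-empty, unlabeled groups."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # The n-th item either joins one of the k existing groups or forms a new group.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def count_classifications(n_items=20, min_groups=2, max_groups=5):
    """Total number of classifications with min_groups..max_groups groups."""
    return sum(stirling2(n_items, k) for k in range(min_groups, max_groups + 1))


if __name__ == "__main__":
    print(count_classifications())
```

Because the count grows quickly with the number of groups, scoring every classification requires a cheap score per classification, which is consistent with the rank-based radical and conservative scores described in Materials & Methods.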
Based on the table, a new classification of amino acids with a higher correlation between the KA/KS ratio and the radical or conservative change was constructed (Classification A in Table 1). In the new classification, amino acids are classified into basic, acidic and neutral charges. The aromatic amino acids belong

to the group of the basic charges because one of the aromatic amino acids has a basic charge. The amino acids with neutral charge are classified into small and large volumes that fall into distinct groups. Consequently, this new classification seems to be constructed with respect to the chemical properties of charge, aromaticity and volume.

Correlation between KR/KC and KA/KS

Using three existing amino acid classifications and our new classification, we estimated four KR/KC ratios for each orthologous gene group. The four KR/KC ratios were significantly positively correlated with each other (P < 0.01) (Table 2). In terms of the correlation between KR/KC and KA/KS, the correlation coefficient in the new classification (A, r=0.; Table 2) was expected to be the highest among the four chemical classifications because the new classification (A) was constructed by the chemical properties associated with the KA/KS ratio. In fact, the correlation coefficient between KA/KS and KR/KC based on the new classification is significantly higher than those based on the other three classifications (P < 0.01), though the other three KR/KC ratios are also each positively correlated with the KA/KS ratio (P < 0.01) (Fig. 2). However, even under the new classification, which gives the highest correlation between the two ratios, the correlation coefficient is less than 0., indicating that selective pressures inferred by the KR/KC ratio and by the KA/KS ratio differ substantially. In particular, there are many orthologous gene groups with a low KA/KS and a high KR/KC ratio (Fig. 2). These orthologous gene groups have likely undergone relaxed selection in radical amino acid substitutions as indicated by the KR/KC ratio but experienced purifying selection in nonsynonymous changes as indicated by the KA/KS ratio.
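As an illustration of how a correlation coefficient between the two ratios can be obtained across orthologous gene groups, the sketch below computes a Pearson coefficient from paired values. The text does not state which coefficient was used, so Pearson is an assumption here, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

Here `xs` would hold the KA/KS ratio of each orthologous gene group and `ys` the KR/KC ratio of the same group under one classification.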
Overrepresented functional categories undergoing opposite selection pressures inferred by two ratios

There are four types of selection pressure experienced by the orthologous gene groups. The number of orthologous gene groups that experienced relaxed or purifying selection pressures in the two ratios is shown in Table 3 and the gene lists are given in Supplement C. Since KA/KS was on the whole positively correlated with KR/KC in mammals, more groups underwent the same selection pressure in the two ratios than underwent opposite selection pressures. The groups with opposite selection pressures are found only in the combination of a high KR/KC and a low KA/KS ratio.
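The four-way grouping can be sketched as follows. Each ratio is labeled "relaxed" in its top tail and "purifying" in its bottom tail, and groups labeled under both ratios are counted by combination. The cutoff percentage is not recoverable from the text, so it is left as a parameter (`fraction`); the function names are hypothetical, and ties are broken by sort order in this sketch.

```python
def tail_labels(values, fraction):
    """Label the top `fraction` of values 'relaxed', the bottom 'purifying', others None."""
    if not 0 <= fraction <= 0.5:
        raise ValueError("fraction must be between 0 and 0.5")
    n = len(values)
    k = int(n * fraction)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [None] * n
    for i in order[:k]:
        labels[i] = "purifying"
    for i in order[n - k:]:
        labels[i] = "relaxed"
    return labels


def count_combinations(ka_ks, kr_kc, fraction):
    """Count gene groups by their (KA/KS label, KR/KC label) combination."""
    counts = {}
    pairs = zip(tail_labels(ka_ks, fraction), tail_labels(kr_kc, fraction))
    for label_a, label_r in pairs:
        # Only groups that fall in a tail of both ratios are counted.
        if label_a is not None and label_r is not None:
            key = (label_a, label_r)
            counts[key] = counts.get(key, 0) + 1
    return counts
```

The key `("purifying", "relaxed")` corresponds to the low KA/KS and high KR/KC combination discussed in the text.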

To assess the functions of groups that underwent different selection pressures, we examined significantly overrepresented Gene Ontology (GO) categories of mouse genes in orthologous gene groups subject to each type of selection pressure (Fig. 3, Supplement D). The overrepresented functions of genes with a high KR/KC and a high KA/KS ratio are related to "response to stimulus" and "physiological process". In particular, several functions related to defense response can be clearly found in these genes. Since genes related to defense response are in general accepted as genes undergoing positive selection, these results seem biologically reasonable. On the other hand, the overrepresented functions of genes with a low KA/KS ratio are related to development. This result is also reasonable because most of the genes related to development are subject to purifying selection based on the KA/KS ratio between distantly related species (Powell et al. 1; Slack, Holland, and Graham 1). However, it is unclear whether this holds true if the KR/KC ratio is used to evaluate the selection pressure in genes related to development. In genes with a low KA/KS ratio, sex determination and cell differentiation are overrepresented in genes with a high and a low KR/KC ratio, respectively (Fig. 3). Sex determination is likely conserved among mammals but cell differentiation may be required to be somewhat different among mammals for the divergent evolution seen in mammals. Thus, it is possible that relaxed selection pressures indicated by the KR/KC ratio may be one of the important factors for the evolution in mammals. To further examine the different gene functions between the high and low KR/KC ratios in mammalian development, we examined the expression of mouse genes with different selection pressures using the mouse expression dataset covering various stages of development (Fig. 4A, B).
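The expression bias R defined in Materials & Methods (observed over expected number of expressed genes at a stage) can be computed with a small helper. This is only a sketch of that definition; the function and parameter names are hypothetical.

```python
def expression_bias(n_observed, p_all, n_selected):
    """Return R = Nob. / Nex., where Nex. = Pall * Nselected.

    n_observed -- genes of the selection class expressed at the stage (Nob.)
    p_all      -- proportion of all mouse genes expressed at the stage (Pall)
    n_selected -- total number of genes in the selection class (Nselected)
    """
    if not 0 < p_all <= 1:
        raise ValueError("p_all must be in (0, 1]")
    if n_selected <= 0:
        raise ValueError("n_selected must be positive")
    n_expected = p_all * n_selected
    return n_observed / n_expected
```

A value above 1 means that the selection class is expressed at that stage more often than expected from the genome-wide proportion, and a value below 1 means less often.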
Genes subject to purifying selection based on both ratios are expressed at high levels at the early developmental stages (One cell egg to Blastocyst). On the other hand, genes subject to purifying selection indicated by KA/KS but relaxed selection indicated by KR/KC were expressed predominantly in the early middle stage of development (Blastocyst to Amnion). The relaxed pressures indicated solely by the KR/KC ratio in the early middle stage of development may be important for the divergent evolution in mammals.

Discussion

The key finding of the present study is that a positive correlation between KA/KS and

KR/KC at a genomic scale is observed in all amino acid classifications, indicating that the two tests of selection pressure give similar conclusions in mammalian evolution. In particular, the KR/KC ratio of the new classification is useful for estimating selection pressure between distantly related sequences (Gojobori 1; Smith and Smith 1). Since the evolutionary rate of synonymous substitution is much faster than that of nonsynonymous substitution, KS is often saturated between distant sequences. On the other hand, the KR/KC ratio is estimated by only amino acid replacements and the evolutionary rate of amino acid replacement is much slower than that of synonymous substitution, so that the KR/KC ratio can be estimated for distant sequences. Thus, the new classification (A) can produce a useful KR/KC ratio for estimating the selection pressure in distant sequences. It should be noted that several reports had classified amino acid replacements into radical and conservative amino acid changes by the likelihood of amino acid replacements and estimated selection pressures by such radical and conservative amino acid changes (Tang et al. 00; Gojobori et al. 00). On the other hand, in the present study, we defined radical and conservative changes by the likelihoods of nonsynonymous and synonymous substitutions. Therefore, the selection pressures inferred by radical and conservative changes under our definition should more likely lead to similar selection pressures inferred by the KA/KS ratio. However, a major limitation in substituting KR/KC for KA/KS is that, even when we used the new classification aimed at maximizing the correlation between KR/KC and KA/KS, the correlation between KR/KC and KA/KS is still less than 0.. There are potentially two reasons why the two ratios are not highly correlated. One reason is biological.
For some genes, KR/KC may not be related to the type of natural selection identified by KA/KS. The other reason is technical. In the computation of the KR/KC ratio, radical and conservative changes were defined as amino acid replacements between groups and within groups, respectively. In view of the fact that the radical and conservative changes are defined to be always 0 or 1, the KR/KC ratio may not fully represent the selection pressure of amino acid replacements. We note that there are many orthologous gene groups with a low KA/KS and a high KR/KC as outliers. To address the opposite selection pressures, we examined the functions of mouse genes and found that functional categories related to development were overrepresented in these genes. We then examined these gene expression patterns at different developmental stages. The mouse genes that underwent such selection pressures tend to be over-expressed in the early

middle developmental stages. Richardson (1) proposed that the early middle developmental stages were important for speciation of mammals because these are the stages when many adult traits are specified even if these stages were conservative in the morphological level. Therefore, we propose that the relaxed selection pressures indicated by KR/KC but not by KA/KS in the early middle developmental stages may be important for the morphological divergence of mammals at the adult stage, while purifying selection detected by KA/KS tends to occur in the early middle developmental stages. The differences in the selection pressures assessed by KA/KS and KR/KC indicate that, although genes involved in development have strong constraints in amino acid substitutions, radical changes in the substitutions permitted are likely important for developmental divergence of adult mammals. Thus, opposite selection pressures in the two ways might play an important role in the evolution of genes related to development in mammals. In summary, we inferred, orthologous gene groups in mammalian species in a stringent manner. KR/KC is positively correlated with KA/KS. The correlation was observed in each of four chemical classifications taking account of aromaticity, charge, polarity or volume. In particular, the chemical classification for aromaticity, charge and volume led to the highest correlation between these two ratios. Moreover, the genes with high KR/KC but low KA/KS were overrepresented among genes expressed at a high level in the early middle developmental stages. The selection pressures at these developmental stages may be important for the morphological diversification of mammals.

Acknowledgements

We thank the members of our laboratories for valuable comments and discussion. This study was supported by NIH grant (GM0) to W.-H. L. and an NSF grant (DBI-01) to S.-H. S.

Table 1. Four classifications of amino acids.

Classification A by the maximum correlation with the KA/KS ratio
  Neutral & small (MW*: -1): A N C G P S T
  Neutral & large (MW*: 1-0): I L M V
  Basic acid, Aromaticity & Relatively small (MW*: -1): R Q H K F W Y
  Acidic charge & Relatively large (MW*: 1-1): D E

Classification B by polarity & volume
  Special: C
  Neutral and Small: A G P S T
  Polar & relatively small: N D Q E
  Polar & relatively large: R H K
  Nonpolar & relatively small: I L M V
  Nonpolar & relatively large: F W Y

Classification C by charge & aromatic
  Acidic: D E
  Neutral & No aromaticity: Q A V L I C S T N G P M
  Neutral & Aromaticity: F Y W
  Basic: K R H

Classification D by charge & polarity
  Neutral & Polarity: S T Y C N Q
  Acidic & Polarity: D E
  Basic & Polarity: K R H
  No polarity: G A V L I F P M W

*MW: Molecular weight
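As a worked illustration of how a classification in Table 1 is applied, the sketch below labels an amino acid replacement as conservative (within a group) or radical (between groups) under Classification A. It only counts observed differences between two aligned sequences; it does not normalize by the numbers of radical and conservative sites as a rate estimate such as KR or KC would require. The group keys and function names are hypothetical.

```python
# Classification A from Table 1, keyed by an illustrative group label.
CLASSIFICATION_A = {
    "neutral_small": "ANCGPST",
    "neutral_large": "ILMV",
    "basic_aromatic": "RQHKFWY",
    "acidic": "DE",
}

# Reverse lookup: one-letter amino acid code -> group label.
GROUP_OF = {aa: group for group, aas in CLASSIFICATION_A.items() for aa in aas}


def replacement_type(a, b):
    """Return 'conservative' if a and b share a group, otherwise 'radical'."""
    if a == b:
        raise ValueError("no replacement: both residues are identical")
    return "conservative" if GROUP_OF[a] == GROUP_OF[b] else "radical"


def count_changes(ancestor, descendant):
    """Count radical and conservative differences between two aligned, gap-free sequences."""
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must have equal length")
    counts = {"radical": 0, "conservative": 0}
    for a, b in zip(ancestor, descendant):
        if a != b:
            counts[replacement_type(a, b)] += 1
    return counts
```

For example, D to E stays within the acidic group and is conservative, whereas L to A moves between the neutral large and neutral small groups and is radical.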

Table 2. Correlation coefficient between KR/KC and KA/KS.

                           KR/KC (Classification B)   KR/KC (Classification C)   KR/KC (Classification D)   KA/KS
KR/KC (Classification A)   0.                         0.                         0.                         0.
KR/KC (Classification B)                              0.                         0.                         0.
KR/KC (Classification C)                                                         0.                         0.
KR/KC (Classification D)                                                                                    0.

Table 3. The number of orthologous groups undergoing different selection pressures.

Columns:
  Orthologous groups under relaxed selection indicated by KR/KC ( % top of KR/KC ratio)
  Orthologous groups under purifying selection indicated by KR/KC ( % bottom of KR/KC ratio)
Rows:
  Orthologous groups under relaxed selection indicated by KA/KS ( % top of KA/KS ratio)
  Orthologous groups under purifying selection indicated by KA/KS ( % bottom of KA/KS ratio)

0 1 1

Literature Cited

Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res :-0.
Gojobori, J., H. Tang, J. M. Akey, and C. I. Wu. 00. Adaptive evolution in humans revealed by the negative correlation between the polymorphism and fixation phases of evolution. Proc Natl Acad Sci U S A :0-1.
Gojobori, T. 1. Codon substitution in evolution and the "saturation" of synonymous changes. Genetics :-.
Grantham, R. 1. Amino acid difference formula to help explain protein evolution. Science 1:-.
Hanada, K., T. Gojobori, and W. H. Li. 00. Radical amino acid change versus positive selection in the evolution of viral envelope proteins. Gene :-.
Hill, D. P., J. A. Blake, J. E. Richardson, and M. Ringwald. 00. Extension and integration of the gene ontology (GO): combining GO vocabularies with external vocabularies. Genome Res 1:1-11.
Hughes, A. L., and M. Nei. 1. Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature :1-.
Hughes, A. L., T. Ota, and M. Nei.. Positive Darwinian selection promotes charge profile diversity in the antigen-binding cleft of class I major-histocompatibility-complex molecules. Mol Biol Evol :1-.
Li, W. H., and T. Gojobori. 1. Rapid evolution of goat and sheep globin genes following gene duplication. Mol Biol Evol 1:-.
Miyata, T., S. Miyazawa, and T. Yasunaga. 1. Two types of amino acid substitutions in protein evolution. J Mol Evol 1:1-.
Powell, J. R., A. Caccone, J. M. Gleason, and L. Nigro. 1. Rates of DNA evolution in Drosophila depend on function and developmental stage of expression. Genetics 1:1-.
Ringwald, M., J. T. Eppig, D. A. Begley, J. P. Corradi, I. J. McCright, T. F. Hayamizu, D. P. Hill, J. A. Kadin, and J. E. Richardson. 001. The Mouse Gene Expression Database (GXD). Nucleic Acids Res :-1.
Saitou, N., and M. Nei. 1. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol :0-.
Slack, J. M., P. W. Holland, and C. F. Graham. 1. The zootype and the phylotypic stage. Nature 1:0-.
Smith, J. M., and N. H. Smith. 1. Synonymous nucleotide divergence: what is "saturation"? Genetics 1:-.
Smith, N. G. 00. Are radical and conservative substitution rates useful statistics in molecular evolution? J Mol Evol :-.
Tang, H., G. J. Wyckoff, J. Lu, and C. I. Wu. 00. A universal evolutionary index for amino acid changes. Mol Biol Evol 1:1-1.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res :-0.
Yang, Z., S. Kumar, and M. Nei. 1. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics :-.
Zhang, J. 000. Rates of conservative and radical nonsynonymous nucleotide substitutions in mammalian nuclear genes. J Mol Evol 0:-.
Zhang, J., H. F. Rosenberg, and M. Nei. 1. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc Natl Acad Sci U S A :0-1.

Figure legends

Fig. 1. Construction of ortholog data. The similarity search was conducted with Blastp, as shown in Fig. 1A. Reciprocal best hits were identified between every pair of species, and the number of reciprocal best hits is shown between each pair of species. When sequences were reciprocal best hits of one another among all five species, they were considered an orthologous gene group. A phylogeny was then generated for each orthologous gene group. When the phylogeny of the orthologs from the five species differed from the topology of the species phylogeny, the putative orthologous group was removed from the ortholog data. The species phylogeny is shown in Fig. 1B.

Fig. Correlation between KA/KS and KR/KC. The X-axis is the KA/KS ratio and the Y-axis is the KR/KC ratio. The ratios were computed based on (A) classification A (r=0.); (B) classification B (r=0.); (C) classification C (r=0.); and (D) classification D (r=0.).

Fig. Overrepresented functions in genes with a low KA/KS and a high KR/KC ratio and in genes with a low KA/KS and a low KR/KC ratio. The arrowheads point to subcategories. (A) Categories overrepresented in genes with a low KA/KS and a high KR/KC ratio are in black circles (P < 0.0). (B) Categories overrepresented in genes with a low KA/KS and a low KR/KC ratio are in black circles (P < 0.0).

Fig. Expression levels of genes under different selection pressures in each developmental stage. (A) The X-axis indicates the developmental stage. The stages, listed in ascending order, are: (One cell egg), (Beginning of cell division), (Morula), (Advanced division/segmentation), (Blastocyst), (Implantation), (Formation of egg cylinder), (Differentiation of egg cylinder), (Advanced endometrial reaction; prestreak), (Amnion; midstreak), (Neural plate, presomite; no allantoic bud), (First somites; late head fold), (Turning), (Formation & closure anterior neuropore), (Formation of posterior neuropore, forelimb bud), (Closure post. neuropore, hindlimb & tail bud), (Deep lens indentation), (Closure lens vesicle), (Complete separation of lens vesicle), (Earliest sign of fingers), (Anterior footplate indented, marked pinna), (Fingers separate distally), (Toes separate), (Reposition of umbilical hernia), (Fingers and toes joined together), (Long whiskers) and (Postnatal development). The Y-axis indicates the normalized difference in the number of expressed genes between genes undergoing a given selection pressure and all genes. (B) The sliding window analysis ( stages) was conducted based on (A). The X-axis is the mean of the normalized difference in five developmental stages. The Y-axis indicates the average normalized difference in each window.
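The ortholog construction described in the Fig. 1 legend can be illustrated with a minimal Python sketch. It assumes that the best Blastp hit of every gene is already available as a dictionary keyed by (query species, target species); the function names, the data layout, and the toy gene identifiers are hypothetical and are not taken from the paper. Only three species are used to keep the example short, and the later step that removes groups whose gene phylogeny conflicts with the species phylogeny is not included.

```python
from itertools import combinations

def reciprocal_best_hits(best_hits, sp_a, sp_b):
    """Map each gene of sp_a to its partner in sp_b when the two genes
    are each other's best hit."""
    a_to_b = best_hits[(sp_a, sp_b)]
    b_to_a = best_hits[(sp_b, sp_a)]
    return {ga: gb for ga, gb in a_to_b.items() if b_to_a.get(gb) == ga}

def orthologous_groups(best_hits, species):
    """Return groups holding one gene per species in which every pair of
    genes is a reciprocal best hit (an assumed reading of the legend)."""
    pairs = list(combinations(species, 2))
    rbh = {(a, b): reciprocal_best_hits(best_hits, a, b) for a, b in pairs}
    first, others = species[0], species[1:]
    groups = []
    # Seed candidate groups from genes of the first species.
    for seed in sorted(rbh[(first, others[0])]):
        group = {first: seed}
        for sp in others:
            partner = rbh[(first, sp)].get(seed)
            if partner is None:
                break
            group[sp] = partner
        else:
            # Require reciprocity between every pair, not only with the seed.
            if all(rbh[(a, b)].get(group[a]) == group[b] for a, b in pairs):
                groups.append(group)
    return groups

# Toy best-hit tables (hypothetical identifiers, not real data).
best_hits = {
    ("human", "mouse"): {"h1": "m1", "h2": "m2"},
    ("mouse", "human"): {"m1": "h1", "m2": "h9"},  # m2 does not point back to h2
    ("human", "dog"): {"h1": "d1", "h2": "d2"},
    ("dog", "human"): {"d1": "h1", "d2": "h2"},
    ("mouse", "dog"): {"m1": "d1", "m2": "d2"},
    ("dog", "mouse"): {"d1": "m1", "d2": "m2"},
}
groups = orthologous_groups(best_hits, ["human", "mouse", "dog"])
```

In this toy input only the h1/m1/d1 trio survives, because h2 and m2 are not reciprocal best hits. Requiring reciprocity for every species pair is the strictest reading of the legend; a looser rule would keep more groups.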

FIG 1. (A) Number of reciprocal best hits between each pair of species (human, chimpanzee, mouse, rat, dog). (B) Species phylogeny of human, chimpanzee, mouse, rat and dog.

FIG. Scatter plots of the KR/KC ratio (Y-axis) against the KA/KS ratio (X-axis). Panels: (A) classification A; (B) classification B; (C) classification C; (D) classification D.

FIG. Gene ontology categories shown in each panel.
(A) embryonic_development (sensu_metazoa), embryonic_development, axis_specification, development, pattern_specification, anterior/posterior pattern_formation, biological_process, cellular_process, cell_differentiation, epidermal_cell_differentiation, regulation_of_biological_process, regulation_of_development, regulation_of_epidermis_development, regulation_of_binding
(B) sex_determination, male_sex_determination, development, pattern_specification, axis_specification, biological_process, growth, developmental_growth, blastocyst_growth, response_to_stimulus, behavior, visual_behavior

FIG. Panels A and B. Plotted gene sets: genes under purifying selection indicated by both KA/KS and KR/KC; genes under purifying selection indicated by KA/KS but relaxed selection indicated by KR/KC; genes under relaxed selection indicated by both KA/KS and KR/KC.
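The sliding window analysis mentioned for panel B in the legend can be sketched as a plain moving average over the per-stage normalized differences. This is only an illustration: the function name is hypothetical, the input values are made up, and the window width of 5 is an assumption chosen for the example rather than a value confirmed by the text.

```python
def sliding_window_means(values, window):
    """Return the mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Made-up normalized differences for seven consecutive stages.
per_stage = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Assumed window width; each output value summarizes five adjacent stages.
window_means = sliding_window_means(per_stage, window=5)
```

With n stages and a window of width w, this yields n - w + 1 averaged points, so the smoothed curve is shorter than the per-stage series.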