Detecting the correlated mutations based on selection pressure with CorMut

Similar documents
Detecting the correlated mutations based on selection pressure with CorMut

Supporting Information

7.36/7.91 recitation CB Lecture #4

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

BINF 730. DNA Sequence Alignment Why?

Correction for Ascertainment

QB LECTURE #4: Motif Finding

Understanding relationship between homologous sequences

Monomeric Clavularia CFP July 19, 2006

Solution: (a) Before opening the parachute, the differential equation is given by: dv dt. = v. v(0) = 0

Local Alignment Statistics

Compartmentalization detection

Grilled it ems are prepared over real mesquit e wood CREATE A COMBO STEAKS. Onion Brewski Sirloin * Our signature USDA Choice 12 oz. Sirloin.

Quantifying sequence similarity

Maximum Likelihood Estimation on Large Phylogenies and Analysis of Adaptive Evolution in Human Influenza Virus A

Today s Lecture: HMMs

Pairwise sequence alignment

INFORMATION-THEORETIC BOUNDS OF EVOLUTIONARY PROCESSES MODELED AS A PROTEIN COMMUNICATION SYSTEM. Liuling Gong, Nidhal Bouaynaya and Dan Schonfeld

Housing Market Monitor

Probabilistic modeling and molecular phylogeny

Structural Perspectives on Drug Resistance

Lecture Notes: BIOL2007 Molecular Evolution

Helping Kids Prepare For Life (253)

I of a gene sampled from a randomly mating popdation,

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Lecture 22: Signatures of Selection and Introduction to Linkage Disequilibrium. November 12, 2012

PHP 2510 Probability Part 3. Calculate probability by counting (cont ) Conditional probability. Law of total probability. Bayes Theorem.

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Multiple Sequence Alignment using Profile HMM

One-Way Tables and Goodness of Fit

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Fuzzy modeling and control of HIV infection

Tools and Algorithms in Bioinformatics

Sample Size Estimation for Studies of High-Dimensional Data

Introduction to Wright-Fisher Simulations. Ryan Hernandez

RENEWAL PROCESSES AND POISSON PROCESSES

9.5 The Transfer Function

bioinformatics 1 -- lecture 7

Keywords: Multimode process monitoring, Joint probability, Weighted probabilistic PCA, Coefficient of variation.

Processes of Evolution

Statistical-mechanical study of evolution of robustness in noisy environments

Drosophila melanogaster and D. simulans, two fruit fly species that are nearly

Hidden Markov Models

Introduction to Bioinformatics

Computational Biology

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98)

Sequence analysis and comparison

Sequence Alignment (chapter 6)

Local Alignment: Smith-Waterman algorithm

Fitness landscapes and seascapes

Final Exam. 1. Definitions: Briefly Define each of the following terms as they relate to the material covered in class.

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

Exhibit 2-9/30/15 Invoice Filing Page 1841 of Page 3660 Docket No

GENETICS - CLUTCH CH.22 EVOLUTIONARY GENETICS.

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

Score: Fall 2009 Name Row 80. C(t) = 30te- O. 04t

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Molecular Evolution & the Origin of Variation

Molecular Evolution & the Origin of Variation

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Bioinformatics Exercises

Accuracy and Power of the Likelihood Ratio Test in Detecting Adaptive Molecular Evolution

Lecture Notes: Markov chains

Copyright 2000 N. AYDIN. All rights reserved. 1

Application of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data

Introduction to Bioinformatics Online Course: IBT

Chemical properties that affect binding of enzyme-inhibiting drugs to enzymes

KaKs Calculator: Calculating Ka and Ks Through Model Selection and Model Averaging

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

Practical considerations of working with sequencing data

ANALYSING BINARY DATA IN A REPEATED MEASUREMENTS SETTING USING SAS

Introduction to Binary Convolutional Codes [1]

STAT5044: Regression and Anova. Inyoung Kim

Comparing whole genomes

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Nursing Facilities' Life Safety Standard Survey Results Quarterly Reference Tables

Comparing Means from Two-Sample

Week 10: Homology Modelling (II) - HHpred

A Mathematical Study of Germinal Center Formation

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Pathway Association Analysis Trey Ideker UCSD

SWEEPFINDER2: Increased sensitivity, robustness, and flexibility

ACCOUNTING FOR INPUT-MODEL AND INPUT-PARAMETER UNCERTAINTIES IN SIMULATION. < May 22, 2006


Taming the Beast Workshop

SEQUENCE DIVERGENCE,FUNCTIONAL CONSTRAINT, AND SELECTION IN PROTEIN EVOLUTION

Supplementing information theory with opposite polarity of amino acids for protein contact prediction

i.ea IE !e e sv?f 'il i+x3p \r= v * 5,?: S i- co i, ==:= SOrq) Xgs'iY # oo .9 9 PE * v E=S s->'d =ar4lq 5,n =.9 '{nl a':1 t F #l *r C\ t-e

QM/MM study on inhibitor of HIV-1 protease. Graduate School of Science, Kyoto University Masahiko Taguchi, Masahiko Kaneso, Shigehiko Hayashi

NUCLEOTIDE SUBSTITUTIONS AND THE EVOLUTION OF DUPLICATE GENES

REGRESSION ANALYSIS BY EXAMPLE

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

D. Incorrect! That is what a phylogenetic tree intends to depict.

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Transcription:

Detecting the correlated mutations based on selection pressure with CorMut Zhenpeng Li October 30, 2017 Contents 1 Introduction 1 2 Methods 2 3 Implementation 3 1 Introduction In genetics, the Ka/Ks ratio is the ratio of the number of non-synonymous substitutions per non-synonymous site (Ka) to the number of synonymous substitutions per synonymous site (Ks)[1]. Ka/Ks ratio was used as an indicator of selective pressure acting on a protein-coding gene. A Ka/Ks ratio of 1 indicates neutral selection, i.e., the observed ratio of non-synonymous mutations versus synonymous mutations exactly matches the ratio expected under a random mutation model. Thus, amino acid changes are neither being selected for nor against. A Ka/Ks value of <1 indicates negative selection pressure. That is to say most amino acid changes are deleterious and are selected against, producing an imbalance in the observed mutations that favors synonymous mutations. In the condition of Ka/Ks>1, it indicates that amino acid changes are favored, i.e., they increase the organism s fitness. This unusual condition may reflect a change in the function of a gene or a change in environmental conditions that forces the organism to adapt. For example, highly variable viruses mutations which confer resistance to new antiviral drugs might be expected to undergo positive selection in a patient population treated with these drugs, such as HIV and HCV. The concept of correlated mutations is one of the basic ideas in evolutionary biology. The amino acid substitution rates are expected to be limited by functional constraints. Given the functional constraints operating on gene, a mutation in one position can be compensated by an additional mutation. Then mutation patterns can be formed by correlated mutations responsible for specific conditions. Here, we developed an R/Bioconductor package to detect the correlated mutations among positive selection sites by combining ka/ks ratio and correlated mutations analysis. The definition of Ka/Ks is based on Chen et al[2]. CorMut provides functions for computing kaks for individual site or specific amino acids and detecting correlated mutations among them. Two methods are provided for detecting correlated mutations, 1

including conditional selection pressure and mutual information, the computation consists of two steps: First, the positive selection sites are detected; Second, the mutation correlations are computed among the positive selection sites. Note that the first step is optional. Meanwhile, CorMut facilitates the comparison of the correlated mutations between two conditions by the means of correlated mutation network. 2 Methods 2.1 Ka/Ks calculation for individual codon position and specific amino acid substitutions The Ka/Ks values for individual codon (specific amino acid substitutions) were determined as described by Chen et al. The Ka/Ks of individual codon (specific amino acid substitution (X2Y) for a codon) was computed as follows: Ka Ks = NY N S n Y,t f t +n Y,v fv n S,t f t +n S,v fv where NY and NS are the count of samples with nonsynonymous mutations(x2y mutation) at that codon and the count samples with of synonymous mutations observed at that codon respectively. Then, NY/NS is normalized by the ratio expected under a random mutation model (i.e., in the absence of any selection pressure), which was represented as the denominator of the formula. In the random mutation model, ft and fv indicate the transition and transversion frequencies respectively, and they were measured from the entire data set according to the following formulas: ft = Nt/ntS and fv = Nv/nvS, where S is the total number of samples; Nt and Nv are the numbers of observed transition and transversion mutations, respectively; nt and nv are the number of possible transitions and transversions in the focused region (simply equal to its length L and 2L respectively).lod confidence score for a codon or mutation X2Y to be under positive selection pressure was calculated by the following formula: ( ( K LOD = log 10 p(i N Y axa N, q, a N N K s = 1) = log 10 )Y Xa i=n Y axa i Where N = NYaXa + NYsXa and q as defined above. If LOD > 2, the positive selection is significant. 2.2 Mutual information MI content was adopted as a measure of the correlation between residue substitutions[3]. Accordingly, each of the N columns in the multiple sequence alignment generated for a protein of N codons is considered as a discrete random variable Xi(1 i N). When compute Mutual information between codons, Xi(1 i N) takes on one of the 20 amino acid types with some probability, then compute the Mutual information between two sites for specific amino acid substitutions, Xi(1 i N) were only divided two parts (the specific amino acid mutation and the other) with some probability. Mutual information describes the mutual dependence of the two random variables Xi, Xj, The MI associated the random variables Xi and Xj corresponding to the ith and jth columns is defined as I(X i, X j ) = S(X i ) + S(X j ) S(X i, X j ) = S(X i ) S(X i X j ) Where S(X i ) = P (X i = x i ) log P (X i = x i ) allx i is the entropy of Xi, S(X i X j ) = allx i allx j P (X i = x i, X j = x j ) log P (X i = x i X j = x j ) ) q i (1 q) N i 2

is the conditional entropy of Xi given Xj, S(Xi, Xj) is the joint entropy of Xi and Xj. Here P(Xi = xi, Xj = xj) is the joint probability of occurrence of amino acid types xi and xj at the ith and jth positions, respectively, P(Xi = xi) and P(Xj = xj) are the corresponding singlet probabilities. I(Xi, Xj) is the ijth element of the N N MI matrix I corresponding to the examined multiple sequence alignment. 2.3 Jaccard index Jaccard index measures similarity between two variables, and it has been widely used to measure the correlated mutations. For a pair of mutations X and Y, the Jaccard index is calculated as Nxy/(Nxy+Nx0+Ny0), where Nxy represents the number of sequences where X and Y are jointly mutated, Nx0 represents the number of sequences with only X mutation, but not Y, and Ny0 represents the sequences with only Y mutation, but not X. The computation was deemed to effectively avoid the exaggeration of rare mutation pairs where the number of sequences without either X or Y is very large. 3 Implementation 3.1 Process the multiple sequence alignment files seqformat replace the raw bases with common bases and delete the gaps according to the reference sequence. As one raw base has several common bases, then a base causing amino acid mutation will be randomly chosen. Here, HIV protease sequences of treatment-naive and treatment will be used as examples. > library(cormut) > examplefile=system.file("extdata","pi_treatment.aln",package="cormut") > examplefile02=system.file("extdata","pi_treatment_naive.aln",package="cormut" > example=seqformat(examplefile) > example02= seqformat(examplefile02) 3.2 Compute kaks for individual codon > result=kakscodon(example) mutation kaks lod freq 1 3 150.000000 52.827378 1.0000 2 10 8.500000 24.207139 0.7367 3 12 14.500000 5.106647 0.0967 4 19 2.538855 3.208047 0.1167 5 24 3.233179 3.131694 0.1000 6 35 4.300086 4.600433 0.2633 3.3 Compute kaks for individual amino acid mutation > result=kaksaa(example) position mutation kaks lod freq 1 3 V3I 404.841453111799 179.656919924817 0.9967 3

2 10 L10I 106.510984639863 194.782739610876 0.6033 3 12 T12N 3 Inf 0.01 4 12 T12P 61.1996817820207 4.49416430538097 0.0267 5 12 T12E 7 Inf 0.0233 6 12 T12S 38.2498011137629 2.04605159081416 0.0167 3.4 Compute the conditional kaks(conditional selection pressure) among codons > result=ckakscodon(example) 48 93 71 24 82 77 54 10 37 19 3 63 62 35 41 90 12 73 position1 position2 ckaks lod 1 3 10 17 24.2071393392468 2 3 12 29 4.07861778901451 3 3 19 1.84210526315789 3.20804706405523 4 3 24 3.33333333333333 3.13169395564291 5 3 35 13.1666666666667 4.6004332373075 6 3 37 142 31.4550748700319 3.5 Compute the conditional kaks(conditional selection pressure) among amino acid mutations > result=ckaksaa(example) 4

K70T L89M I47V S37E C95FI85V K45R I13V I64V L33I V32I I15V M36I G73S V82T R41K T12E T12S E35D I93L V77I M46II84V V3I L90M K20IL63P I62V T74S L19I L76V H69K L10I S37N S37D A71V V82A F53L R57K Q58E I54V G73T L24IM46L T12P G48V D60E S37H I72V K43T E34Q T12N position1 position2 ckaks lod 1 V3I L10I 13.92 226.78 2 V3I T12N 3 Inf 3 V3I T12P 8 10.89 4 V3I T12E 7 Inf 5 V3I T12S 5 6.8 6 V3I I13V 33 18.68 3.6 Compute the mutual information among codons An instance with two matrixes will be returned, they are mutation information matrix and p matrix, The ijth element of the N N MI matrix indicates the mutation information or p value between position i and position j. > result=micodon(example) 5

3 62 10 71 73 93 82 54 99 48 90 77 35 37 63 24 12 19 41 position1 position2 mi p.value 1 10 3 0.0223403798434836 0.02023988005997 2 19 12 0.128809592522988 0.00371553353757904 3 37 19 0.0747160966532552 0.0229275606099389 4 37 35 0.075806272221832 0.00371553353757904 5 41 37 0.06512044936218 0.00371553353757904 6 48 10 0.0897750853050767 0.00371553353757904 3.7 Compute the mutual information among individual amino acid mutations An instance with two matrixes will be returned, they are mutation information matrix and p matrix, The ijth element of the N N MI matrix indicates the mutation information or p value between two amino acid mutations. > result=miaa(example) 6

H69K I47V T12S I13V K70T V32I S37E S37H I85V L33I L76V V82T R41K L89M K45R M46I E35D V3I L63P C95F R57K M36I G73S Q58E L90M D60E I84V S37N I93L I15V V77I I72V I62V A71V L19I L10I G73T M46L S37D L24I K20I F53L I64V I54V V82A K43T T74S T12E T12N G48V T12P E34Q position1 position2 mi p.value 1 L10I V3I 0.00309067986573841 7.08455905999841e-40 2 T12N L10I 8.64708687707827e-05 0.0356066752680534 3 T12E L10I 0.000640519978094889 1.47993613367444e-09 4 I13V V3I 0.000389134320368023 2.79369604871158e-05 5 I13V T12S 0.00478659729991238 5.41272016874378e-63 6 I15V V3I 0.00051661973913858 6.61625526384279e-07 3.8 Compute the Jaccard index among individual amino acid mutations An instance with two matrixes will be returned, they are Jaccard index matrix and p matrix, The ijth element of the N N MI matrix indicates the Jaccard index or p value between two amino acid mutations. > result=jiaa(example) 7

T12P K14R T12E L19I I15V position1 position2 JI p.value 1 T12P K14R 0.192307692307692 0.0181798773738328 2 T12E I15V 0.162790697674419 0.00035221478070637 3 L19I I15V 0.228070175438596 0.00288369201418971 4 L19I T12E 0.172413793103448 0.0181798773738328 3.9 Comparison of the correlated mutations between two conditions by the means of correlated mutation network bicompare compare the correlated mutation relationship between two conditions, such as HIV treatment-naive and treatment. The result of bicompare can be visualized by plot method. Only positive selection codons or amino acid mutations were displayed on the plot, blue nodes indicate the distinct positive codons or amino acid mutations of the first condition, that is to say these nodes will be non-positive selection in second condition. While red nodes indicate the distinct positive codons or amino acid mutations appeared in the second condition. plot also has an option for displaying the unchanged positive selection nodes in both conditions. If plotunchanged is FALSE, the unchanged positive selection nodes in both conditions will not be displayed. > biexample=bickaksaa(example02,example) > result=bicompare(biexample) 8

L38F L38F E65D E65D T74S T74S H69N H69N L76V K14R S37A Q58E K20I G73T L24I K43TG48V I84VM46L G73S I54V V82A M46I E34Q L90M A71V L33I V32I C95F F53L I85V V82T K45R T12E L76V K14R S37A Q58E K20I G73T L24I K43TG48V I84VM46L G73S I54V V82A M46I E34Q L90M A71V L33I V32I C95F F53L I85V V82T K45R T12E I47V K70T L89M I47V K70T L89M References [1] Hurst, L. D. The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends in genetics: TIG 18, 486 (2002). [2] Chen, L., Perlina, A., Lee, C. J. Positive selection detection in 40,000 human immunodeficiency virus (HIV) type 1 sequences automatically identifies drug resistance and positive fitness mutations in HIV protease and reverse transcriptase. Journal of virology 78, 3722-3732 (2004). [3] Cover, T. M., Thomas, J. A., Wiley, J. Elements of information theory. 6, (Wiley Online Library: 1991). 9