Molecular Population Genetics of Arabidopsis thaliana Ferulate-5-Hydroxylase and Flavanone-3-Hydroxylase Genes

PLSC 731 Plant Molecular Genetics
Due: April 13, 2006, 11am

DNA sequence data allow a detailed population genetic analysis of genes. For this assignment, you will work in your groups to analyze the ferulate-5-hydroxylase and flavanone-3-hydroxylase genes of Arabidopsis thaliana. The appropriate data files for this analysis are available for download from the class WWW site. You are to use these data sets and prepare a report that:

   1. presents data that support your description of the degree of diversity and polymorphism among the individuals in this population for the two genes and their expressed proteins;
   2. illustrates and discusses the relationship among the individuals in this population for these two genes and their expressed proteins, with particular reference to geographical location; and
   3. uses population genetics data to support a discussion regarding the evolution of these two loci.

A report that contains two to three written pages (standard format) should be sufficient. It is up to each group to determine what it feels is sufficient detail. But remember, you are writing a report for a graduate class, so it is expected to be representative of a short professional report. (This page length does not include the coversheet, references, and the tables and figures you should also prepare for your report.)

In addition, you are supplied with a number of papers that describe nucleotide variation among a number of different genes in A. thaliana. You should use these manuscripts as a guide for your report. You should also compare the results you obtain for these two genes with the results for the genes described in these papers. In particular, the results in the Genetics (2000) 155:863 article should be of interest to you. Your group should also consult the sequence diversity papers that we discussed in class.

1. Downloading the appropriate software

You will need three software programs for this assignment. All of these are available from the WWW. The programs and where to obtain them are:

   CLUSTALX 1.8.1: sent as an attachment
   TreeView 1.6.6: http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
   DnaSP 4.10.4: http://www.ub.es/dnasp/

2. Downloading the datafiles

You will need four datafiles for this assignment. These are available from the following WWW site for this assignment.

3. Gathering basic information about the sequence diversity within your population

a. To perform a comparative analysis of a collection of DNA or protein sequences, it is best to use a tool that aligns the sequences and readily shows the differences among them. The best way to identify the regions of homology is to align all of the sequences for each gene or protein. To accomplish this, navigate to the Multalin WWW site. Multalin is an on-line software program that will align your sequences. The URL is: http://prodes.toulouse.inra.fr/multalin/multalin.html

b. Open your nucleotide or amino acid file for a particular gene and copy all of the records. Paste the data into the box below the statement "Cut and paste your sequences here below."

c. Now you need to set the following parameters:

   If you are going to perform a DNA comparison, go to the drop-down menu below Symbol comparison table and set the parameter to DNA-5-0. (Click elsewhere on the WWW page. Sometimes this is necessary to ensure the form records your symbol selection.)

   If you are going to perform an amino acid comparison, use the default Blosum62-12-2.

   Next, go to the bottom of the page, find Maximum line length, and set the value to 100. (This will ensure that you can see all of the alignment without needing a slide bar.)

d. Now click on the Start Multalin! button.

e. In a short time, a new page will appear with your alignment.
For the nucleotide data, it will be informative to record the number of nucleotide changes. Remember that you are using Arabidopsis lyrata (Aly-etc) as a reference. The remainder of the records, of the form Ath-etc, refer to different A. thaliana genotypes.

f. It will be informative to determine if your differences occur in the exons or introns of the gene. In the file fah-f3h-cdna-genomic.txt, you will find the entire genomic and cDNA sequence for these two genes. You can add these to the alignment (paste them into the data box along with the other sequences). This alignment should provide you with the information necessary to determine where the introns are located and, thus, the distribution of the variation relative to the exons and introns.

4. Performing multiple alignment and building a phylogenetic tree

Multalin is a great program for visualizing your differences, yet it is not great for tree building. A good program for that task is ClustalX. Building your tree involves three steps. First, you need to perform a multiple alignment in ClustalX. Next, you need to build the tree. Finally, you need to perform a bootstrap analysis of your tree. There are a number of different tree-building algorithms, but ClustalX supports only the neighbor-joining distance method.

a. You will first align the DNA nucleotide sequences. To accomplish this task, you will use the CLUSTALX software. This widely used software performs multiple alignments. The underlying algorithm is CLUSTALW; CLUSTALX is a window-based interface to CLUSTALW. The alignment is developed in three steps. The first step involves a pairwise alignment of each pair of sequences. These alignments are then used to develop a guide tree. Finally, the guide tree is used to create the multiple alignment.

b. Open the CLUSTALX software from the location in which you placed it. Go to File/Load Sequence in the drop-down menus. Find your sequence file (f3h-nt.txt, for example) and select Open.

c. Go to Alignment/Output Format Options in the drop-down menus and check the following boxes:

   CLUSTAL format
   GCG/MSF format
   NBRF/PIR format

   Leave the remainder of the options in their default settings and click CLOSE. Each of these output formats is useful in a number of programs. The CLUSTAL format is used by CLUSTALX, whereas the NBRF/PIR format will be used in a later analysis. The GCG/MSF formatted output will be used in analyses beyond the scope of this assignment.

d. Go to Alignment/Alignment Parameters/Pairwise Alignment Parameters from the drop-down menus. These are the options used by the program for the first step, pairwise alignment. Here you can set the penalty for introducing a gap and the penalty for extending a gap. For DNA alignments, set the following parameters to these values. These may be your default settings.

   Gap Opening: 15.00
   Gap Extension: 6.66

   Using the DNA Weight Matrix IUB is also appropriate. Click CLOSE.

e. Go to Alignment/Alignment Parameters/Multiple Alignment Parameters from the drop-down menus. Again, set the following parameters to these values. These may be your default settings.

   Gap Opening: 15.00
   Gap Extension: 6.66

   Using the DNA Weight Matrix IUB is also appropriate. Click CLOSE.

f. Go to Alignment/Do Complete Alignment. The Output Guide Tree File: and Output Alignment Files: fields show the names that the files written by CLUSTALX will be given. Unless you change these settings, the files will have the same prefix as the input file (f3h-nt, for example) and be given the extensions .dnd for the guide tree, .aln for the Clustal file, .pir for the NBRF/PIR file, and .msf for the GCG/MSF file. Click ALIGN.

g. You can follow the progress of the alignment process at the very bottom of the interface.

h. Phylogenetic trees are essential tools that depict the relationship between the different sequences in your analysis. The trees show which genes are most closely related and which are more distant. All methods are built under the assumption that the sequences with the fewest differences are the most closely related, whereas those with the most differences are the most distantly related. There are three basic types of phylogenetic tree methods. Distance methods use a distance matrix to determine those sequences with the smallest distance from each other and then calculate the distance of each sequence to the node that joins them.
It is called a distance tree because the lengths of the branches are defined distances. Distance methods, such as Neighbor-Joining, produce a single tree. Parsimony methods search for the tree that requires the fewest evolutionary steps. This procedure can generate a number of trees that require the same number of evolutionary steps for their construction. Maximum likelihood procedures attempt to discover the tree that maximizes the probability of observing the data. We will use CLUSTALX to develop a Neighbor-Joining distance tree.

   Go to Trees/Output Format Options. Change the Bootstrap labels on: setting from BRANCH to NODE. This produces a cosmetic change in the manner in which the tree will be labeled. This will be important when you actually display the trees you develop. Click CLOSE.

i. Go to Trees/Draw N-J Tree. The SAVE PHYLIP TREE AS: field shows the name that will be given to the N-J tree. It uses the same file naming convention as described in 4.f, except the file extension is .ph. Click OK.

j. Trees are built based on the alignment. But it is important to attach a level of statistical confidence to the trees that you build. The standard statistical analysis used to provide a confidence level is called bootstrapping. Bootstrapping is an iterative process in which resampled subsets of the data are reanalyzed to determine how frequently certain entries are grouped together. The more often entries are grouped together during the bootstrap process, the higher the confidence you can have that the sequences are related.

   Go to Trees/Bootstrap N-J Tree. The Random number generator seed [1-1000] is a value that seeds the beginning of the bootstrap analysis. (If you are reanalyzing the same data set over and over, you should change this number each time. It can be any number between 1 and 1000.) The Number of bootstrap trials [1-1000] is self-explanatory. Before the advent of computers with high-speed processors, it was computationally expensive to perform a large number of bootstrap trials. That is not the case today. Therefore, you should use the default value of 1000 trials. A second output tree is generated. This tree will contain the bootstrap values on the nodes.
This tree is given a .phb extension.

k. Be sure you perform this analysis for both nucleotide data sets.

5. Performing protein amino acid multiple alignments and tree building

a. CLUSTALX is also used to align protein amino acid sequences. Steps a through g are nearly identical to those described in section 4 for the DNA nucleotide alignments. Here are the differences you need to implement:

   i. Load sequence f3h-aa.txt, for example.

   ii. Go to Alignment/Alignment Parameters/Pairwise Alignment Parameters from the drop-down menus. Set:

      Gap Opening: 35
      Gap Extension: 0.75

      Select BLOSUM in the Protein Weight Matrix.

   iii. Go to Alignment/Alignment Parameters/Multiple Alignment Parameters from the drop-down menus. Set:

      Gap Opening: 15
      Gap Extension: 0.30

      Select BLOSUM in the Protein Weight Matrix.

b. CLUSTALX is also an appropriate tool to create a phylogenetic tree for amino acid sequences. Follow steps 4h through 4j above to create your amino acid tree.

c. Be sure to analyze both amino acid data sets.

6. Viewing the phylogenetic trees

a. To see the trees that you have developed, you need to use the TreeView program. This program is specifically designed to view phylogenetic trees calculated by programs such as CLUSTALX. To view the trees, open the TreeView program.

b. Go to File/Open and select the f3h-nt.phb file (for example).

c. By default, the tree appears as a Slanted Cladogram. To depict this as a distance tree, go to Tree/Phylogram. (You can also click the rightmost tree shape in the group of tree shapes on the menu bar.) To show your bootstrap values, go to Tree/Show internal edge labels. (Alternatively, you can click the menu bar button with 12 surrounded by a tree cluster.)

d. Lastly, you need to save the tree in a format that can be viewed by drawing programs. To do this, go to File/Save as graphic. In the menu box, type the file name f3h-nt-tree-graphic (as an example). Leave the file type as .emf.

e. Now, you need to view and save the protein tree. Open the appropriate .phb file for each protein data set.
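The bootstrap values that TreeView shows as internal edge labels come from the resampling described in step 4.j. The short Python sketch below may help you picture that process; it is only an illustration and not what CLUSTALX actually runs. It resamples the columns of a toy alignment with replacement and counts how often two sequences remain strictly closer to each other than to anything else. The toy sequences, the function names, and the use of a simple "closest pair" rule in place of a full Neighbor-Joining tree are all assumptions made to keep the example short.

```python
import random


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def resample_columns(aln, rng):
    """One bootstrap replicate: draw alignment columns with replacement."""
    n_sites = len(next(iter(aln.values())))
    cols = rng.choices(range(n_sites), k=n_sites)
    return {name: "".join(seq[i] for i in cols) for name, seq in aln.items()}


def grouped(aln, a, b):
    """True if a and b are strictly closer to each other than to any other entry.

    This stands in for 'a and b form a cluster'; a real analysis would
    rebuild the Neighbor-Joining tree for every replicate.
    """
    d_ab = p_distance(aln[a], aln[b])
    others = [name for name in aln if name not in (a, b)]
    return all(
        d_ab < p_distance(aln[a], aln[o]) and d_ab < p_distance(aln[b], aln[o])
        for o in others
    )


def bootstrap_support(aln, a, b, trials=1000, seed=111):
    """Fraction of bootstrap replicates in which a and b are grouped."""
    rng = random.Random(seed)
    hits = sum(grouped(resample_columns(aln, rng), a, b) for _ in range(trials))
    return hits / trials


# Hypothetical toy alignment (not the assignment data).
toy = {
    "Aly":   "ATGGCTCCAGGAACTTTGAC",
    "Ath-1": "ATGGCACCAGGTACTTTGAC",
    "Ath-2": "ATGGCACCAGGTACTTTGAT",
    "Ath-3": "ATGGCTCCAGGTACTCTGAC",
}

support = bootstrap_support(toy, "Ath-1", "Ath-2", trials=200, seed=111)
print(f"Ath-1/Ath-2 grouped in {support:.0%} of replicates")
```

Changing the seed changes which columns are drawn, which is why the handout asks you to vary the seed when you reanalyze the same data set.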

f. Repeat steps c and d above for the protein file. Save the file as f3h-aa-tree-graphic (as an example).

g. Be sure that you perform the tree analysis for both nucleotide and amino acid data sets.

7. Population genetic analysis of the gene

a. The last step of the assignment is to perform a comparative analysis of the data. This will be done using the DnaSP software. This software can calculate several important population genetics statistics that can give an indication of the variability of the gene sequences and the types of selection pressure the gene is undergoing. Open the DnaSP software from the location in which you placed it.

b. Go to File/Open Data File #1. For example, choose the file f3h-nt.pir. Click Close on the Data Information box.

c. Rather than recording the results of each analysis as you go, DnaSP can create an output file in which you can store all of your analyses. You can then go back to this text file later to collect the data you need for your report. Go to File/Send All Output to File. Name your file f3h-dna, for example, and hit Save. The program will append the file type extension .out to this file name. Click OK in the pop-up box.

d. To view the data, go to Display/View Data. This will display your data. The display is restricted to just 28 nucleotides.

e. Go to Analysis/DNA Polymorphism. Click OK. Study this output (but remember the data will also be found in your output file). You will also use this information as you fill in the table described in section 9 below.

   Haplotypes can be defined as a unique set of sequences. From a genetics perspective, these are nucleotide variants that reside so close to each other that they are inherited as a unit. If two sequences are identical, then they are considered to be the same haplotype. Nucleotide variation can be expressed as diversity or polymorphism. Diversity (π) is estimated as the average number of nucleotide substitutions per site when two sequences are compared.
The value DnaSP reports is the average over all pairwise sequence comparisons. [To read more about this and other analyses performed by DnaSP, go to Help/Contents (or hit the F2 key) and click on the analysis of interest.] Polymorphism (θ) is estimated from the proportion of nucleotide sites that are polymorphic. The statistic considers all of the data as a group. Because it is expressed on a per-site basis, it is appropriate to use this statistic to compare variation at two different genes.

f. Go to Analysis/Tajima's Tests. Click OK. This analysis provides an indication of whether the gene sequence is undergoing selection or whether the variation was generated by mutation and drift (the neutral theory). Typically, this is performed within a species, but it is such an important test that you should perform it on your data set to gain some understanding of the important principles underlying it.

   Darwin's landmark book made the case for natural selection as the driving force for the generation of variation. This theory states that in a large population experiencing fitness and viability constraints because of limited resources in the environment, those alleles which provide the population with an advantage in the environment will appear at a higher frequency (be selected) in the next generation. After many generations, that allele will become the predominant allele in that population.

   This is in contrast to the neutral theory, which states that a balance between mutation and genetic drift leads to the fixation of nucleotide changes. This theory suggests that the random processes that drive genetic drift (and not selection) lead to the increase in frequencies of specific alleles in the subsequent generation. The cumulative sum of these effects over generations leads to the fixation of certain mutation events.

   A number of statistical tests have been developed to determine if a gene is evolving neutrally. One popular test of neutrality based on nucleotide data is Tajima's D. Without going into great depth, this statistic is a normalized difference between diversity and polymorphism. This value can be either positive (if π > θ) or negative (if π < θ).
If the D value is too large or too small (based on the probability of obtaining a specific value), then the assumption of neutrality at the locus is rejected. It is then generally assumed that this is one indication that the locus is evolving by selection. But it is important to note that other population effects, such as bottlenecks, can also lead to significant D values.

g. Go to Analysis/Fu and Li's (and other) Tests. Click OK. This is another test of neutrality. We discussed this in class, and you have read about it in the papers we discussed in class.

h. Go to Analysis/Recombination. Click OK. This is a test for the number of recombination events, which can be defined based on four-gamete types. The four gametes represent the two parental and the two recombinant chromosome types. A computer simulation is used to determine the minimum number of recombination events. Evidence of recombination indicates that ancestral populations must have existed that represented the variation observed in this population.
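The four-gamete idea in step h can be made concrete with a small Python sketch. For every pair of sites that each show exactly two nucleotides, it checks whether all four two-site combinations (the two "parental" and the two "recombinant" types) occur among the sequences. This is only an illustration with made-up sequences; the function names are hypothetical, and DnaSP's own recombination analysis does more than count these pairs.

```python
from itertools import combinations


def biallelic_sites(seqs):
    """Indices of alignment columns with exactly two nucleotide states (no gaps)."""
    sites = []
    for i in range(len(seqs[0])):
        column = {s[i] for s in seqs}
        if len(column) == 2 and column <= set("ACGT"):
            sites.append(i)
    return sites


def four_gamete_pairs(seqs):
    """Pairs of biallelic sites at which all four two-site gametes are observed."""
    pairs = []
    for i, j in combinations(biallelic_sites(seqs), 2):
        gametes = {(s[i], s[j]) for s in seqs}
        if len(gametes) == 4:
            pairs.append((i, j))
    return pairs


# Toy data: sites 0 and 1 are polymorphic; sites 2 and 3 are not.
with_recombinants = ["AACG", "ATCG", "TACG", "TTCG"]  # AA, AT, TA, TT all present
parentals_only = ["AACG", "AACG", "TTCG", "TTCG"]     # only AA and TT present

print(four_gamete_pairs(with_recombinants))  # [(0, 1)]
print(four_gamete_pairs(parentals_only))     # []
```

The number of such site pairs is what you will enter under "# four gamete type site pairs" in the table in section 9, although the value to report is the one DnaSP gives you.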

i. Go to File/Close Output File: f3h-dna.out. This closes your data output file. You can now read that file in TextPad (or another program that reads .txt files).

j. Make sure you complete these analyses for both genes. You need to analyze only the nucleotide data.

8. Important geographic information

a. You are to determine if there is some geographic relationship among your genotypes for each nucleotide or amino acid sequence. The following table provides information regarding the collection location for each of the genotypes analyzed in this data set.

Table 1. Geographic collection site for each of the Arabidopsis genotypes analyzed in this assignment.

   Genotype               Country of origin
   A. lyrata              Russia
   A. thaliana CAN-0      Canary Islands, Spain
   A. thaliana CHA-0      Champex, Switzerland
   A. thaliana COL-2      Landsberg, Poland
   A. thaliana COND       Condara, Khurmatov, Tadjikistan
   A. thaliana CVI-0      Cape Verdi
   A. thaliana GR-5       Graz, Austria
   A. thaliana ITA-0      Ibel Tazekka, Morocco
   A. thaliana KAS-1      Kashmir, India
   A. thaliana LA-0       Landsberg, Poland
   A. thaliana ME-0       Mechtshausen, Germany
   A. thaliana MH-0       Muehlen, Poland
   A. thaliana MR-0       Monte/Tosso, Italy
   A. thaliana NC-1       Ville-en-Vermois, France
   A. thaliana PER-1      Perm, Russia
   A. thaliana RI-0       Richmond, British Columbia, Canada
   A. thaliana RSCH-0     Rschew/Starizd, Russia
   A. thaliana RUB-1      Rubezhnoe, Ukraina
   A. thaliana TUL-0      Turk Lake, Florida
   A. thaliana WS-0       Vasljevici/Drijept, Byelorussia
   A. thaliana YO-0       Yosemite, California, USA

9. Complete this data file

Your output from the DnaSP program will contain a tremendous amount of data that will assist with your analysis of these data sets. From the output file, you will be able to complete the following table.

                                        f3h        fah
   Number sequences
   Number total sites
   Polymorphic sites
   Haplotypes
   Nucleotide variation
      Diversity (π/bp)
      Polymorphism (θ/bp)
   Neutrality tests
      Tajima's D
      Fu and Li's D*
      Fu and Li's F*
   Recombination
      # four gamete type site pairs
      R_M (b)
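If you want to check that you understand where several of the table entries come from, the Python sketch below computes the number of polymorphic sites, the number of haplotypes, π and θ per site, and Tajima's D for a toy alignment. It assumes equal-length, gap-free sequences and uses the standard constants from Tajima (1989); the function name and the toy data are hypothetical. DnaSP treats gaps and missing data in its own way, so its values for the real data may differ, and the DnaSP output is what belongs in your report.

```python
from itertools import combinations
from math import sqrt


def summary_stats(seqs):
    """S, haplotype count, pi and theta (per site), and Tajima's D.

    Assumes aligned, equal-length, gap-free sequences (a simplification).
    """
    n, length = len(seqs), len(seqs[0])
    # Polymorphic (segregating) sites: columns with more than one state.
    S = sum(len({s[i] for s in seqs}) > 1 for i in range(length))
    haplotypes = len(set(seqs))
    # k: average number of differences over all pairs of sequences.
    diffs = [sum(x != y for x, y in zip(a, b)) for a, b in combinations(seqs, 2)]
    k = sum(diffs) / len(diffs) if diffs else 0.0
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    theta = S / a1 if a1 else 0.0  # Watterson's estimator, per sequence
    D = None  # undefined without polymorphism or with fewer than 4 sequences
    if S > 0 and n >= 4:
        b1 = (n + 1) / (3 * (n - 1))
        b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
        c1 = b1 - 1 / a1
        c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
        e1 = c1 / a1
        e2 = c2 / (a1**2 + a2)
        D = (k - theta) / sqrt(e1 * S + e2 * S * (S - 1))
    return {
        "n": n,
        "sites": length,
        "S": S,
        "haplotypes": haplotypes,
        "pi": k / length,         # diversity per site
        "theta": theta / length,  # polymorphism per site
        "D": D,
    }


# Hypothetical toy alignment (not the assignment data).
toy = ["AAAA", "AAAT", "AATT", "ATTT"]
for name, value in summary_stats(toy).items():
    print(name, value)
```

Note that D is computed from the difference between the average pairwise differences and θ, so it is positive when π exceeds θ and negative when θ exceeds π, as described in step 7.f.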