Bioinformatics. Part 8. Sequence Analysis An introduction. Mahdi Vasighi


Sequence analysis
Some of the earliest problems in genomics concerned how to measure the similarity of DNA and protein sequences, whether within a genome, across the genomes of different individuals, or across the genomes of different species.
Why is sequence similarity important?
- Similar sequence = similar information = same ancestor
- Similar sequence = similar structure = similar function

Homology of sequences
Homology between protein or DNA sequences is defined in terms of shared ancestry. Two segments of DNA can have shared ancestry because of either:
- a speciation event (orthologs), or
- a duplication event (paralogs)

Mutation
Mutations are changes in a genomic sequence: the DNA sequence of a cell's genome, or the DNA or RNA sequence of a virus.
Causes:
- Environmental factors
  - Radiation
  - Chemicals
- Mistakes in replication or repair

Mutation
Classification of mutation types:
- By effect on structure
  - Small scale mutations
  - Large scale mutations
- By impact on protein sequence

Mutation
Classification by effect on structure: small scale mutations, point mutations
Point mutations, often caused by chemicals or by malfunction of DNA replication, exchange a single nucleotide for another.
Transition
AGGCGTATTGCATCGTTAAACGCGC
AGATGTATTGCGTCGCTAAACGCGC
The most common point mutation is the transition, which exchanges a purine for a purine (A ↔ G) or a pyrimidine for a pyrimidine (C ↔ T).

Mutation
Classification by effect on structure: small scale mutations, point mutations
Point mutations, often caused by chemicals or by malfunction of DNA replication, exchange a single nucleotide for another.
Transversion
AGGCGTATTGCATCGTTAAACGCGC
AGTGGTATTGCCTCGATAAACGCGC
Less common is the transversion, which exchanges a purine for a pyrimidine or a pyrimidine for a purine (C/T ↔ A/G).
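The purine/pyrimidine rule above is easy to check mechanically. The following is a minimal Python sketch, not part of the original slides; the function names `classify_substitution` and `count_substitutions` are illustrative, and it assumes two DNA sequences of equal length that differ only by substitutions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    for base in (ref, alt):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if ref == alt:
        raise ValueError("bases are identical, so there is no substitution")
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(original, mutated):
    """Count transitions and transversions between two equal-length sequences."""
    if len(original) != len(mutated):
        raise ValueError("sequences must have the same length")
    counts = {"transition": 0, "transversion": 0}
    for ref, alt in zip(original.upper(), mutated.upper()):
        if ref != alt:
            counts[classify_substitution(ref, alt)] += 1
    return counts


original = "AGGCGTATTGCATCGTTAAACGCGC"
print(count_substitutions(original, "AGATGTATTGCGTCGCTAAACGCGC"))  # transition example
print(count_substitutions(original, "AGTGGTATTGCCTCGATAAACGCGC"))  # transversion example
```

Applied to the two slide examples, the first mutated sequence contains only transitions and the second only transversions.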


Mutation
Classification by impact on protein sequence: small scale mutations, point mutations
Point mutations that occur within the protein-coding region of a gene are classified by what the altered codon codes for.
Silent mutation
 M   R   I   A   S   L   N  Stop
ATG CGT ATT GCA TCG TTA AAC TAA C...
ATG CGT ATC GCA TCA TTG AAC TAA C...
 M   R   I   A   S   L   N  Stop
The new codon is translated to the same amino acid.

Mutation
Classification by impact on protein sequence: small scale mutations, point mutations
Point mutations that occur within the protein-coding region of a gene are classified by what the altered codon codes for.
Missense mutations
 M   R   I   A   S   L   N  Stop
ATG CGT ATT GCA TCG TTA AAC TAA C...
ATG CGT ACT GCA TTG TTA AAC TAA C...
 M   R   T   A   L   L   N  Stop
The new codon is translated to a different amino acid.

Mutation
Classification by impact on protein sequence: small scale mutations, point mutations
Point mutations that occur within the protein-coding region of a gene are classified by what the altered codon codes for.
Nonsense mutations
 M   R   I   A   S   L   N  Stop
ATG CGT ATT GCA TCG TTA AAC TAA C...
ATG CGT ATT GCA TCG TAA AAC TAA C...
 M   R   I   A   S  Stop
The new codon is translated as a stop, which can truncate the protein.

Mutation
Classification by impact on protein sequence: small scale mutations, point mutations
Point mutations that occur within the protein-coding region of a gene are classified by what the altered codon codes for.
Conservative mutations
 M   R   I   A   S   L   N  Stop
ATG CGT ATT GCA TCG TTA AAC TAA C...
ATG CGT ATT GCA TCG TTC AAC TAA C...
 M   R   I   A   S   F   N  Stop
A conservative mutation is a change in a DNA or RNA sequence that leads to the replacement of one amino acid with a biochemically similar one. Here leucine (L) is replaced by phenylalanine (F); both are non-polar (hydrophobic).
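The four categories above can be reproduced by translating the old and new codon and comparing the results. The sketch below is an illustration in Python, not part of the original slides. It uses the standard genetic code; the grouping of amino acids into non-polar, polar, acidic and basic classes in `PROPERTY_GROUPS` is one common textbook grouping chosen here as an assumption, and other groupings would change which missense mutations count as conservative.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Assumed biochemical grouping, used only to flag conservative changes.
PROPERTY_GROUPS = {
    "nonpolar": set("GAVLIMFWP"),
    "polar": set("STCYNQ"),
    "acidic": set("DE"),
    "basic": set("KRH"),
}


def group_of(amino_acid):
    """Return the name of the property group, or None for a stop codon."""
    for name, members in PROPERTY_GROUPS.items():
        if amino_acid in members:
            return name
    return None


def classify_codon_change(old_codon, new_codon):
    """Classify a codon change as silent, nonsense, conservative or missense."""
    old_aa = CODON_TABLE[old_codon.upper()]
    new_aa = CODON_TABLE[new_codon.upper()]
    if old_aa == new_aa:
        return "silent"
    if new_aa == "*":
        return "nonsense"
    # Loss of a stop codon is reported as plain missense in this sketch.
    if old_aa != "*" and group_of(old_aa) == group_of(new_aa):
        return "conservative"
    return "missense"


def compare_coding_sequences(original, mutated):
    """Return (codon_index, old, new, effect) for every codon that differs."""
    original = original.replace(" ", "").upper()
    mutated = mutated.replace(" ", "").upper()
    if len(original) != len(mutated):
        raise ValueError("sequences must have the same length")
    changes = []
    for start in range(0, len(original) - len(original) % 3, 3):
        old, new = original[start:start + 3], mutated[start:start + 3]
        if old != new:
            changes.append((start // 3, old, new, classify_codon_change(old, new)))
    return changes


reference = "ATG CGT ATT GCA TCG TTA AAC TAA"
for label, mutant in [
    ("silent slide", "ATG CGT ATC GCA TCA TTG AAC TAA"),
    ("missense slide", "ATG CGT ACT GCA TTG TTA AAC TAA"),
    ("nonsense slide", "ATG CGT ATT GCA TCG TAA AAC TAA"),
    ("conservative slide", "ATG CGT ATT GCA TCG TTC AAC TAA"),
]:
    print(label, compare_coding_sequences(reference, mutant))
```

Codon indices are zero-based, so index 2 refers to the third codon (ATT in the reference).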

Mutation
Classification by effect on structure: large scale mutations
Large scale mutations occur in chromosomal structure.


Sequence Alignment
Sequence comparison can be used:
- To establish evolutionary relationships among organisms
- To identify functionally conserved sequences
- To find structural similarity
- To identify corresponding genes in organisms
Sequence alignment is the arrangement of two or more sequences (DNA, RNA or protein) by searching for a series of individual characters or character patterns that are in the same order in the sequences.
ATGCGTATTGCATCGTTAAACTAA
ATGCGTATTGCA---TTAAACTAA
AT-CGT---GCATCGTTAAACTAA

Sequence Alignment
ACGTCTAG
ACTCTAG-
2 matches, 5 mismatches, 1 not aligned

ACGTCTAG
-ACTCTAG
5 matches, 2 mismatches, 1 not aligned

ACGTCTAG
AC-TCTAG
7 matches, 0 mismatches, 1 not aligned

This seemingly simple alignment operation is not as simple as it sounds!
...aactgagtttacgctcataga...
T---CT-A--G
How can we measure the distance between two strings or biological sequences? Edit distance.
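The counts on this slide, and the idea of an edit distance, can be sketched in a few lines of Python. This is an illustration, not part of the original slides. The slide does not say which edit distance it means; the sketch assumes the common Levenshtein definition (minimum number of single-character substitutions, insertions and deletions), and the function names are illustrative.

```python
def alignment_counts(row1, row2, gap="-"):
    """Return (matches, mismatches, gapped columns) for two aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            gaps += 1          # column where one sequence is not aligned
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


def edit_distance(s, t):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(t) + 1))   # distance from "" to each prefix of t
    for i, a in enumerate(s, start=1):
        current = [i]                    # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            current.append(min(
                previous[j] + 1,         # delete a
                current[j - 1] + 1,      # insert b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


for row2 in ("ACTCTAG-", "-ACTCTAG", "AC-TCTAG"):
    print(row2, alignment_counts("ACGTCTAG", row2))
print(edit_distance("ACGTCTAG", "ACTCTAG"))
```

The third placement of the gap gives the most matches, and the edit distance between the two ungapped sequences is 1, corresponding to that single gap.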

Sequence Alignment
There are two types of sequence alignment:
1. Global alignment
TTGCGTATTGCATCG
TTGCCTTTTCCAT--
2. Local alignment
-----TACG------
-----TTCG------

Sequence Alignment
Our approach is guided by biology: it is possible for evolutionarily related proteins and nucleic acids to display substitutions at particular positions.
- Substitution (point mutation)
- Insertion of short segments
- Deletion of short segments
- Segmental duplication
- Inversion
- Translocation
GTATTGCATCGTTAAA
GTATTGCA---TTAAA
Insertions and/or deletions are called indels. When comparing two genes, it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless ancestry is known.

Pairwise Sequence Alignment
Alignment of two sequences is performed using the following methods:
1. Dot matrix analysis
2. The dynamic programming (DP) algorithm
3. Word or k-tuple methods, such as those used in BLAST and FASTA

Pairwise Sequence Alignment: Dot matrix analysis
A dot matrix analysis is primarily a method for comparing two sequences to look for possible alignment of characters between the sequences. The method is also used for:
- finding direct or inverted repeats in protein and DNA sequences
- predicting regions in RNA that are self-complementary

Pairwise Sequence Alignment: Dot matrix analysis
Sequences compared: ATGCCCATAG and TTGCATAG
Any region of similar sequence is revealed by a diagonal row of dots. Isolated dots not on the diagonal represent random matches that are probably not related to any significant alignment.
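A dot matrix is simple to build by hand: place one sequence along the rows, the other along the columns, and mark every cell where the two characters are equal. The following Python sketch is an illustration, not part of the original slides; `dot_matrix` and `render` are illustrative names, and the text rendering stands in for the graphical plot.

```python
def dot_matrix(seq1, seq2):
    """Return a len(seq1) x len(seq2) grid of booleans, True where characters match."""
    return [[a == b for b in seq2] for a in seq1]


def render(matrix, seq1, seq2):
    """Draw the matrix as text: seq2 across the top, seq1 down the side."""
    lines = ["  " + " ".join(seq2)]
    for base, row in zip(seq1, matrix):
        lines.append(base + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


seq1, seq2 = "ATGCCCATAG", "TTGCATAG"
print(render(dot_matrix(seq1, seq2), seq1, seq2))
```

In the printed grid, the shared segments TGC and ATAG appear as two diagonal runs of `*`, while the remaining `*` characters are isolated chance matches.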

Pairwise Sequence Alignment: Dot matrix analysis
Detection of matching regions may be improved by filtering out random matches in a dot matrix by defining a window.
Sequences compared: TTGCATAG and ATGCCCATAG
Window size = 3, stringency = 2
A typical window size for DNA sequences is 15, and a suitable match requirement in this window is 10. For protein sequences, the matrix is often not filtered, but a window size of 3 and a match requirement of 2 will highlight matching regions.
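The window filter can be sketched as follows in Python. This is an illustration, not part of the original slides, and it assumes one particular convention: a dot is placed at the start position of a window when at least `stringency` of the `window` characters along the diagonal match. Plotting tools may anchor the dot elsewhere in the window (for example at its center), so the function `windowed_dot_matrix` shows the idea rather than the behavior of any specific tool.

```python
def windowed_dot_matrix(seq1, seq2, window=3, stringency=2):
    """Return a grid where cell (i, j) is True when the diagonal window
    starting at seq1[i], seq2[j] contains at least `stringency` matches."""
    if not 1 <= stringency <= window:
        raise ValueError("stringency must be between 1 and the window size")
    rows = len(seq1) - window + 1   # number of window start positions in seq1
    cols = len(seq2) - window + 1   # number of window start positions in seq2
    return [
        [
            sum(seq1[i + k] == seq2[j + k] for k in range(window)) >= stringency
            for j in range(cols)
        ]
        for i in range(rows)
    ]


seq1, seq2 = "ATGCCCATAG", "TTGCATAG"
for stringency in (2, 3):
    grid = windowed_dot_matrix(seq1, seq2, window=3, stringency=stringency)
    print(f"window=3, stringency={stringency}")
    for row in grid:
        print(" ".join("*" if hit else "." for hit in row))
```

Raising the stringency removes more chance matches but also removes true diagonals that contain a mismatch, which is the trade-off behind the suggested 15/10 setting for DNA.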

Pairwise Sequence Alignment: Dot matrix analysis
Sequences compared: ATGCCCATAG and TTGCATAG
Alignment read from the diagonals:
ATGCCCATAG
TTGC--ATAG

Pairwise Sequence Alignment: Dot matrix analysis
Sequences compared: GCTAGTCG and ATGCTGATCG
Alignment read from the diagonals:
ATGCT-GATCG
--GCTAG-TCG

Pairwise Sequence Alignment: Dot matrix analysis
Seq 1: GCTGAT
Seq 2: GCTAGT
Placing a gap in each sequence (Seq 1: GCT-GAT, Seq 2: GCTAG-T) gives the alignment:
GCT-GAT
GCTAG-T

Pairwise Sequence Alignment: Dot matrix analysis
Sequences compared: ATCGTGATCG and ATCGTGATCG (a sequence against itself)

Pairwise Sequence Alignment: Dot matrix analysis
[Figure: dot plots with axes labeled Gene 1 and Gene 2]

Pairwise Sequence Alignment: Dot matrix analysis
MATLAB (matrix laboratory) is a high-level language and interactive environment that enables you to perform computationally intensive tasks faster than with traditional programming languages such as C, C++, and Fortran.
Bioinformatics Toolbox offers an integrated software environment for genome and proteome analysis. In particular, it provides access to genomic and proteomic data formats, analysis techniques, and specialized visualizations for genomic and proteomic sequence and microarray analysis.

Pairwise Sequence Alignment: Dot matrix analysis
getgenbank
Purpose: Retrieve sequence from GenBank database
Syntax: Data = getgenbank('accessionnumber', 'PropertyName', PropertyValue...)
Accession number: unique identifier for a sequence record.
Example:
    S = getgenbank('m10051')
    S = getgenbank('m10051', 'sequence', true)
Output of S = getgenbank('m10051'):
    S =
                   LocusName: 'HUMINSR'
         LocusSequenceLength: '4723'
        LocusNumberofStrands: ''
               LocusTopology: 'linear'
           LocusMoleculeType: 'mrna'
        LocusGenBankDivision: 'PRI'
       LocusModificationDate: '06-JAN-1995'
                  Definition: 'Human insulin receptor mrna, complete cds.'
                   Accession: 'M10051'
                     Version: 'M10051.1'
                          GI: '186439'
                    Keywords: 'insulin receptor; tyrosine kinase.'
                     Segment: []
                      Source: 'Homo sapiens (human)'
              SourceOrganism: [3x65 char]
                   Reference: {[1x1 struct]}
                     Comment: [14x67 char]
                    Features: [51x74 char]
                         CDS: [139 4287]
                    Sequence: [1x4723 char]
                   SearchURL: [1x105 char]
                 RetrieveURL: [1x95 char]

Pairwise Sequence Alignment: Dot matrix analysis
getembl
Purpose: Retrieve sequence information from EMBL database
Syntax: EMBLData = getembl(accessionnumber)
Example:
    emblout = getembl('X00558')
    emblout = getembl('X00558', 'ToFile', 'c:\project\rat_protein.txt')
Output of emblout = getembl('x00558'):
    emblout =
                Identification: [1x1 struct]
                     Accession: 'X00558'
               SequenceVersion: 'X00558.1'
                   DateCreated: '13-JUN-1985 (Rel. 06, Created)'
                   DateUpdated: [1x46 char]
                   Description: 'Rat liver apolipoprotein A-I mrna (apoa-i)'
                       Keyword: [1x44 char]
               OrganismSpecies: 'Rattus norvegicus (Norway rat)'
        OrganismClassification: [3x75 char]
                     Organelle: ''
                     Reference: {[1x1 struct]}
        DatabaseCrossReference: ''
                      Comments: ''
                      Assembly: ''
                       Feature: [22x75 char]
                     BaseCount: [1x1 struct]
                      Sequence: [1x877 char]
                   RetrieveURL: [1x64 char]

getpdb
Purpose: Retrieve protein structure data from Protein Data Bank (PDB) database
Syntax: PDBStruct = getpdb(pdbid)
Example:
    pdbstruct = getpdb('5CYT')
Output of pdbstruct = getpdb('5cyt'):
    pdbstruct =
                Header: [1x1 struct]
                 Title: 'REFINEMENT OF MYOGLOBIN AND CYTOCHROME C'
              Compound: [4x23 char]
                Source: [4x38 char]
              Keywords: 'ELECTRON TRANSPORT (HEME PROTEIN)'
        ExperimentData: 'X-RAY DIFFRACTION'
               Authors: 'T.TAKANO'
          RevisionDate: [1x2 struct]
            Superseded: [1x1 struct]
               Journal: [1x1 struct]
               Remark1: [1x1 struct]
               Remark2: [1x1 struct]
               Remark3: [1x1 struct]
               Remark4: [2x59 char]
             Remark100: [2x59 char]
             Remark200: [49x59 char]
             Remark280: [6x59 char]

Pairwise Sequence Alignment: Dot matrix analysis
seqdotplot
Purpose: Create dot plot of two sequences
Syntax:
    seqdotplot(seq1, seq2)
    seqdotplot(seq1, seq2, Window, Number)
Window: an integer for the size of a window.
Number: an integer for the number of characters within the window that match.
Example: Prion protein is a small protein found in high quantity in the brain of animals infected with mad-cow disease.
    moufflon = getgenbank('ab060288','sequence',true);
    takin = getgenbank('ab060290','sequence',true);
    seqdotplot(moufflon,takin,11,7)


Pairwise Sequence Alignment: Dot matrix analysis
http://myhits.isb-sib.ch/cgi-bin/dotlet

Pairwise Sequence Alignment: Dot matrix analysis
nt2aa
Purpose: Convert nucleotide sequence to amino acid sequence
Syntax: SeqAA = nt2aa(seqnt)
Example:
    S = getgenbank('m10051')
    p = nt2aa(S.Sequence([2699:2885, 3673:3818]))
    >> S.CDS
    ans =
           location: 'join(2699..2885,3673..3818)'
               gene: 'INS'
            product: 'insulin'
        codon_start: '1'
            indices: [2699 2885 3673 3818]
         protein_id: 'AAA59173.1'
            db_xref: 'GI:386829'
               note: ''
        translation: [1x110 char]
               text: [9x58 char]

Pairwise Sequence Alignment: Dot matrix analysis
nt2int
Purpose: Convert nucleotide sequence from letter to integer representation
Syntax: SeqInt = nt2int(seqchar)
Example:
    s = nt2int('actgctagc')

randseq
Purpose: Generate random sequence from finite alphabet
Syntax: Seq = randseq(seqlength)
Example:
    S = randseq(20)
    S = randseq(20, 'alphabet', 'amino')

Pairwise Sequence Alignment: Dot matrix analysis
molviewer
Purpose: Display and manipulate 3-D molecule structure
Syntax:
    molviewer
    molviewer(pdbid)
Example:
    molviewer
    molviewer('5cyt')

Pairwise Sequence Alignment: Dot matrix analysis
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--

Which one is better?
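One way to answer the question is to give every alignment a numerical score and prefer the higher one. The Python sketch below is an illustration, not part of the original slides. The scoring scheme (match +1, mismatch -1, gap -1) is an assumption chosen for simplicity; the slides do not specify one, and real scoring schemes typically use substitution matrices and separate gap opening and extension penalties.

```python
def alignment_score(row1, row2, match=1, mismatch=-1, gap=-1):
    """Sum a per-column score over two aligned rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += gap        # column with a gap in either sequence
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


first = ("-GCGC-ATGGATTGAGCGA",
         "TGCGCCATTGAT-GACC-A")
second = ("------GCGCATGGATTGAGCGA",
          "TGCGCC----ATTGATGACCA--")

print("first alignment: ", alignment_score(*first))
print("second alignment:", alignment_score(*second))
```

Under this assumed scheme the first alignment scores higher, because it has more matching columns and far fewer gapped columns than the second.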

Exercise: Find the coding sequences of hemoglobin subunit beta for human, chimpanzee and rat. Analyze them using the MATLAB dot plot tool (seqdotplot).