Profiles. Evolutionarily related sequences (orthologs and paralogs) are often identified with local alignment programs like BLAST, FASTA, and SSEARCH.
1 Profiles
Tore Samuelsson, Nov 9

Background
Evolutionarily related sequences (orthologs and paralogs) are often identified with local alignment programs like BLAST, FASTA, and SSEARCH. However, these methods are not always sufficient. In many cases the amino acid sequences of related proteins have diverged significantly, although the fold of the proteins is preserved.
2 Amino acid sequences may change rapidly during evolution although the 3D structure is preserved
(the percent identity shown beside each sequence is not legible in this transcript)
Species A  ATGGCAAAACTTGAAAAACTGAATCAAGCAGGCCTGATGGTCGCTGGT
           M A K L E K L N Q A G L M V A G
Species B  ATGGCTAGGTTGGAGAAGAUAAACCAAGCTGGGATAATAGTTGCAGGA
           M V R L E K I N Q A G L L V A G
Species C  M V R I Q K I N E K G A L L A G
Species D  Q V R I Q K I Y E K G A L L A A  ("twilight zone")
Species E  Q V R I Q K I Y E K T A L L F A  ("midnight zone")

In a BLAST search, evolutionarily related proteins may have very poor E-values.
Sequences producing significant alignments (score and E-value columns not legible in this transcript):
SRC_HUMAN (P9) Proto-oncogene tyrosine-protein kinase Src (E...
YES_HUMAN (P9) Proto-oncogene tyrosine-protein kinase Yes (E...
FYN_HUMAN (P) Proto-oncogene tyrosine-protein kinase Fyn (E...
FGR_HUMAN (P99) Proto-oncogene tyrosine-protein kinase FGR (E...
HCK_HUMAN (P8) Tyrosine-protein kinase HCK (EC...) (p...
LCK_HUMAN (P9) Proto-oncogene tyrosine-protein kinase LCK (E...
LYN_HUMAN (P98) Tyrosine-protein kinase Lyn (EC...)
BLK_HUMAN (P) Tyrosine-protein kinase BLK (EC...) (B...
FRK_HUMAN (P8) Tyrosine-protein kinase FRK (EC...) (N...
SHC_HUMAN (Q99) SHC transforming protein (SH domain prote...
SHA_HUMAN (Q9NP) SH domain protein A (T cell-specific adap...
CHIN_HUMAN (P88) N-chimaerin (NC) (N-chimerin) (Alpha chimeri...
APS_HUMAN (O9) SH and PH domain-containing adapter protein...
CISH_HUMAN (Q9NSE) Cytokine-inducible SH-containing protein (C...
SOCS_HUMAN (O) Suppressor of cytokine signaling (SOCS-)
CHIO_HUMAN (P) Beta-chimaerin (Beta-chimerin) (Rho-GTPase-a
LIPL_HUMAN (P88) Lipoprotein lipase precursor (EC...) (
SOCS_HUMAN (O9) Cytokine inducible SH-containing protein...
ATPF_HUMAN (Q8NM) ATP synthase mitochondrial F complex assem...
TENS_HUMAN (Q9HBL) Tensin
FBWA_HUMAN (Q9Y9) F-box/WD-repeat protein A (F-box and WD-re...
STAT_HUMAN (P) Signal transducer and activator of transcri...
SOCS_HUMAN (O) Suppressor of cytokine signaling (SOCS-)...

Profile-based searches are more effective at identifying remote sequence similarity.
3 Profiles are generated from multiple alignments, and they are one of many applications of multiple alignments:
* Profiles
* Identify conserved motifs - patterns (PROSITE)
* Phylogenetic studies
* Prediction of protein secondary structure
Multiple alignments are generated by methods like ClustalW and T-Coffee.

Terminology
Profile. A matrix where the numbers reflect the probabilities of characters appearing in a certain position in a multiple alignment.
PSSM. "Position-specific scoring matrix". More or less synonymous with 'profile', but sometimes a 'profile' refers to a matrix where gaps are also taken into account. Sometimes also called a weight matrix or frequency matrix.
4 Position-Specific Scoring Matrix (PSSM)
Multiple alignment of 5' splice site sequences:
 1 GAGGTAAAC
 2 TCCGTAAGT
 3 CAGGTTGGA
 4 ACAGTCAGT
 5 TAGGTCATT
 6 TAGGTACTG
 7 ATGGTAACT
 8 CAGGTATAC
 9 TGTGTGAGT
10 AAGGTAAGT

Calculate the absolute frequency of each nucleotide at each position:
     1   2   3   4   5   6   7   8   9
A    3   6   1   0   0   6   7   2   1
C    2   2   1   0   0   2   1   1   2
G    1   1   7  10   0   1   1   5   1
T    4   1   1   0  10   1   1   2   6
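The counting step above can be sketched in a few lines of Python. This is only an illustration: the names `SITES` and `count_matrix` are made up for the example and do not come from the slides.

```python
from collections import Counter

# The ten 5' splice site sequences from the alignment above.
SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def count_matrix(seqs, alphabet="ACGT"):
    """Return {base: [count at position 1, count at position 2, ...]}."""
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")
    matrix = {base: [0] * length for base in alphabet}
    for column in range(length):
        # Tally the bases seen in this alignment column.
        tally = Counter(seq[column] for seq in seqs)
        for base in alphabet:
            matrix[base][column] = tally[base]
    return matrix


counts = count_matrix(SITES)
for base, row in counts.items():
    print(base, row)
```

Each column of the result sums to the number of aligned sequences, which is a quick sanity check on the matrix.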
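5 Calculate the relative frequency of each nucleotide at each position:
      1    2    3    4    5    6    7    8    9
A   0.3  0.6  0.1  0.0  0.0  0.6  0.7  0.2  0.1
C   0.2  0.2  0.1  0.0  0.0  0.2  0.1  0.1  0.2
G   0.1  0.1  0.7  1.0  0.0  0.1  0.1  0.5  0.1
T   0.4  0.1  0.1  0.0  1.0  0.1  0.1  0.2  0.6

What is the probability of finding CAGGTTGGA? The product of the frequency of each nucleotide at each position:
0.2 * 0.6 * 0.7 * 1 * 1 * 0.1 * 0.1 * 0.5 * 0.1 = 0.000042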
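As a rough sketch of the calculation above, the relative frequencies and the probability of a given 9-mer can be computed as follows. The function names are hypothetical and chosen only for this example.

```python
import math

# The ten 5' splice site sequences from the alignment.
SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def frequency_matrix(seqs, alphabet="ACGT"):
    """Relative frequency of each base at each alignment position."""
    n = len(seqs)
    length = len(seqs[0])
    return {
        base: [sum(seq[col] == base for seq in seqs) / n for col in range(length)]
        for base in alphabet
    }


def sequence_probability(seq, freqs):
    """Product of the per-position frequencies of the bases in seq."""
    return math.prod(freqs[base][col] for col, base in enumerate(seq))


freqs = frequency_matrix(SITES)
p = sequence_probability("CAGGTTGGA", freqs)
print(f"P(CAGGTTGGA) = {p:.1e}")
```

For CAGGTTGGA this gives roughly 4.2e-05. Any sequence with a base that never occurs at some position gets probability zero, which is the problem pseudocounts are meant to address.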
6 Compute the log odds ratios log(M_ij / P_i), where
M_ij = probability of nucleotide i at position j
P_i = background probability of nucleotide i
For this example we assume P_i = 0.25.
[Table: PSSM of log-odds scores for A, C, G, T at positions 1-9; values not legible in this transcript]

Scoring with a PSSM
We want to analyze the sequence GTAGTAGAAGGTAAGTGTCCGTAG with the profile. We examine a window the size of the profile and move it along the sequence one position at a time.
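A minimal sketch of the log-odds conversion, under two assumptions that are not stated in the transcript: the logarithm is taken to base 2, and a frequency of zero is mapped to minus infinity (no pseudocounts yet). The uniform background of 0.25 follows the example above.

```python
import math

# The ten 5' splice site sequences from the alignment.
SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def log_odds_matrix(seqs, background=0.25, alphabet="ACGT"):
    """log2(M_ij / P_i) for every base i and position j.

    Assumption: base-2 logarithm; unseen bases score -inf.
    """
    n = len(seqs)
    length = len(seqs[0])
    matrix = {}
    for base in alphabet:
        row = []
        for col in range(length):
            freq = sum(seq[col] == base for seq in seqs) / n
            row.append(math.log2(freq / background) if freq > 0 else float("-inf"))
        matrix[base] = row
    return matrix


pssm = log_odds_matrix(SITES)
for base, row in pssm.items():
    print(base, [round(x, 2) for x in row])
```

Positive entries mark bases that are more frequent at a position than expected by chance, negative entries mark bases that are rarer.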
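7 Find the score for each window of GTAGTAGAAGGTAAGTGTCCGTAG.
[Figure: three example windows highlighted in the sequence with their PSSM scores; the score values are not legible in this transcript]

Pseudocounts
[Table: counts and PSSM recomputed with pseudocounts; values not legible in this transcript]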
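The window scan described above could look like the following sketch. It reuses a base-2 log-odds matrix without pseudocounts, so any window containing a base never seen at that position scores minus infinity; both choices are assumptions for illustration, and the scores will differ from those on the original slide if it used another logarithm base or pseudocounts.

```python
import math

# The ten 5' splice site sequences from the alignment.
SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]
TARGET = "GTAGTAGAAGGTAAGTGTCCGTAG"


def log_odds_matrix(seqs, background=0.25, alphabet="ACGT"):
    """Base-2 log-odds PSSM; unseen bases score -inf (assumption)."""
    n = len(seqs)
    length = len(seqs[0])
    matrix = {}
    for base in alphabet:
        freqs = [sum(seq[col] == base for seq in seqs) / n for col in range(length)]
        matrix[base] = [
            math.log2(f / background) if f > 0 else float("-inf") for f in freqs
        ]
    return matrix


def scan(sequence, pssm):
    """Score every window of profile width; return (start, window, score)."""
    width = len(next(iter(pssm.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[base][col] for col, base in enumerate(window))
        hits.append((start, window, score))
    return hits


pssm = log_odds_matrix(SITES)
hits = scan(TARGET, pssm)
best = max(hits, key=lambda hit: hit[2])
print(f"best window {best[1]} at position {best[0] + 1}, score {best[2]:.2f}")
```

Under these assumptions the best-scoring window is AAGGTAAGT, starting at position 8 of the sequence, which is identical to sequence 10 of the training alignment.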
8 Position-Specific Scoring Matrix (PSSM): Pseudocounts
With a very large number of sequences in the multiple alignment, an observed amino acid/nt frequency is expected to be approximately equal to the actual probability of finding that amino acid/nt. However, in most cases the number of sequences is limited, so that for some amino acids/nts the observed frequency = 0 whereas the actual probability should be > 0. For this reason fake counts, pseudocounts, are added to avoid zero probability. For instance, one simple solution is to add 1 to all counts.

More sophisticated:
$q_{u,a} = \frac{n_{u,a} + \beta\, p_a}{N_{seq} + \beta}$
where
q_{u,a} = estimated probability of residue type a occurring in column u
p_a = frequency of occurrence of residue type a (based on the composition of proteins/DNA)
n_{u,a} = count of residues a in column u
N_{seq} = total number of sequences
\beta = scaling parameter
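The pseudocount formula can be tried out on a single column with a short sketch. The function name and the choice of a scaling parameter of 1 are assumptions for illustration; the slides do not give a value.

```python
def pseudocount_probabilities(column_counts, background, beta=1.0):
    """q_a = (n_a + beta * p_a) / (N_seq + beta) for one alignment column.

    column_counts: {residue: observed count in this column}
    background:    {residue: background frequency p_a}, summing to 1
    beta:          scaling parameter (hypothetical default of 1.0)
    """
    n_seq = sum(column_counts.values())
    return {
        residue: (column_counts.get(residue, 0) + beta * p) / (n_seq + beta)
        for residue, p in background.items()
    }


# Position 4 of the splice site alignment: G in all ten sequences.
column = {"A": 0, "C": 0, "G": 10, "T": 0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
q = pseudocount_probabilities(column, uniform, beta=1.0)
print({residue: round(value, 4) for residue, value in q.items()})
```

Because the background frequencies sum to 1, the estimated probabilities still sum to 1, and no residue ends up with probability zero. A larger scaling parameter pulls the estimates more strongly towards the background.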
9 Representing a profile as a sequence logo
Amount of uncertainty in column u:
$H_u = -\sum_a f_{u,a} \log_2 f_{u,a}$
where H_u is the uncertainty at position u, a is one of the four bases or, in the case of proteins, one of the 20 amino acids, and f_{u,a} is the frequency of base (amino acid) a in column u.

Total information at position u is represented by the decrease in uncertainty:
$I_u = \log_2 20 - H_u$ (proteins)
$I_u = \log_2 4 - H_u$ (DNA)
where I_u is the amount of information present at column u, and \log_2 20 (or \log_2 4) is the maximum uncertainty at any given position.

The entire set of I_u values forms a curve that represents the importance of various positions. The height of this curve is the height of the logo at that position. The size of each base/amino acid printed in a logo is determined by multiplying the frequency by the total information at that position:
Height of base/amino acid a at position u = $f_{u,a} I_u$
The bases/amino acids are then stacked on top of each other in increasing order of their frequencies and plotted.

Sequence logo example. Consider the simple amino acid multiple sequence alignment:
Seq1  A A A
Seq2  A A S
Seq3  A G T
Seq4  A G G
We use $H_u = -\sum_a f_{u,a} \log_2 f_{u,a}$ for each of the columns of the multiple alignment:
H_1 = -(1 * log2 1) = 0
H_2 = -((0.5 * log2 0.5) + (0.5 * log2 0.5)) = 1
H_3 = -(4 * 0.25 * log2 0.25) = 2
Total height of columns: I_u = log2 20 - H_u
I_1 = 4.32 - 0 = 4.32
I_2 = 4.32 - 1 = 3.32
I_3 = 4.32 - 2 = 2.32
10 Sequence logo example, cont.
Height of A at position 1 = f_A * I_1 = 1 * I_1
Height of A at position 2 = f_A * I_2 = 0.5 * I_2
Height of G at position 2 = f_G * I_2 = 0.5 * I_2
Height of A at position 3 = f_A * I_3 = 0.25 * I_3
etc.
I_1 = 4.32, I_2 = 3.32, I_3 = 2.32
This is the sequence logo obtained if the alignment above is used and "small sample correction" is deselected.
[Figure: sequence logo for the example alignment]
[Figure: sequence logo, 5' splice site example]
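The logo arithmetic for the four-sequence amino acid example can be checked with a small sketch. It applies the uncertainty and information formulas with base-2 logarithms and no small sample correction; the function name is made up for this example.

```python
import math
from collections import Counter

# The example alignment: Seq1..Seq4, three columns each.
ALIGNMENT = ["AAA", "AAS", "AGT", "AGG"]


def logo_columns(seqs, alphabet_size=20):
    """Return (uncertainty H_u, information I_u, letter heights) per column."""
    max_bits = math.log2(alphabet_size)  # log2(20) for proteins, log2(4) for DNA
    columns = []
    for column in zip(*seqs):
        n = len(column)
        freqs = {residue: count / n for residue, count in Counter(column).items()}
        uncertainty = sum(-f * math.log2(f) for f in freqs.values())
        information = max_bits - uncertainty
        heights = {residue: f * information for residue, f in freqs.items()}
        columns.append((uncertainty, information, heights))
    return columns


columns = logo_columns(ALIGNMENT)
for position, (h, i, heights) in enumerate(columns, start=1):
    letters = ", ".join(f"{res}={height:.2f}" for res, height in sorted(heights.items()))
    print(f"position {position}: H={h:.2f} I={i:.2f} heights: {letters}")
```

The three columns come out with uncertainties 0, 1 and 2 bits and information of about 4.32, 3.32 and 2.32 bits, in agreement with the worked example.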
11 [Figure: sequence logo, translation start site in bacteria]

Methods that take into account position-specific information from multiple alignments
1997: PSI-BLAST (Altschul et al.)
1990s: Profile HMMs (S. Eddy) => HMMER software
12 Principle of PSI-BLAST
[Figure: the query sequence is used in a "normal" BLAST search; database hits that pass the E-value cutoff are used to build a PSSM (columns A, C, D, ..., Y); the BLAST search is repeated with the PSSM, hits that pass the cutoff are used again, and the procedure iterates]

PSI-BLAST
PSI-BLAST is an important tool to identify remote protein similarity. It proceeds by way of the following steps:
(1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program.
(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query.
(3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm can be used for this directly.
(4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments.
(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.
Profile-alignment statistics allow PSI-BLAST to proceed as a natural extension of BLAST; the results produced in iterative search steps are comparable to those produced from the first pass.
Advantage: Unlike most profile-based search methods, PSI-BLAST runs as one program, starting with a single protein sequence, and the intermediate steps of multiple alignment and profile construction are invisible to the user.
13 PSI-BLAST - Constructing the profile
Pairwise alignments from PSI-BLAST:
Query  MKDRNLGEK
Sbjct  MKD-NLAEK

Query  MKD-RNLGEK
Sbjct  MKEARNLAEK

Query-anchored multiple alignment:
Query  MKD-RNLGEK
Sbjct  MKD--NLAEK
Sbjct  MKEARNLAEK
The column in which the query has a gap is disregarded, so the profile has the same length as the query.
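The query-anchoring step can be illustrated with a small sketch: every pairwise alignment is reduced to the columns in which the query has a residue, so that all rows end up with the length of the query. This is only an illustration of the idea, not the PSI-BLAST implementation, and the function name is hypothetical.

```python
def anchor_to_query(aligned_query, aligned_subject):
    """Keep only the subject characters in columns where the query has a residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        s for q, s in zip(aligned_query, aligned_subject) if q != "-"
    )


# The two pairwise alignments from the slide, as (query, subject) pairs.
PAIRWISE = [
    ("MKDRNLGEK", "MKD-NLAEK"),
    ("MKD-RNLGEK", "MKEARNLAEK"),
]

query = PAIRWISE[0][0].replace("-", "")
rows = [anchor_to_query(q, s) for q, s in PAIRWISE]
print("Query", query)
for row in rows:
    print("Sbjct", row)
```

The residue that the second subject inserts relative to the query is dropped, while the deletion in the first subject is kept as a gap character.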
14 PSI-BLAST tutorial
16 "This analysis illustrates not only how the search for sequence relatives can reveal the function of a protein, but also how similarity searching serves to unify formerly disparate members of a database"
[Table: corresponding protein subunits in yeast and man; most subunit numbers are not legible in this transcript]
Yeast: Pop, Pop, Pop, Rpp, Rpr, Pop, Pop, Pop8, Pop
Man: hpop, Rpp9, hpop, Rpp, Rpp, Rpp, Rpp8, Rpp, Rpp
17 The outcome of PSI-BLAST depends on the query sequence: only some Pop homologues, when used as query, identify Rpp.
Results from round (round number, scores and E-values not legible in this transcript)
Sequences producing significant alignments; sequences used in model and found again:
POP_Pichia_stipitis 8_pichia_stipitis_FM.aa.fasta unnamed p...
ref XP_9. PREDICTED: similar to ribonuclease P kda subu...
ref XP_98. PREDICTED: similar to ribonuclease P (predicte...
ref NP_. ribonuclease P kda subunit [Homo sapiens] >gi...
ref XP_. PREDICTED: similar to RPP protein [Pan troglo...
dbj BAA98. unnamed protein product [Homo sapiens]
POP_Pichia_guilliermondii supercont_ Minus (of...
ref XP_. PREDICTED: ribonuclease P (predicted) [Rattus...
gb AAH8. Ribonuclease P kda subunit [Mus musculus] >gi...
ref XP_8. PREDICTED: similar to ribonuclease P kda subu...
gb AAH8. MGC8 protein [Xenopus laevis]
ref XP_898. PREDICTED: similar to RPP protein [Gallus gal...

Profiles: Example with HTH (helix turn helix) motif
TPRQKVAIIY DVGVSTLYKR FP
IPRKQVAIIY DVAVSTLYKK FP
HPRQQLAIIF GIGVSTLYRY FP
GSKTKLAQAA GIRLASLYSW KG
TTFKQIALES GLSTGTISSF IN
IPYQEFAKLI GKSTGAVRRM ID
VTLQQFAELE GVSERTAYRW TT
FTYNQYAQMM NISRENAYGV LA
LGASHISKTM NIARSTYVKV IN
TGATEIAHQL SIARSTVYKI LE
ISISAIAREF NTTRQTILRV KA
GNISALADAE NISRKIITRC IN
MVLADIAQAV EMHESTISRV TT
LVLHDIAEAV GMHESTISRV TT
LNLRIVADAI KMHESTVSRV TS
MTRGDIGNYL GLTVETISRL LG
LSLSALSRQF GYAPTTLANA LE
MSLAELGRSN GLSSSTLKNA LD
FDIASVAQHV CLSPSRLSHL FR
LRIDEVARHV CLSPSRLAHL FR
VTLEALADQV AMSPFHLHRL FK
VLYPDIAKKF NTTASRVERA IR
18 Result of scoring with HTH profile
>lcl AADR_RHOPA (Q98) Transcriptional activatory protein aadr (Anaerobic aromatic degradation regulator)
>lcl AGLR_RHIME (Q9ZR) HTH-type transcriptional regulator aglr
>lcl ANR_PSEAE (P9) Transcriptional activator protein anr
>lcl ARAC_ERWCH (P) Arabinose operon regulatory protein
>lcl ASCG_ECOLI (P) HTH-type transcriptional regulator ascg (Cryptic asc operon repressor)
>lcl CCPA_BACME (P88) Glucose-resistance amylase regulator (Catabolite control protein)
>lcl CCPA_BACSU (P) Catabolite control protein A (Glucose-resistance amylase regulator)
>lcl CCPA_STRMU (O9) Probable catabolite control protein A
>lcl CCPB_BACSU (P) Catabolite control protein B
>lcl CENPB_CRIGR (P8988) Major centromere autoantigen B (Centromere protein B) (CENP-B)
>lcl CENPB_HUMAN (P99) Major centromere autoantigen B (Centromere protein B) (CENP-B)
>lcl CENPB_MOUSE (P9) Major centromere autoantigen B (Centromere protein B) (CENP-B)
>lcl DEGA_BACSU (P9) HTH-type transcriptional regulator dega (Degradation activator)
>lcl DEOR_BACSU (P9) Deoxyribonucleoside regulator
>lcl EBGR_ECOLI (P8) HTH-type transcriptional regulator ebgr (Ebg operon repressor)
>lcl ENDR_PAEPO (P8) Probable HTH-type transcriptional regulator endr
>lcl ETRA_SHEON (P8) Electron transport regulator A
>lcl FECI_ECOLI (P8) Probable RNA polymerase sigma factor feci
>lcl FLP_LACCA (P98) Probable transcriptional regulator flp
>lcl FNRA_PSEST (P) Transcriptional activator protein fnra
>lcl FNRN_RHILV (P9) Probable transcriptional activator (ORF-)
>lcl FNR_ACTAC (Q9EXQ) Anaerobic regulatory protein
>lcl FNR_ECO (PA9E) Fumarate and nitrate reduction regulatory protein
>lcl FNR_ECOL (PA9E) Fumarate and nitrate reduction regulatory protein
>lcl FNR_ECOLI (PA9E) Fumarate and nitrate reduction regulatory protein
>lcl FNR_HAEIN (P99) Anaerobic regulatory protein
>lcl FNR_KLEOX (Q9AQ) Fumarate nitrate reduction regulatory protein
>lcl FNR_PASMU (Q9CMY) Anaerobic regulatory protein
19 Profile HMMs: HMMER software package
hmmbuild   Build a model from a multiple sequence alignment.
hmmpfam    Search an HMM database for matches to a query sequence.
hmmsearch  Search a sequence database for matches to a single profile HMM.

Pfam database - an attempt to completely and accurately classify protein families and domains
"All science is either physics or stamp collecting" (Ernest Rutherford)
20 Profile HMMs: Pfam database
Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam HMM represents a protein family or domain. By searching a protein sequence against the Pfam library of HMMs you can find out its domain architecture. Pfam may also be used to analyse proteomes and domain architectures.

Two categories of families:
Pfam-A. Pfam-A families are manually curated, HMM-based families which are built using an alignment of a small number of representative sequences ('seed' alignment). A threshold is manually set for each HMM, and this determines the minimum score a sequence must attain to belong to the family. The HMMs are searched against the UniProt database, and all sequences that score above the cut-off value for a particular family are included in the family's full alignment. Pfam-A matches are very unlikely to be false matches.
Pfam-B. To complement the Pfam-A families, Pfam-B families are automatically generated using the PRODOM database. Pfam-B families are formed by taking alignments of sequence segments from PRODOM, and removing any Pfam-A residues from them. (PRODOM is a database of protein domain sequence families constructed using PSI-BLAST analysis of protein sequences as well as using information from the SCOP database.)

All families in Pfam are non-overlapping, such that no amino acid belongs to more than one family/domain.

Two HMMs for each Pfam entry. For each Pfam entry two HMMs are built, one to represent full-length matches (ls model), and one to represent fragment matches (fs model).
21 Complexity of Pfam, PfamA families,, protein sequences in Uniprot analyzed => on average ~. PfamA domains/protein a total of, different architectures
22 Databases related to Pfam
PRODOM
CDD
SMART (smart.embl-heidelberg.de/)
INTERPRO, which combines information from: Pfam, Prints, SMART, Prosite, PRODOM, CDD
23 SMART
24 Databases like InterPro have aided considerably in the annotation of the human genome
25 Exercises
Compare pairwise alignment methods (BLAST, FASTA, SSEARCH) to profile methods (PSI-BLAST, HMMER).
Protein domain studied: the SH2 domain, originally found in the oncoproteins Src and Fps. SH2 domains are found in many proteins taking part in signal transduction pathways. The function of SH2 domains is to specifically recognize the phosphorylated state of tyrosine residues, thereby allowing SH2 domain-containing proteins to localize to tyrosine-phosphorylated sites.
26 SH2 domain
Step 1. Extract the SH2 domain from the human SRC protein
% extractseq src_human.fa
Step 2. Search with the extracted domain

Pairwise alignment methods
1) BLAST: see the PSI-BLAST step below
2) FASTA
% fasta [input_file] [database] > result
3) SSEARCH
% ssearch [input_file] [database] > result

Profile-based methods
PSI-BLAST
% blastpgp -i [input_file] -d [database] -j [number_of_rounds] -o output_file
(1st round: normal BLAST search)
HMMER
% hmmsearch shfs.hmm [database] > result_file
where shfs.hmm is the HMM profile for the SH2 domain
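The `blastpgp` program used above belongs to the legacy BLAST package; in current BLAST+ installations the corresponding program is `psiblast`. The sketch below only assembles the command lines, it does not run them. The file and database names are hypothetical placeholders, and the number of rounds is an arbitrary choice for the example.

```python
import shlex


def psiblast_command(query, database, rounds, output):
    """Argument list for an iterative BLAST+ psiblast search."""
    return [
        "psiblast",
        "-query", query,
        "-db", database,
        "-num_iterations", str(rounds),
        "-out", output,
    ]


def hmmsearch_command(hmm_file, database, output):
    """Argument list for searching a sequence database with one profile HMM."""
    return ["hmmsearch", "-o", output, hmm_file, database]


# Hypothetical input names, for illustration only.
psiblast_args = psiblast_command("sh_domain.fa", "swissprot", 3, "psiblast.out")
hmmsearch_args = hmmsearch_command("shfs.hmm", "swissprot.fa", "hmmsearch.out")

print(shlex.join(psiblast_args))
print(shlex.join(hmmsearch_args))
```

Keeping the arguments as a list makes it straightforward to pass them to `subprocess.run` later without shell quoting problems.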
27 HIT proteins / uridylyltransferases
The Histidine Triad (HIT) motif, His-phi-His-phi-His-phi-phi (phi, a hydrophobic amino acid), was identified as being highly conserved in a variety of species. Proteins in the HIT superfamily are conserved as nucleotide-binding proteins, and are structurally related to a family of enzymes that includes GalT, a uridylyltransferase. This relationship was first revealed by structural analysis, but may also be detected using PSI-BLAST.

Relationship of ATP- and NAD-dependent DNA ligases
ATP-dependent DNA ligases: Eukarya, Archaea
NAD-dependent DNA ligases: Eubacteria
Previously these enzymes were believed to be evolutionarily unrelated, but PSI-BLAST provides evidence that they are related.
More informationSUPPLEMENTARY INFORMATION
Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,
More informationHIDDEN MARKOV MODELS FOR REMOTE PROTEIN HOMOLOGY DETECTION
From THE CENTER FOR GENOMICS AND BIOINFORMATICS Karolinska Institutet, Stockholm, Sweden HIDDEN MARKOV MODELS FOR REMOTE PROTEIN HOMOLOGY DETECTION Markus Wistrand Stockholm 2005 All previously published
More informationBioinformatics Chapter 1. Introduction
Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!
More informationSI Materials and Methods
SI Materials and Methods Gibbs Sampling with Informative Priors. Full description of the PhyloGibbs algorithm, including comprehensive tests on synthetic and yeast data sets, can be found in Siddharthan
More informationAlignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)