Amino Acid Structures from Klug & Cummings. Bioinformatics (Lec 12)

Similar documents
Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Protein Structures. Sequences of amino acid residues 20 different amino acids. Quaternary. Primary. Tertiary. Secondary. 10/8/2002 Lecture 12 1

Amino Acid Structures from Klug & Cummings. 10/7/2003 CAP/CGS 5991: Lecture 7 1

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Mining Protein Sequences for Motifs

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

CSCE555 Bioinformatics. Protein Function Annotation

-max_target_seqs: maximum number of targets to report

Ch. 9 Multiple Sequence Alignment (MSA)

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Hidden Markov Models

Advanced Certificate in Principles in Protein Structure. You will be given a start time with your exam instructions

Introduction to Bioinformatics

Genome Annotation Project Presentation

Matrix-based pattern discovery algorithms

Protein Structure: Data Bases and Classification Ingo Ruczinski

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1.

EBI web resources II: Ensembl and InterPro

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

Sequence analysis and comparison

Structure to Function. Molecular Bioinformatics, X3, 2006

Bioinformatics III Structural Bioinformatics and Genome Analysis Part Protein Secondary Structure Prediction. Sepp Hochreiter

Computational Molecular Biology (

Comparative Features of Multicellular Eukaryotic Genomes

Week 10: Homology Modelling (II) - HHpred

Patterns and profiles applications of multiple alignments. Tore Samuelsson March 2013

Discovering Binding Motif Pairs from Interacting Protein Groups

Protein function prediction based on sequence analysis

Outline. Levels of Protein Structure. Primary (1 ) Structure. Lecture 6:Protein Architecture II: Secondary Structure or From peptides to proteins

Secondary Structure. Bioch/BIMS 503 Lecture 2. Structure and Function of Proteins. Further Reading. Φ, Ψ angles alone determine protein structure

Gibbs Sampling Methods for Multiple Sequence Alignment

Multiple sequence alignment

Christian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Transcrip:on factor binding mo:fs

Transcription Regulation And Gene Expression in Eukaryotes UPSTREAM TRANSCRIPTION FACTORS

Practical considerations of working with sequencing data

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

Hidden Markov Models (HMMs) and Profiles

Gene Control Mechanisms at Transcription and Translation Levels

CSEP 590A Summer Lecture 4 MLE, EM, RE, Expression

CSEP 590A Summer Tonight MLE. FYI, re HW #2: Hemoglobin History. Lecture 4 MLE, EM, RE, Expression. Maximum Likelihood Estimators

RNA and Protein Structure Prediction

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Motivating the need for optimal sequence alignments...

Motifs, Profiles and Domains. Michael Tress Protein Design Group Centro Nacional de Biotecnología, CSIC

Outline. Levels of Protein Structure. Primary (1 ) Structure. Lecture 6:Protein Architecture II: Secondary Structure or From peptides to proteins

Peter Pristas. Gene regulation in eukaryotes

SUPPLEMENTARY MATERIALS

Overview Multiple Sequence Alignment

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008

Goals. Structural Analysis of the EGR Family of Transcription Factors: Templates for Predicting Protein DNA Interactions

Large-Scale Genomic Surveys

Protein Structure Prediction Using Neural Networks

Protein Structure & Motifs

A New Similarity Measure among Protein Sequences

Bioinformatics for Biologists

CAP 5510 Lecture 3 Protein Structures

Computational Biology: Basics & Interesting Problems

Intro Secondary structure Transmembrane proteins Function End. Last time. Domains Hidden Markov Models

In-Silico Approach for Hypothetical Protein Function Prediction

Supporting Information

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

Today. Last time. Secondary structure Transmembrane proteins. Domains Hidden Markov Models. Structure prediction. Secondary structure

3.B.1 Gene Regulation. Gene regulation results in differential gene expression, leading to cell specialization.

Chapter 7: Regulatory Networks

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

1-D Predictions. Prediction of local features: Secondary structure & surface exposure

Introduction to Hidden Markov Models for Gene Prediction ECE-S690

13.4 Gene Regulation and Expression

De novo identification of motifs in one species. Modified from Serafim Batzoglou s lecture notes

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

Protein Secondary Structure Prediction

Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell

Predicting Protein Functions and Domain Interactions from Protein Interactions

The Eukaryotic Genome and Its Expression. The Eukaryotic Genome and Its Expression. A. The Eukaryotic Genome. Lecture Series 11

Today s Lecture: HMMs

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

Regulation of Gene Expression

MBLG lecture 5. The EGG! Visualising Molecules. Dr. Dale Hancock Lab 715

Incorporating dependence into models for DNA motifs

A Protein Ontology from Large-scale Textmining?

Lecture 7 Sequence analysis. Hidden Markov Models

Self Similar (Scale Free, Power Law) Networks (I)

Hybridizing Adaptive Biogeography-Based Optimization with Differential Evolution for Motif Discovery Problem

2 The Proteome. The Proteome 15

CSE 527 Autumn Lectures 8-9 (& part of 10) Motifs: Representation & Discovery

Bioinformatics: Secondary Structure Prediction

DNA Binding Proteins CSE 527 Autumn 2007

Translation and Operons

Flow of Genetic Information

Bio nformatics. Lecture 23. Saad Mneimneh

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Rule learning for gene expression data

5/4/05 Biol 473 lecture

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus:

Multiple Sequence Alignment (MSA) BIOL 7711 Computational Bioscience

Transcription:

Amino Acid Structures from Klug & Cummings 2/17/05 1

Amino Acid Structures from Klug & Cummings 2/17/05 2

Amino Acid Structures from Klug & Cummings 2/17/05 3

Amino Acid Structures from Klug & Cummings 2/17/05 4

Motif Detection Tools PROSITE (Database of protein families & domains) Try PDOC00040. Also Try PS00041 PRINTS Sample Output BLOCKS (multiply aligned ungapped segments for highly conserved regions of proteins; automatically created) Sample Output Pfam (Protein families database of alignments & HMMs) Multiple Alignment, domain architectures, species distribution, links: Try MoST PROBE ProDom DIP 2/17/05 5

Protein Information Sites SwissPROT & GenBank InterPRO is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences. See sample. PIR Sample Protein page 2/17/05 6

Modular Nature of Proteins Proteins are collections of modular domains. For example, Coagulation Factor XII F2 E F2 E K Catalytic Domain F2 E K K Catalytic Domain PLAT 2/17/05 7

CDART Domain Architecture Tools Protein AAH24495; Domain Architecture; It s domain relatives; Multiple alignment for 2 nd domain SMART 2/17/05 8

Predicting Specialized Structures COILS Predicts coiled coil motifs TMPred predicts transmembrane regions SignalP predicts signal peptides SEG predicts nonglobular regions 2/17/05 9

Motifs in Protein Sequences Motifs are combinations of secondary structures in proteins with a specific structure and a specific function. They are also called super-secondary structures. Examples: Helix-Turn-Helix, Zinc-finger, Homeobox domain, Hairpin-beta motif, Calcium-binding motif, Beta-alpha-beta motif, Coiled-coil motifs. Several motifs may combine to form domains. Serine proteinase domain, Kringle domain, calcium-binding domain, homeobox domain. 2/17/05 10

Motif Detection Problem Input: Set, S, of known (aligned) examples of a motif M, A new protein sequence, P. Output: Does P have a copy of the motif M? Example: Zinc Finger Motif YKCGLCERSFVEKSALSRHORVHKN 3 6 19 23 Input: Output: Database, D, of known protein sequences, A new protein sequence, P. What interesting patterns from D are present in P? 2/17/05 11

Motifs in DNA Sequences Given a collection of DNA sequences of promoter regions, locate the transcription factor binding sites (also called regulatory elements) Example: 2/17/05 12 http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html

Motifs 2/17/05 13 http://weblogo.berkeley.edu/examples.html

Motifs in DNA Sequences 2/17/05 14 http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html

More Motifs in E. Coli DNA Sequences 2/17/05 15 http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html

2/17/05 16 http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html

Other Motifs in DNA Sequences: Human Splice Junctions 2/17/05 17 http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html

Motifs in DNA Sequences 2/17/05 18

Profile Method Motif Detection If many examples of the motif are known, then Training: build a Profile and compute a threshold Testing: score against profile Gibbs Sampling Expectation Method HMM Combinatorial Pattern Discovery Methods 2/17/05 19

How to evaluate these methods? Calculate TP, FP, TN, FN Compute sensitivity fraction of known sites predicted, specificity, and more. Sensitivity= TP/(TP+FN) Specificity = TN/(TN+FN) Positive Predictive Value = TP/(TP+FP) Performance Coefficient = TP/(TP+FN+FP) Correlation Coefficient = 2/17/05 20

Helix-Turn-Helix Motifs Structure 3-helix complex Length: 22 amino acids Turn angle Function Gene regulation by binding to DNA 2/17/05 21 Branden & Tooze

DNA Binding at HTH Motif 2/17/05 22 Branden & Tooze

HTH Motifs: Examples Loc Protein Helix 2 Turn Helix 3 Name -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 14 Cro F G Q E K T A K D L G V Y Q S A I N K A I H 16 434 Cro M T Q T E L A T K A G V K Q Q S I Q L I E A 11 P22 Cro G T Q R A V A K A L G I S D A A V S Q W K E 31 Rep L S Q E S V A D K M G M G Q S G V G A L F N 16 434 Rep L N Q A E L A Q K V G T T Q Q S I E Q L E N 19 P22 Rep I R Q A A L G K M V G V S N V A I S Q W E R 24 CII L G T E K T A E A V G V D K S Q I S R W K R 4 LacR V T L Y D V A E Y A G V S Y Q T V S R V V N 167 CAP I T R Q E I G Q I V G C S R E T V G R I L K 66 TrpR M S Q R E L K N E L G A G I A T I T R G S N 22 BlaA Pv L N F T K A A L E L Y V T Q G A V S Q Q V R 23 TrpI Ps N S V S Q A A E Q L H V T H G A V S R Q L K 2/17/05 23

Basis for New Algorithm Combinations of residues in specific locations (may not be contiguous) contribute towards stabilizing a structure. Some reinforcing combinations are relatively rare. 2/17/05 24

New Motif Detection Algorithm Pattern Generation: Aligned Motif Examples Pattern Generator Motif Detection: New Protein Sequence Motif Detector Pattern Dictionary Detection Results 2/17/05 25

Patterns Loc Protein Helix 2 Turn Helix 3 Name -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 14 Cro F G Q E K T A K D L G V Y Q S A I N K A I H 16 434 Cro M T Q T E L A T K A G V K Q Q S I Q L I E A 11 P22 Cro G T Q R A V A K A L G I S D A A V S Q W K E 31 Rep L S Q E S V A D K M G M G Q S G V G A L F N 16 434 Rep L N Q A E L A Q K V G T T Q Q S I E Q L E N 19 P22 Rep I R Q A A L G K M V G V S N V A I S Q W E R 24 CII L G T E K T A E A V G V D K S Q I S R W K R 4 LacR V T L Y D V A E Y A G V S Y Q T V S R V V N 167 CAP I T R Q E I G Q I V G C S R E T V G R I L K 66 TrpR M S Q R E L K N E L G A G I A T I T R G S N 22 BlaA Pv L N F T K A A L E L Y V T Q G A V S Q Q V R 23 TrpI Ps N S V S Q A A E Q L H V T H G A V S R Q L K Q1 G9 N20 A5 G9 V10 I15 2/17/05 26

Pattern Mining Algorithm Algorithm Pattern-Mining Input: Motif length m, support threshold T, list of aligned motifs M. Output: Dictionary L of frequent patterns. 1. L 1 := All frequent patterns of length 1 2. for i = 2 to m do 3. C i := Candidates(L i-1 ) 4. L i := Frequent candidates from C i 5. if ( L i <= 1) then 6. return L as the union of all L j, j <= i. 2/17/05 27

Candidates Function L 3 G1, V2, S3 G1, V2, T6 G1, V2, I7 G1, V2, E8 G1, S3, T6 G1, T6, I7 V2, T6, I7 V2, T6, E8 C 4 G1, V2, S3, T6 G1, V2, S3, I7 G1, V2, S3, E8 G1, V2, T6, I7 G1, V2, T6, E8 G1, V2, I7, E8 V2, T6, I7, E8 L 4 G1, V2, S3, T6 G1, V2, S3, I7 G1, V2, S3, E8 G1, V2, T6, E8 V2, T6, I7, E8 2/17/05 28

Motif Detection Algorithm Algorithm Motif-Detection Input : Motif length m, threshold score T, pattern dictionary L, and input protein sequence P[1..n]. Output : Information about motif(s) detected. 1. for each location i do 2. S := MatchScore(P[i..i+m-1], L). 3. if (S > T) then 4. Report it as a possible motif 2/17/05 29

Experimental Results: GYM 2.0 Motif HTH Motif (22) Protein Number GYM = DE Number GYM = Annot. Family Tested Agree Annotated Master 88 88 (100 %) 13 13 Sigma 314 284 + 23 (98 %) 96 82 Negates 93 86 (92 %) 0 0 LysR 130 127 (98 %) 95 93 AraC 68 57 (84 %) 41 34 Rreg 116 99 (85 %) 57 46 Total 675 653 + 23 (94 %) 289 255 (88 %) 2/17/05 30

Experiments Basic Implementation (Y. Gao) Improved implementation & comprehensive testing (K. Mathee, GN). Implementation for homeobox domain detection (X. Wang). Statistical methods to determine thresholds (C. Bu). Use of substitution matrix (C. Bu). Study of patterns causing errors (N. Xu). Negative training set (N. Xu). NN implementation & testing (J. Liu & X. He). HMM implementation & testing (J. Liu & X. He). 2/17/05 31

Motif Detection (TFBMs) See evaluation by Tompa et al. [bio.cs.washington.edu/assessment] Gibbs Sampling Methods: AlignACE, GLAM, SeSiMCMC, MotifSampler Weight Matrix Methods: ANN-Spec, Consensus, EM: Improbizer, MEME Combinatorial & Misc.: MITRA, oligo/dyad, QuickScore, Weeder, YMF 2/17/05 32

Gibbs Sampling for Motif Detection 2/17/05 33