Domain-based computational approaches to understand the molecular basis of diseases


Dr. Maricel G. Kann, Assistant Professor, Dept. of Biological Sciences, UMBC. http://bioinf.umbc.edu

Research at Kann's Lab
Bioinformatics and Computational Biology:
- Protein Domain Recognition
- Human Protein Domain Database (HPDD)
- Developing new metrics to compare bioinformatics methodologies
Systems Biology:
- Computational approaches to predict domain-domain interactions
- Understanding co-evolution of protein and domain interactions
Understanding the molecular basis of diseases:
- Mapping disease mutational data into HPDD
- Text-mining of abstracts to extract disease mutations
- Using domain profiling to analyze gene expression data

Outline
- Introduction
- Sequence Similarity and the BLAST revolution
- Protein Domains and their relevance
- Methods for Protein Domain Recognition
- HPDD: Human Protein Domain Database
- Predictors of Domain-Domain Interactions

Definitions
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

What is systems biology?
Systems Biology: the study of complex biological processes in a manner that seeks to understand how individual molecular components combine on a global scale to yield particular structures, functions, and behaviors in response to specific perturbations.
Alternative perspective:
Q: What is mathematics? A: The thing that mathematicians do.
Q: What is systems biology? A: The thing that most of us will be doing in a few years?
Slide by Teresa Przytycka (NCBI, NIH)

The Human Genome

Human Genome Project GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGA GACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCT GAAGTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTA ACATATTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACG CAAAAATTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCA ATTCGCGTCACAAATAAATTTTGGCAACTTATGTTTCCTCTTCGAGCAGTACTCGAGCCCTGTCTCAAG AATGTAATAATACCCATCGTAGGTATGGTTAAAGATAGCATCTCCACAACCTCAAAGCTCCTTGCCGAG AGTCGCCCTCCTTTGTCGAGTAATTTTCACTTTTCATATGAGAACTTATTTTCTTATTCTTTACTCTCACA TCCTGTAGTGATTGACACTGCAACAGCCACCATCACTAGAAGAACAGAACAATTACTTAATAGAAAAAT TATATCTTCCTCGAAGGCTAATCGATAACTGACGATTTCCTGCTTCCAACATCTACGTATATCAAGAAG CATTCACTTACCATGACACAGCTTCAGATTTCATTATTGCTGACAGCTACTATATCACTACTCCATCTAG TAGTGGCCACGCCCTATGAGGCATATCCTATCGGAAAACAATACCCCCCAGTGGCAAGAGTCAATGAA TCGTTTACATTTCAAATTTCCAATGATACCTATAAATCGTCTGTAGACAAGACAGCTCAAATAACATACA ATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACTCTAGTTCTAGAACGTTCTCAGGTGAACCTTCTT CTGACTTACTATCTGATGCGAACACCACGTTGTATTTCAATGTAATACTCGAGGGTACGGACTCTGCCG ACAGCACGTCTTTGAACAATACATACCAATTTGTTGTTACAAACCGTCCATCCATCTCGCTATCGTCAG ATTTCAATCTATTGGCGTTGTTAAAAAACTATGGTTATACTAACGGCAAAAACGCTCTGAAACTAGATC CTAATGAAGTCTTCAACGTGACTTTTGACCGTTCAATGTTCACTAACGAAGAATCCATTGTGTCGTATTA CGGACGTTCTCAGTTGTATAATGCGCCGTTACCCAATTGGCTGTTCTTCGATTCTGGCGAGTTGAAGTT TACTGGGACGGCACCGGTGATAAACTCGGCGATTGCTCCAGAAACAAGCTACAGTTTTGTCATCATCG CTACAGACATTGAAGGATTTTCTGCCGTTGAGGTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTA ACTACCTCTATTCAAAATAGTTTGATAATCAACGTTACTGACACAGGTAACGTTTCATATGACTTACCTC TAAACTATGTTTATCTCGATGACGATCCTATTTCTTCTGATAAATTGGGTTCTATAAACTTATTGGATGC TCCAGACTGGGTGGCATTAGATAATGCTACCATTTCCGGGTCTGTCCCAGATGAATTACTCGGTAAGA ACTCCAATCCTGCCAATTTTTCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATTTCAACTTCGA AGTTGTCTCCACAACGGATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAATG GTTCTCCTACTATTTTTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTACT AATTCAAGCCAAGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAGTGCCC 
AAGAATTTCGACAAGCTTTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAGCTATATTTT AACATCATTGGCATGGATTCAAAGATAACTCACTCAAACCACAGTGCGAATGCAACGTCCACAAGAAG TTCTCACCACTCCACCTCAACAAGTTCTTACACATCTTCTACTTACACTGCAAAAATTTCTTCTACCTCC GCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAATAAAACTTCATCTCACAATAAAAAA GCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAGTAGCTCTCATTTGCTTCCTA ATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTACCGCATGCTATTAGTGGACCTGA TTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTACACCTTTGAACAACCCCTTTGATGATGA

Sidney Harris

What are we interested in?
DNA sequence: GATCCTCCATATACAACGGTATCTCCACCT CAGGTTTAGATCTCAACAACGGAACCATTG CCGACATGAGACAGTTAGGTATCGTCGAG AGTTACAAGCTAAAACGAGCAGTAGTCAG CTCTGCATCTGAAGCCGCTGAAGTTCTACT AAGGGTGGATAACATCATCCGTGCAAGAC CAAGAACCGCCAATAGACAACATATGTAA CATATTTAGGATATACCTCGAAAATAATAA ACCGCCACACTGTCATTATTATAATTAGAA ACAGAACGCAGCTACAGACATTGAAGGAT TTTCT
Protein sequence: MTQLQISLLLTATISLLHLVVATPYEA YPIGKQYPPVARVNESFTFQISNDTYK SSVDKTAQITYNCFDLPSWLSFDSSSR TFSGEPSSDLLSDANTTLYFNVILEGT DSADSTSLNNTYQFVVTNRPSISLSSD FNLLALLKNYGYTNGKNALKLDPNE VFNVTFDRSMFTNEESIVSYYGRSQL YNAPLPNWLFRRRENPDDENLPHAIS GPDLNNPANKPNQENATPLNNPFDDD
- What is this protein?
- What is its function?
- What does it interact with?
- Is this protein involved in a disease?
Kann et al. JMB (2008); Kann et al. Proteins (2007)

Outline
- Introduction
- Sequence Similarity and the BLAST revolution
- Protein Domains and why they are important
- GLOBAL: A tool for Protein Domain Recognition
- HPDD: Human Protein Domain Database
- Predictors of Domain-Domain Interactions

The BLAST Revolution
- BLAST: Basic Local Alignment Search Tool
- Transferring functional information using sequence similarity
- BLAST is fast!

Protein Classification
QUERY -> Alignment Algorithm, Scoring Function, Accurate Statistics -> Set of related sequences or protein family from database
[Figure: the query A L I G N M E N T aligned against a database sequence, with and without gaps.]
A L I G - N M E N T
A L I G G N M E N -
A L I G G N M E N
Per-position scores: 4 3 4 7 1 2 -2 0 0; score = 19
PAM: Dayhoff et al. (1978); BLOSUM: Henikoff & Henikoff (1992); OPTIMA: Kann et al. (2000).
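The idea of an alignment score as a sum of per-position scores can be sketched in a few lines of Python. This is a minimal illustration, not the scoring function used in the talk: the `match`, `mismatch` and `gap` values are made-up placeholders standing in for a real substitution matrix such as PAM or BLOSUM, and `score_alignment` is a hypothetical helper name.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1, gap=-2):
    """Sum per-column scores of two equal-length aligned rows ('-' marks a gap).

    The match/mismatch/gap values are illustrative placeholders; a real tool
    would look each residue pair up in a substitution matrix.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap          # column contains a gap
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # substitution
    return total


# Per-position scores as listed on the slide; their sum is the alignment score.
slide_column_scores = [4, 3, 4, 7, 1, 2, -2, 0, 0]
slide_total = sum(slide_column_scores)

# Toy score for the gapped alignment shown on the slide.
toy_total = score_alignment("ALIG-NMENT", "ALIGGNMEN-")
```

The slide's per-position scores add up to the stated total of 19. The toy score differs because the placeholder values are not taken from any published matrix.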

Significance of a score
E: estimated number of non-related sequences in the database that score higher than the query.
$E = p(S_Q < S_R) \cdot D$
D = size of database; $S_Q$ = score of the query alignment; $S_R$ = score of a random alignment.

[Figure: number of alignments with score S plotted against S, showing the distribution of random scores and the query score $S_Q$.]
$p(S_Q < S_R) = 1 - \exp\left[-K M N e^{-\lambda S_Q}\right]$
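As a worked illustration of the two formulas above, the sketch below computes the probability that a random alignment beats the query and the resulting E-value. The slides do not define K, λ, M and N; the code assumes the usual extreme-value reading, in which K and λ are parameters of the scoring system and M and N are the lengths of the two compared sequences. The function names are hypothetical.

```python
import math


def p_random_beats_query(s_q, k, lam, m, n):
    """p(S_Q < S_R) = 1 - exp(-K * M * N * exp(-lambda * S_Q)).

    Uses expm1 so that very small probabilities are not rounded to zero.
    """
    x = k * m * n * math.exp(-lam * s_q)
    return -math.expm1(-x)


def e_value(s_q, k, lam, m, n, d):
    """E = p(S_Q < S_R) * D, with D the size of the database."""
    return p_random_beats_query(s_q, k, lam, m, n) * d


# Illustrative parameter values only; they are not taken from the talk.
p_low = p_random_beats_query(s_q=30, k=0.1, lam=0.3, m=250, n=300)
p_high = p_random_beats_query(s_q=60, k=0.1, lam=0.3, m=250, n=300)
```

A higher query score gives a smaller probability of being beaten by chance, and therefore a smaller E-value for the same database size.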



Protein Domains
The term protein domain (or domain) refers to a region of the protein with compact structure, usually with a hydrophobic core.

Conserved Domains
- In 1974 Michael Rossmann recognized the NADH-binding domain in several dehydrogenases (now named after him).
- Conserved domains are determined by comparative sequence analysis.
- Molecular evolution uses such domains as building blocks. They may be recombined in different arrangements to make proteins with different functions.
- Most proteins contain multiple domains (65% in eukaryotes, 40% in prokaryotes), giving rise to a variety of combinations of domains.

[Figure: domain annotation with the heme-binding site marked.]
It combines information about protein sequences, their conservation patterns across evolution, and protein structure, and provides useful functional annotation.
Marchler-Bauer et al. (2003) NAR 31:383-387


Protein Classification
QUERY -> Alignment Algorithm, Scoring Function, Accurate Statistics -> Set of related sequences or protein family from database
A PSSM can be derived from the MSA.
A PSSM, or Position-Specific Scoring Matrix (or profile), is a type of scoring matrix in which amino acid substitution scores are given separately for each position in a protein multiple sequence alignment.
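One common way to derive a PSSM from an MSA is to turn per-column residue counts into log-odds scores. The sketch below assumes that approach, with a flat pseudocount and a uniform background distribution; both are simplifications chosen for illustration and are not stated in the talk. `build_pssm` and `score_sequence` are hypothetical names, and the scoring is ungapped.

```python
import math
from collections import Counter


def build_pssm(msa, alphabet, background=None, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    if background is None:
        # Assumption: uniform background frequencies over the alphabet.
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all MSA rows must have the same length")
    pssm = []
    for col in range(length):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(row[col] for row in msa if row[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        column = {}
        for a in alphabet:
            freq = (counts[a] + pseudocount) / total
            column[a] = math.log2(freq / background[a])
        pssm.append(column)
    return pssm


def score_sequence(pssm, seq):
    """Ungapped score of a sequence placed against the PSSM, position by position."""
    return sum(column[res] for column, res in zip(pssm, seq))


# Toy alignment over a reduced alphabet, for illustration only.
toy_msa = ["ALIG", "ALIG", "AVIG"]
toy_pssm = build_pssm(toy_msa, alphabet="AGILV")
```

In the toy example the second column scores L highest, V lower, and residues never seen at that position lowest, which is the position-specific behaviour the definition describes.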

MSA contains conserved blocks

Protein Sequence Conservation Occurs in Blocks with Intervening Gaps
[Figure: protein structure alignment of a red sequence and a blue sequence, with α-helix, β-strand and loops labelled.]
Subsequences corresponding to secondary structure elements (SSEs: α-helices and β-strands) are more conserved than the intervening loops.

Protein domain representation
[Figure: CDD footprint with blocks 1 and 2 separated by gaps.]

Sequence-PSSM alignment
A L I G N M E N T

ROC curve for GLOBAL

Method            ROC 10000  ROC 50000  ROC 200000
GLOBAL            0.181      0.224      0.313
HMMer semiglobal  0.185      0.224      0.299
HMMer local       0.169      0.194      0.239
rpsblast          0.168      0.192      0.229

[Figure: ROC curves for GLOBAL, HMMer-semi-global, HMMer-local and RPS-BLAST. x-axis: fraction of false positives (0.00 to 0.05); y-axis: fraction of true positives (0.00 to 0.40).]
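The slide does not say how the "ROC 10000" style columns are computed. A plausible reading, assumed here, is the truncated ROC_n score: the area under the ROC curve up to the first n false positives, normalized so that a perfect ranking scores 1. The sketch below implements that definition for a list of hits already sorted from best to worst score; `roc_n` is a hypothetical helper name.

```python
def roc_n(ranked_labels, n):
    """Truncated ROC score for hits sorted from best to worst.

    ranked_labels: iterable of booleans (True = true positive).
    n: number of false positives at which the curve is cut off.
    """
    labels = list(ranked_labels)
    total_pos = sum(1 for is_true in labels if is_true)
    if total_pos == 0 or n <= 0:
        raise ValueError("need at least one true positive and n > 0")
    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen      # true positives ranked above this false positive
            if false_seen == n:
                break
    # If the list holds fewer than n false positives, every true positive
    # ranks above the missing ones.
    area += (n - false_seen) * true_seen
    return area / (n * total_pos)
```

For example, `roc_n([True, False, True, False], 2)` gives 0.75, while a ranking with all true positives first gives 1.0.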

GLOBAL Method


Suznick et al. NAR (submitted)

HPDD: Gene Pages

HPDD: Protein Pages


Building HPDD
HPDD Statistics: total of 4,488 human protein domains
- 2,578 from Pfam
- 1,402 curated from CDD
- 407 from Smart
- 97 from COG
- 4 from PRK
Suznick et al. NAR (submitted)


Prediction of protein-protein or domain-domain interactions
Why do we need to use computational methods to predict interactions?
- Experimental data are noisy and incomplete.
- From the successes and failures of computational methods we can learn about the nature of interactions.
- In the case of domain-domain interactions, no large-scale data are available.

Mirrortree Method with correction for speciation
[Figure labels: Organism 1, Organism 2, Organism 3, ..., Organism n; Orthologs; MSA of Protein A; MSA of Protein B; Canonical tree; (optional method-dependent step); Phylogenetic Trees; Distance Matrices ΔA and ΔB: similar?]
Pazos et al. JMB (2005); Sato et al. Bioinformatics (2005); Kann et al. JMB (2008); Kann et al. Proteins (2007)
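The comparison of the two distance matrices can be sketched as a Pearson correlation over their upper triangles. The version below is a simplified illustration: it assumes both matrices list the same organisms in the same order, and it models the speciation correction as a plain element-wise subtraction of a species distance matrix, whereas the cited papers each use their own correction scheme. All function names are hypothetical.

```python
import math


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def mirrortree_score(dist_a, dist_b, dist_species=None):
    """Correlate two distance matrices, optionally after removing speciation.

    dist_species is an assumed stand-in for the canonical tree distances;
    subtracting it element-wise is a simplification.
    """
    a = upper_triangle(dist_a)
    b = upper_triangle(dist_b)
    if dist_species is not None:
        s = upper_triangle(dist_species)
        a = [x - y for x, y in zip(a, s)]
        b = [x - y for x, y in zip(b, s)]
    return pearson(a, b)
```

A score close to 1 means the two families have similar evolutionary distance patterns across the organisms. Removing the shared speciation signal first reduces the correlation that any two proteins from the same set of species would show.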

Predicting Protein interactions:

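[Figure: example pair 1gph_1 and 1gph_2, each with an MSA over species 1, species 2, ..., species n.]
1. Identify binding neighborhoods and map them onto the MSA.
2. Compute vectors of pairwise distances for randomly selected positions (r1, r2) and for binding positions (b1, b2), labelled l and m on the slide.
3. Subtract speciation (s) from each vector.
4. Compute correlations: between b1 and b2 (binding) and between r1 and r2 (random).
5. Compare results.
[Figure: fraction of true positives against fraction of false positives, for the correlation between binding positions and the correlation between random positions.]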
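Steps 1 and 2 can be sketched as follows, under stated assumptions: the binding neighborhood is given as a list of MSA column indices, the distance between two species is the fraction of differing residues over those columns (a p-distance, chosen here only for simplicity), and the random control draws the same number of columns uniformly. Function names and the toy alignment are hypothetical.

```python
import random


def pairwise_distance_vector(msa, columns):
    """Distances for every species pair, restricted to the given MSA columns.

    msa: list of equal-length aligned sequences, one per species.
    Distance = fraction of selected columns at which the two rows differ.
    """
    if not columns:
        raise ValueError("at least one column is required")
    vector = []
    for i in range(len(msa)):
        for j in range(i + 1, len(msa)):
            differing = sum(1 for c in columns if msa[i][c] != msa[j][c])
            vector.append(differing / len(columns))
    return vector


def random_columns(msa_length, k, seed=0):
    """Pick k distinct column indices at random, as a control for binding columns."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(msa_length), k))


# Toy alignment: three species, six columns; columns 0, 3 and 4 play the
# role of the binding neighborhood.
toy_msa = ["ACDEFG", "ACDEYG", "TCDKYG"]
binding_vector = pairwise_distance_vector(toy_msa, [0, 3, 4])
control_vector = pairwise_distance_vector(toy_msa, random_columns(6, 3))
```

Vectors built this way for the two partners, with the speciation term subtracted, are what step 4 correlates. The control vectors give the baseline that the binding correlation is compared against in step 5.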

Predicting domain-domain interactions: Please ask for reprints (just accepted for publication in the Journal of Molecular Biology)


Kann's Computational Biology lab
Current members:
- Attila Kertesz-Farkas (postdoc)
Grad. students:
- Michael Martin (PhD/Rotation)
- Trevor Suznick (BS/MS, Bioinf)
- Yanan Sun (MS, IS)
- Diana Reginf (MS, IS)
Undergraduates:
- Ivette Santana-Cruz (BS, Bioinf)
- Joy Adewumi (senior, CS)
- Mike Povolotzky (junior, Bioinf)
- Richard Blissett (soph, Bioinf)
- Methzli Rodriguez (soph, CS/Bioinf)
- Asa Adadey (soph, Bioinf)
Past members:
- Brian Bennet (Bioinf)
- Andrew Winder (CS)
- Chris Alexander (Biology)

Kann's Computational Biology lab

CBIG: Computational Biology Interest Group
- CBIG: a forum for exchanging ideas and initiating collaborations between groups, in particular experimental and computational.
- Seminars, events and news related to computational biology.
- Students at all levels, as well as postdocs and faculty members, can subscribe to the e-mail list cbig@lists.umbc.edu
- Coming soon: http://bioinf.umbc.edu/cbig

The End