Overview of Research at Bioinformatics Lab


Li Liao

Develop new algorithms and (statistical) learning methods that help solve biological problems
- Capable of incorporating domain knowledge
- Effective, expressive, interpretable

Motivations
- Understanding correlations between genotype and phenotype
- Predicting genotype <=> phenotype
- Some phenotype examples:
  - Protein function
  - Drug/therapy response
  - Drug-drug interactions for expression
  - Drug mechanism
  - Interacting pathways of metabolism

Bioinformatics in a cell

Credit: Kellis & Indyk

Projects
- Genome sequencing and assembly (funded by NSF)
- Homology detection, protein family classification (funded by a DuPont S&E award)
  - Support Vector Machines
  - Hidden Markov models
  - Graph theoretic methods
- Probabilistic modeling for BioSequence (funded by NIH)
  - HMMs, and beyond
  - Motif finding
  - Secondary structure
- Systems Bioinformatics
  - Prediction of Protein-Protein Interactions
  - Inference of Gene Regulatory Networks
  - Prediction of other regulatory elements
  - Pattern analysis for RNAi (funded by UDRF)
- Comparative Genomics
  - Identify genome features for diagnostic and therapeutic purposes (funded by an Army grant)

People
Current members:
- Roger Craig (PhD student)
- Alvaro Gonzalez (PhD student)
- Kevin McCormick (PhD student)
- Colin Kern (PhD student)
Past members:
- Dr. Wen-Zhong Wang (Postdoc Fellow)
- Robel Kahsay (Ph.D., currently at DuPont Central Research & Development)
- Kishore Narra (M.S., currently at VistaPrint, Inc.)
- Arpita Gandhi (M.S., currently at Colgate-Palmolive Company)
- Gaurav Jain (M.S., currently at Institute of Genomics, Univ. of Maryland)
- Shivakundan Singh Tej (M.S.)
- Tapan Patel (B.S., currently in MD/PhD program at U Penn)
- Laura Shankman (B.S., currently in PhD program at U Virginia)


Hybrid Hierarchical Assembly
- Three types of reads: Sanger (~1000 bp), 454 (~100 bp), and SBS (~30 bp).
- Assembly of individual types using the best-suited assemblers:
  - Phrap, TIGR, etc. for Sanger reads
  - Euler assembler and Newbler for 454 reads
  - Euler short, Shorty for SBS reads
- Hybrid and hierarchical:
  - Use longer reads as scaffolding to resolve repeat regions that are difficult for shorter reads
  - Use contigs from shorter reads (pyrosequencing) as pseudoreads to bridge gaps (nonclonable and hard stops) with Sanger reads.
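The pseudoread idea can be illustrated with a minimal sketch, assuming a contig from a short-read assembly is simply cut into overlapping, Sanger-sized pieces before being passed to a long-read assembler. The function name make_pseudoreads and the read_len and step values are hypothetical and chosen for illustration only; they are not taken from the slides or from the cited publication.

```python
def make_pseudoreads(contig, read_len=1000, step=500):
    """Cut one contig into overlapping pseudoreads of length read_len.

    read_len and step are illustrative assumptions, not values from the slides.
    """
    if len(contig) <= read_len:
        # A short contig is used as a single pseudoread.
        return [contig]
    reads = [contig[i:i + read_len]
             for i in range(0, len(contig) - read_len + 1, step)]
    # Add one final pseudoread if the regular windows miss the contig's tail.
    if (len(contig) - read_len) % step != 0:
        reads.append(contig[-read_len:])
    return reads


if __name__ == "__main__":
    toy_contig = "ACGT" * 3  # 12 bp toy contig
    for read in make_pseudoreads(toy_contig, read_len=5, step=3):
        print(read)
```

Overlapping windows are used here so that adjacent pseudoreads share sequence and can be rejoined by an overlap-based assembler; the amount of overlap would need tuning on real data.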

Major Findings
- Hybrid hierarchical assembly proved to be an effective way of assembling short reads
- An incremental approach to selecting ABI reads is more effective than a random approach in generating high-coverage contigs
- Staged assembly using Phrap is an effective alternative to the proprietary Newbler assembler.
Publications: Gonzalez & Liao, BMC Bioinformatics 2008, 9:102.

Blue lines are contigs generated from hybrid assembly

Detect remote homologues
Attributes:
- Sequence similarity, aggregate statistics (e.g., protein families), pattern/motif, and more attributes (presence in a phylogenetic tree).
How can domain-specific knowledge be incorporated into the model so that a classifier can be more accurate?
Results:
- Quasi-consensus based comparison of profile HMM for protein sequences (Kahsay et al., Bioinformatics 2005)
- Using extended phylogenetic profiles and support vector machines for protein family classification (Narra & Liao, SNPD 04; Craig & Liao, ICMLA 05; Craig & Liao, SAC 06; Craig & Liao, Int'l J. Bioinfo & DM 2007)
- Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships (JCB 2003)

Non-linear mapping to a feature space: x_i -> Φ(x_i), x_j -> Φ(x_j)

\[
L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \, y_i y_j \, \Phi(x_i) \cdot \Phi(x_j)
\]
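A minimal sketch of evaluating an SVM dual objective of this form, where the inner product Φ(x_i) · Φ(x_j) is computed through a kernel function rather than by mapping the points explicitly. The function names, the RBF kernel, and the gamma value are assumptions made for illustration; the slides do not say which kernel was used.

```python
import math


def linear_kernel(u, v):
    """Plain dot product, i.e. Phi is the identity map."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian RBF kernel; gamma is an illustrative choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def dual_objective(alpha, y, X, kernel=linear_kernel):
    """Return sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    n = len(X)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


if __name__ == "__main__":
    X = [(1.0, 0.0), (0.0, 1.0)]  # two toy points
    y = [1, -1]                   # class labels
    alpha = [1.0, 1.0]            # candidate multipliers
    print(dual_objective(alpha, y, X))
    print(dual_objective(alpha, y, X, kernel=rbf_kernel))
```

Training an SVM means maximizing this quantity over alpha subject to the usual constraints; the sketch only evaluates it for a given alpha.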

Data: phylogenetic profiles
- How to account for correlations among profile components?
  - Profile extension (Narra & Liao, SNPD 04)
  - Transductive learning (Craig & Liao, ICMLA 05, SAC 06, IJBDM 2007)
[Figure: example binary profiles x, y, z compared by Hamming distance and by tree-based distance]
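For reference, the Hamming distance between two binary phylogenetic profiles counts the genomes in which exactly one of the two proteins is present. The sketch below uses made-up profiles, not the values from the slide, and the function name hamming is an illustrative choice.

```python
def hamming(p, q):
    """Number of positions at which two equal-length profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    return sum(a != b for a, b in zip(p, q))


if __name__ == "__main__":
    # Hypothetical presence/absence profiles over five genomes.
    x = [0, 1, 1, 1, 1]
    y = [1, 1, 0, 1, 1]
    print(hamming(x, y))
```

Because every position is weighted equally, this measure ignores how the genomes are related to each other, which is the correlation issue raised on this slide.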

Post-order traversal
[Figure: phylogenetic tree with binary values at the leaves and fractional values at the internal nodes (0.33, 0.34, 0.5, 0.55, 0.67, 0.75, 1)]
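One way a post-order traversal can attach values to internal nodes is sketched below. It assumes that each internal node is scored as the average of its children and that these scores are appended to the leaf profile; this scoring rule, the nested-tuple tree encoding, and the name extend_profile are assumptions for illustration and may differ from the method in the cited papers.

```python
def extend_profile(tree, profile):
    """Append internal-node scores, collected in post-order, to a leaf profile.

    tree: nested tuples whose leaves are integer indices into profile.
    Assumption: an internal node's score is the mean of its children's scores.
    """
    internal_scores = []

    def visit(node):
        if isinstance(node, int):  # leaf: look up the profile bit
            return float(profile[node])
        child_scores = [visit(child) for child in node]  # children first
        score = sum(child_scores) / len(child_scores)
        internal_scores.append(score)  # then the node itself
        return score

    visit(tree)
    return list(profile) + internal_scores


if __name__ == "__main__":
    toy_tree = ((0, 1), (2, 3))  # hypothetical four-leaf tree
    toy_profile = [1, 1, 0, 1]
    print(extend_profile(toy_tree, toy_profile))
```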


Sequence Models (HMMs and beyond)
Motivations:
- What is responsible for the function?
  - Patterns/motifs
  - Secondary structure
- To capture long-range correlations of bio sequences
  - Transporter proteins
  - RNA secondary structure
Methods:
- Generative versus discriminative
- Linear dependent processes
- Stochastic grammars
- Model equivalence

TMMOD: An improved hidden Markov model for predicting transmembrane topology (Kahsay, Gao & Liao. Bioinformatics 2005)
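HMM-based topology predictors typically label each residue by decoding the most probable state path. The sketch below is a generic Viterbi decoder on a toy two-state model (M for membrane, L for loop) over a two-letter alphabet (h for hydrophobic, p for polar). The states, alphabet, and all probabilities are invented for illustration and are not the TMMOD architecture or its parameters.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Return the most probable state path for obs (log-space Viterbi)."""
    # scores[t][s]: best log-probability of a path ending in state s at position t
    scores = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    backptrs = []
    for symbol in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            best_prev = max(states,
                            key=lambda r: scores[-1][r] + math.log(trans[r][s]))
            row[s] = (scores[-1][best_prev] + math.log(trans[best_prev][s])
                      + math.log(emit[s][symbol]))
            ptr[s] = best_prev
        scores.append(row)
        backptrs.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy parameters (assumptions, not fitted to any data).
STATES = ("M", "L")
START = {"M": 0.5, "L": 0.5}
TRANS = {"M": {"M": 0.9, "L": 0.1}, "L": {"M": 0.1, "L": 0.9}}
EMIT = {"M": {"h": 0.8, "p": 0.2}, "L": {"h": 0.2, "p": 0.8}}

if __name__ == "__main__":
    print("".join(viterbi("hhhhpppp", STATES, START, TRANS, EMIT)))
```

Working in log space avoids numerical underflow on long sequences; the toy parameters contain no zero probabilities, so no special handling of log(0) is included.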

Model    Reg.  Data set  Correct topology  Correct location  Sensitivity  Specificity
TMMOD 1  (a)   S-83      65 (78.3%)        67 (80.7%)        97.4%        97.4%
TMMOD 1  (b)   S-83      51 (61.4%)        52 (62.7%)        71.3%        71.3%
TMMOD 1  (c)   S-83      64 (77.1%)        65 (78.3%)        97.1%        97.1%
TMMOD 2  (a)   S-83      61 (73.5%)        65 (78.3%)        99.4%        97.4%
TMMOD 2  (b)   S-83      54 (65.1%)        61 (73.5%)        93.8%        71.3%
TMMOD 2  (c)   S-83      54 (65.1%)        66 (79.5%)        99.7%        97.1%
TMMOD 3  (a)   S-83      70 (84.3%)        71 (85.5%)        98.2%        97.4%
TMMOD 3  (b)   S-83      64 (77.1%)        65 (78.3%)        95.3%        71.3%
TMMOD 3  (c)   S-83      74 (89.2%)        74 (89.2%)        99.1%        97.1%
TMHMM          S-83      64 (77.1%)        69 (83.1%)        96.2%        96.2%
PHDtm          S-83      (85.5%)           (88.0%)           98.8%        95.2%
TMMOD 1  (a)   S-160     117 (73.1%)       128 (80.0%)       97.4%        97.0%
TMMOD 1  (b)   S-160     92 (57.5%)        103 (64.4%)       77.4%        80.8%
TMMOD 1  (c)   S-160     117 (73.1%)       126 (78.8%)       96.1%        96.7%
TMMOD 2  (a)   S-160     120 (75.0%)       132 (82.5%)       98.4%        97.2%
TMMOD 2  (b)   S-160     97 (60.6%)        121 (75.6%)       97.7%        95.6%
TMMOD 2  (c)   S-160     118 (73.8%)       135 (84.4%)       98.4%        97.2%
TMMOD 3  (a)   S-160     120 (75.0%)       133 (83.1%)       97.8%        97.6%
TMMOD 3  (b)   S-160     110 (68.8%)       124 (77.5%)       94.5%        98.1%
TMMOD 3  (c)   S-160     135 (84.4%)       143 (89.4%)       98.3%        98.1%
TMHMM          S-160     123 (76.9%)       134 (83.8%)       97.1%        97.7%


Inferring Regulatory Networks from Time Course Expression Data (Gandhi, Cogburn & Liao, 2008)
- Expression profile
- Clustering (K-means)
- Binary heat map
- Boolean network algorithm
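The binarization and Boolean-network steps can be sketched in a much simplified form. The sketch assumes each gene (or cluster average) is thresholded at its own mean and that only single-input rules are considered: a regulator activates a target if the target's state at time t+1 always equals the regulator's state at time t, and represses it if the two always differ. The clustering step is skipped, the expression values are invented, and this is not the algorithm of the cited work.

```python
def binarize(series):
    """Threshold a time series at its own mean: 1 if above, else 0."""
    mean = sum(series) / len(series)
    return [1 if value > mean else 0 for value in series]


def single_input_rules(states):
    """Find regulator -> target pairs consistent with every time step.

    states: dict mapping gene name to a list of 0/1 values over time.
    Returns a list of (regulator, target, kind) tuples.
    """
    rules = []
    for target, target_states in states.items():
        for regulator, reg_states in states.items():
            # Pair the regulator at time t with the target at time t + 1.
            pairs = list(zip(reg_states[:-1], target_states[1:]))
            if all(r == t for r, t in pairs):
                rules.append((regulator, target, "activates"))
            elif all(r != t for r, t in pairs):
                rules.append((regulator, target, "represses"))
    return rules


if __name__ == "__main__":
    # Hypothetical expression levels at five time points.
    expression = {
        "A": [0.9, 0.8, 0.1, 0.2, 0.9],
        "B": [0.2, 0.7, 0.9, 0.1, 0.3],
    }
    binary = {gene: binarize(values) for gene, values in expression.items()}
    for rule in single_input_rules(binary):
        print(rule)
```

With so few time points, many spurious rules can be consistent with the data, so a real analysis would need longer series or additional constraints.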

Genomic Comparison of Bacterial Species Based on Metabolic Characteristics (Jain, Wang, Boyd & Liao, ISIBM 2009)