CGS 5991 (2 Credits) Bioinformatics Tools

Similar documents
Giri Narasimhan. CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Evaluation. Course Homepage.

BSC 4934: QʼBIC Capstone Workshop" Giri Narasimhan. ECS 254A; Phone: x3748

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

Pairwise sequence alignment

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools

Computational Structural Bioinformatics

Introduction to Bioinformatics. Shifra Ben-Dor Irit Orr

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009

Related Courses He who asks is a fool for five minutes, but he who does not ask remains a fool forever.

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

11/24/13. Science, then, and now. Computational Structural Bioinformatics. Learning curve. ECS129 Instructor: Patrice Koehl

Procedure to Create NCBI KOGS

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

ECOL/MCB 320 and 320H Genetics

String Matching Problem

Three types of RNA polymerase in eukaryotic nuclei

Comparative genomics: Overview & Tools + MUMmer algorithm

Pyrobayes: an improved base caller for SNP discovery in pyrosequences

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p

Welcome to BIOL 572: Recombinant DNA techniques

BIOINFORMATICS LAB AP BIOLOGY

Computational Biology: Basics & Interesting Problems

Chapter 15 Active Reading Guide Regulation of Gene Expression

The Eukaryotic Genome and Its Expression. The Eukaryotic Genome and Its Expression. A. The Eukaryotic Genome. Lecture Series 11

Overview of I519 & Introduction to Bioinformatics. Yuzhen Ye School of Informatics and Computing, IUB

Small RNA in rice genome

Reading Assignments. A. Genes and the Synthesis of Polypeptides. Lecture Series 7 From DNA to Protein: Genotype to Phenotype

GCD3033:Cell Biology. Transcription

Flow of Genetic Information

Chapter 18 Active Reading Guide Genomes and Their Evolution

Introduction to Bioinformatics Integrated Science, 11/9/05

Bioinformatics Chapter 1. Introduction

Microbiome: 16S rrna Sequencing 3/30/2018

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Biology Tutorial. Aarti Balasubramani Anusha Bharadwaj Massa Shoura Stefan Giovan

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Genomes and Their Evolution

Introduction to Bioinformatics

Motivating the need for optimal sequence alignments...

Early History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species

Algorithms in Computational Biology (236522) spring 2008 Lecture #1

Pairwise alignment. 2.1 Introduction GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL

CHAPTER 13 PROKARYOTE GENES: E. COLI LAC OPERON

Bacterial Genetics & Operons

Frequently Asked Questions (FAQs)

Chapter 17. From Gene to Protein. Biology Kevin Dees

Bioinformatics Exercises

From Gene to Protein

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

MCB142/ICB163 Overview. Instructors:

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

Chapter 20. Initiation of transcription. Eukaryotic transcription initiation

Genomics and bioinformatics summary. Finding genes -- computer searches

Hands-On Nine The PAX6 Gene and Protein

The Gene The gene; Genes Genes Allele;

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

Virginia Western Community College BIO 101 General Biology I

Regulation of gene Expression in Prokaryotes & Eukaryotes

3.B.1 Gene Regulation. Gene regulation results in differential gene expression, leading to cell specialization.

Advanced Algorithms and Models for Computational Biology

SCOTCAT Credits: 20 SCQF Level 7 Semester 1 Academic year: 2018/ am, Practical classes one per week pm Mon, Tue, or Wed

CISC 889 Bioinformatics (Spring 2004) Lecture 1

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Supplemental Materials

Algorithmics and Bioinformatics

Lecture Materials are available on the 321 web site

GSBHSRSBRSRRk IZTI/^Q. LlML. I Iv^O IV I I I FROM GENES TO GENOMES ^^^H*" ^^^^J*^ ill! BQPIP. illt. goidbkc. itip31. li4»twlil FIFTH EDITION

Molecular Biology - Translation of RNA to make Protein *

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

microrna Studies Chen-Hanson Ting SVFIG June 23, 2018

AP Bio Module 16: Bacterial Genetics and Operons, Student Learning Guide

1. In most cases, genes code for and it is that

2 Pairwise alignment. 2.1 References. 2.2 Importance of sequence alignment. Introduction to the pairwise sequence alignment problem.

Sequencing alignment Ameer Effat M. Elfarash

Regulation of Gene Expression

PLNT2530 (2018) Unit 5 Genomes: Organization and Comparisons

BIOLOGY 146 Course Layout

G4120: Introduction to Computational Biology

Lecture 18 June 2 nd, Gene Expression Regulation Mutations

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

RNA Synthesis and Processing

13.4 Gene Regulation and Expression

Quantitative Measurement of Genome-wide Protein Domain Co-occurrence of Transcription Factors

Translation Part 2 of Protein Synthesis

GATA family of transcription factors of vertebrates: phylogenetics and chromosomal synteny

Unit 5: Cell Division and Development Guided Reading Questions (45 pts total)

Introduction. Gene expression is the combined process of :

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Comparative Bioinformatics Midterm II Fall 2004

Principles of Genetics

Honors Biology Reading Guide Chapter 11

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA

UNIT 5. Protein Synthesis 11/22/16

Transcription Regulation and Gene Expression in Eukaryotes FS08 Pharmacenter/Biocenter Auditorium 1 Wednesdays 16h15-18h00.

2012 Univ Aguilera Lecture. Introduction to Molecular and Cell Biology

Name: SBI 4U. Gene Expression Quiz. Overall Expectation:

R.S. Kittrell Biology Wk 10. Date Skill Plan

Transcription:

CAP 5991 (3 Credits) Introduction to Bioinformatics CGS 5991 (2 Credits) Bioinformatics Tools Giri Narasimhan 8/26/03 CAP/CGS 5991: Lecture 1 1

Course Schedules CAP 5991 (3 credit) will meet every Tue from 11AM to 1:45PM. CGS 5991 (2 credit) will meet every Tue from 11AM to about 1PM. This course is not for CS students. Different exams and evaluation. 8/26/03 CAP/CGS 5991: Lecture 1 2

CAP/CGS 5991 Introduction to Bioinformatics Preliminaries Sequence Alignment Multiple Sequence Alignment Phylogenetic Analysis Molecular Structure Analysis Gene Recognition Genomics, Functional Genomics Proteomics Pattern Discovery Techniques Programming Environments: BioPerl Overview of Course Databases and Software Packages Statistics for Bioinformatics Sequencing and Mapping Computational Learning Methods HMM, NN, SOM, SVM, GA Computational Predictive Methods Microarray Data Analysis Digital Image Analysis Protein Structure Analysis: SPDBV Emerging Biotechnologies 8/26/03 CAP/CGS 5991: Lecture 1 3

CAP/CGS 5991 Software Packages Databases and Software Packages (GenBank, SWISS-PROT) Programming Environments (BioPerl) Sequence Alignment & Multiple Sequence Alignment (BLAST, CLUSTALW, CLUSTALX) Phylogenetic Analysis (CLUSTALW, Phylip, PAUP, PAML) Learning Methods (HMMPro, GeneCluster, ASOM) Pattern Discovery Techniques (GYM, TEIRESIAS, APRIORI) Molecular Structure Analysis (DALI, RASMOL, SPDBV) Microarray Analysis (CLUSTER, SAM, GeneCluster, TreeView) Statistical Software Packages (SAS, R) 8/26/03 CAP/CGS 5991: Lecture 1 4

Evaluation Semester Project (50 %) Homework Assignments (20 %) Exams (20 %) Class Participation (10 %) 8/26/03 CAP/CGS 5991: Lecture 1 5

Reading List and Schedule Watch that Course Home Page! http://www.cs.fiu.edu/~giri/teach/5991f03.html 8/26/03 CAP/CGS 5991: Lecture 1 6

1. What is Bioinformatics? Introduction Analysis of biological data with computing & statistical tools. 2. The different aspects of Informatics: Data Management (Database Technology, Internet Programming) Analysis/Interpretation of Data (Data Mining, Modeling, Statistical Tools) Development of Algorithms/ Data Structures Visualization and Interface Design (HCI, Graphics) 3. How to assist biological research? propose new models or correlations based on data from experiments verify a proposed model using known data propose new experiments based on model or analysis use predicted information to narrow down search in a biological investigation 8/26/03 CAP/CGS 5991: Lecture 1 7

Overall Goals DNA Sequence Gene Protein Structure Gene Networks Function Metabolic Pathways 8/26/03 CAP/CGS 5991: Lecture 1 8

General Information Over 20 billion bases in the NCBI database: http://www.ncbi.nlm.nih.gov Human Genome has ~3 billion bp with 30,000+ genes. Viruses have 300bp to 300Kb (1 st one in 1978: Simian virus; 5Kb). 86 complete microbial genomes sequenced. Number of whole genomes sequenced is over 100 (not including over 1000 viruses), including: Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster, Saccharomyces cerevisiae, Chromosomal maps for many organisms including: Mus musculus, Homo sapiens, Danio rerio, Zea mays, Oryza sativa Swiss-Prot has over 132000 protein sequences. 8/26/03 CAP/CGS 5991: Lecture 1 9

Genome Sizes Organism Size Date Est. # genes HIV type 1 10 Kb H. influenzae 1.8 Mb 1995 1,740 E. coli 4.7 Mb 1997 4,000 S. cerevisiae 12.1 Mb 1996 6,034 C. elegans 97 Mb 1998 19,099 A. thaliana 100 Mb 2000 25,000 D. melanogaster 180 Mb 2000 13,061 M. musculus 3 Gb 2002 ~30,000 H. sapiens 3 Gb 2001 30,000+ 8/26/03 CAP/CGS 5991: Lecture 1 10

Caenorhabditis Elegans Entire genome - 1998 1 st animal; 26 th organism 8 year effort 97 million bases 19,099 genes 402 gene clusters 12,000 genes with known function thousands of mutants 7000 families of repeats Multicellular organism Nematode (phylum) Easy to experiment with Easily observable 959 cells 302 nerve cells 36% of proteins common w/ human 8/26/03 CAP/CGS 5991: Lecture 1 11

Homo sapiens 15 year effort, 3 billion bases, 100,000 gaps Variable density of: Genes, SNPs, CpG islands, recombination rates ~1.1% of the genome codes for proteins ~ 40-48 % of the genome consists of repeat sequences ~10 % of the genome consists of repeats called ALUs ~5 % of the genome consists of long repeats (>1 Kb) ~ 50 transposon-derived genes 223 genes common with bacteria that are missing from worm, fly or yeast. 8/26/03 CAP/CGS 5991: Lecture 1 12

(Approximate) String Matching Input: Text T, Pattern P Question(s): Does P occur in T? Find one occurrence of P in T. Find all occurrences of P in T. Count # of occurrences of P in T. Find longest substring of P in T. Find closest substring of P in T. Locate direct repeats of P in T. Many More variants Applications: Is P already in the database T? Locate P in T. Can P be used as a primer for T? Is P homologous to anything in T? Has P been contaminated by T? Is prefix(p) = suffix(t)? Locate tandem repeats of P in T. 8/26/03 CAP/CGS 5991: Lecture 1 13

The Suffix Tree Data Structure Borrelia burgdorferi: 1 million bases Shotgun Sequencing: 4612 fragments 2 million bases long totally Using suffix trees - 15 min for Fragment Assembly Using Dynamic Programming - 10 days 8/26/03 CAP/CGS 5991: Lecture 1 14

Repeats in DNA Sequences Genomic Imprinting: Some genes are expressed only when inherited from one specific parent. 16 such genes are known; 5 inherited from mother; rest from father. These 16 genes have a lot of repeats. Repeats are of size 25 to 120 bp and of total length 1500. The repeats are unique to these imprinted regions. They have no obvious homolgy to each other or to other highly repetitive mammalian sequences. Repeats are also known to be responsible for several genetic diseases: Fragile X, Huntington s disease, Kennedy s disease, myotonic dystrophy, ataxia. 8/26/03 CAP/CGS 5991: Lecture 1 15

Drosophila Eyeless vs. Human Aniridia 24 IERLPSLEDMAHKGHSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSN 83 I R P+ M + HSGVNQLGGVFV GRPLPDSTRQKIVELAHSGARPCDISRILQVSN 17 IPRPPARASMQNS-HSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSN 75 84 GCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQEN 143 GCVSKILGRYYETGSIRPRAIGGSKPRVAT EVVSKI+QYKRECPSIFAWEIRDRLL E 76 GCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEG 135 144 VCTNDNIPSVSSINRVLRNLAAQKEQ 169 VCTNDNIPSVSSINRVLRNLA++K+Q 136 VCTNDNIPSVSSINRVLRNLASEKQQ 161 398 TEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQV 457 +++ Q RL LKRKLQRNRTSFT +QI++LEKEFERTHYPDVFARERLA KI LPEARIQV 222 SDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQV 281 458 WFSNRRAKWRREEKLRNQRR 477 WFSNRRAKWRREEKLRNQRR 282 WFSNRRAKWRREEKLRNQRR 301 8/26/03 CAP/CGS 5991: Lecture 1 16

Sequence Alignment HBA_HUMAN HBB_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL HBA_HUMAN LGB2_LUPLU GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL ++ ++++H+ KV + +A ++ +L+ L+++H+ K NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG HBA_HUMAN F11G11.2 GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL GS+ + G + +D L ++ H+ D+ A +AL D ++AH+ GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE HBA_HUMAN: Human Alpha Globin HBB_HUMAN: Human Beta Globin F11G11.2 : Leghaemoglobin from yellow lupin Needleman-Wunsch Smith-Waterman 8/26/03 CAP/CGS 5991: Lecture 1 17

Sequence Alignment Input: Sequence A, Sequence B, Database D Question(s): AlignA and B. Determine similarity (A, B). AlignA and D. Find sequence in D with maximum similarity to A. Many More variants 8/26/03 CAP/CGS 5991: Lecture 1 18

Molecular Biology Background 8/26/03 CAP/CGS 5991: Lecture 1 19

The 2 star players Protein DNA 8/26/03 CAP/CGS 5991: Lecture 1 20

The Players DNA String with alphabet {A, C, G, T} Nucleotides/Bases RNA String with alphabet {A, C, G, U} Bases Protein String with 20-letter alphabet Amino acids/residues 8/26/03 CAP/CGS 5991: Lecture 1 21

Central Dogma DNA acts as a template to replicate itself. DNA is transscribed into RNA. RNA is translated into Protein. DNA RNA Protein 8/26/03 CAP/CGS 5991: Lecture 1 22

Chromosomes 8/26/03 CAP/CGS 5991: Lecture 1 23

Chromosomes 8/26/03 CAP/CGS 5991: Lecture 1 24

DNA Molecule 8/26/03 CAP/CGS 5991: Lecture 1 25

DNA Complementary Bases 8/26/03 CAP/CGS 5991: Lecture 1 26

Proteins Amino acids 8/26/03 CAP/CGS 5991: Lecture 1 27

RNA 8/26/03 CAP/CGS 5991: Lecture 1 28

The Genetic Code 8/26/03 CAP/CGS 5991: Lecture 1 29

8/26/03 CAP/CGS 5991: Lecture 1 30

DNA RNA Protein 8/26/03 CAP/CGS 5991: Lecture 1 31

8/26/03 CAP/CGS 5991: Lecture 1 32

8/26/03 CAP/CGS 5991: Lecture 1 33

DNA Transcription 8/26/03 CAP/CGS 5991: Lecture 1 34

Transcription Initiation 8/26/03 CAP/CGS 5991: Lecture 1 35

Transcription 8/26/03 CAP/CGS 5991: Lecture 1 36

Transcription Steps RNA polymerase needs many transcription factors (TFIIA,TFIIB, etc.) (A) The promoter sequence (TATA box) is located 25 nucleotides away from transcription initiation site. (B) The TATA box is recognized and bound by transcription factor TFIID, which then enables the adjacent binding of TFIIB. DNA is somewhat distorted in the process. (D) The rest of the general transcription factors as well as the RNA polymerase itself assemble at the promoter. What order? (E) TFIIH then uses ATP to phosphorylate RNA polymerase II, changing its conformation so that the polymerase is released from the complex and is able to start transcribing. As shown, the site of phosphorylation is a long polypeptide tail that extends from the polymerase molecule. 8/26/03 CAP/CGS 5991: Lecture 1 37

Transcription Factors The general transcription factors have been highly conserved in evolution; some of those from human cells can be replaced in biochemical experiments by the corresponding factors from simple yeasts. 8/26/03 CAP/CGS 5991: Lecture 1 38

Protein Synthesis 8/26/03 CAP/CGS 5991: Lecture 1 39

Protein Synthesis: Incorporation of amino acid into protein 8/26/03 CAP/CGS 5991: Lecture 1 40

8/26/03 CAP/CGS 5991: Lecture 1 41