-max_target_seqs: maximum number of targets to report

Similar documents
Bioinformatics methods COMPUTATIONAL WORKFLOW

Functional Annotation

Genome Annotation. Qi Sun Bioinformatics Facility Cornell University

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

We have: We will: Assembled six genomes Made predictions of most likely gene locations. Add a layers of biological meaning to the sequences

EBI web resources II: Ensembl and InterPro

Week 10: Homology Modelling (II) - HHpred

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

CSCE555 Bioinformatics. Protein Function Annotation

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Christian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel

Protein function prediction based on sequence analysis

GEP Annotation Report

Introduction to Bioinformatics

Motifs, Profiles and Domains. Michael Tress Protein Design Group Centro Nacional de Biotecnología, CSIC

Genome Annotation Project Presentation

SUPPLEMENTARY INFORMATION

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

BLAST. Varieties of BLAST

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster.

Amino Acid Structures from Klug & Cummings. 10/7/2003 CAP/CGS 5991: Lecture 7 1

Hidden Markov Models (HMMs) and Profiles

Gene function annotation

Amino Acid Structures from Klug & Cummings. Bioinformatics (Lec 12)

RGP finder: prediction of Genomic Islands

Lecture 2. The Blast2GO annotation framework

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

CS612 - Algorithms in Bioinformatics

functional annotation preliminary results

Patterns and profiles applications of multiple alignments. Tore Samuelsson March 2013

Supplemental Materials

Bioinformatics. Proteins II. - Pattern, Profile, & Structure Database Searching. Robert Latek, Ph.D. Bioinformatics, Biocomputing

Structure to Function. Molecular Bioinformatics, X3, 2006

In Silico Identification and Characterization of Effector Catalogs

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

objective functions...

Sequence analysis and comparison

Large-Scale Genomic Surveys

Genomics and bioinformatics summary. Finding genes -- computer searches

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Multiple sequence alignment

Sequence Alignment Techniques and Their Uses

Mitochondrial Genome Annotation

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

EECS730: Introduction to Bioinformatics

Comparative Bioinformatics Midterm II Fall 2004

Computational Biology: Basics & Interesting Problems

Introductory course on Multiple Sequence Alignment Part I: Theoretical foundations

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

Bioinformatics. Dept. of Computational Biology & Bioinformatics

In-Silico Approach for Hypothetical Protein Function Prediction

EBI web resources II: Ensembl and InterPro

Meiothermus ruber Genome Analysis Project

Some Problems from Enzyme Families

Hands-On Nine The PAX6 Gene and Protein

1-D Predictions. Prediction of local features: Secondary structure & surface exposure

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

Introduction to Bioinformatics

Annotation of Drosophila grimashawi Contig12

A Protein Ontology from Large-scale Textmining?

Networks & pathways. Hedi Peterson MTAT Bioinformatics

GO ID GO term Number of members GO: translation 225 GO: nucleosome 50 GO: calcium ion binding 76 GO: structural

Functional Annotation & Comparative Genomics. Lu Wang, Georgia Tech

Computational methods for the analysis of bacterial gene regulation Brouwer, Rutger Wubbe Willem

Homology and Information Gathering and Domain Annotation for Proteins

Protein bioinforma-cs. Åsa Björklund CMB/LICR

Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space

Predicting Protein Functions and Domain Interactions from Protein Interactions

Algorithms in Bioinformatics

Computational Molecular Biology (

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

Protein Structure Prediction Using Neural Networks

Supplementary Information

Bioinformatics for Biologists

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Database update 3PFDB+: improved search protocol and update for the identification of representatives of protein sequence domain families

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics

Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource

FUNCTION ANNOTATION PRELIMINARY RESULTS

CAP 5510 Lecture 3 Protein Structures

Analysis and visualization of protein-protein interactions. Olga Vitek Assistant Professor Statistics and Computer Science

Basic Local Alignment Search Tool

Outline. Terminologies and Ontologies. Communication and Computation. Communication. Outline. Terminologies and Vocabularies.

Protein Structures. Sequences of amino acid residues 20 different amino acids. Quaternary. Primary. Tertiary. Secondary. 10/8/2002 Lecture 12 1

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

Bioinformatics and BLAST

Intro Secondary structure Transmembrane proteins Function End. Last time. Domains Hidden Markov Models

Intro Protein structure Motifs Motif databases End. Last time. Probability based methods How find a good root? Reliability Reconciliation analysis

Tutorial 4 Substitution matrices and PSI-BLAST

BMD645. Integration of Omics

Comparative genomics: Overview & Tools + MUMmer algorithm

Gene Ontology and overrepresentation analysis

RNA & PROTEIN SYNTHESIS. Making Proteins Using Directions From DNA

Today. Last time. Secondary structure Transmembrane proteins. Domains Hidden Markov Models. Structure prediction. Secondary structure

Bioinformatics Exercises

Similarity searching summary (2)

Transcription:

Review of exercise 1 tblastn -num_threads 2 -db contig -query DH10B.fasta -out blastout.xls -evalue 1e-10 -outfmt "6 qseqid sseqid qstart qend sstart send length nident pident evalue" Other options: -max_target_seqs: maximum number of targets to report -perc_identity: percentage identity cutoff -task blastn-short : for short queries Output format: -outfmt "6 col1 col2 col3 std: commonly used 16 columns stitle: description line of the target sequence

Translate RNA sequence to Protein sequence 1. 6-frame translation. 2. ORF detection tool to find the correct frame ORF length HMM model trained on a set of genuine proteins (from the same species to capture the correct codon bias signal) BLAST or PFAM scan

TransDecoder an ORF finder tool of the Trinity package TransDecoder -t transcript.fasta Other parameters: -S: only analyze top strand Training the HMM --train training.fasta : a set of high confidence transcripts --cd_hit_est: path to CD-hit-est tool, a clustering tool to produce a nonredundant protein set -G: genetic code Output: --retain_long_orfs 900: all ORF longer than 900 nt will be retained -m 100: minimum protein length

The ORF prediction can refined by homology protein search blastp -query transdecoder_dir/longest_orfs.pep -db uniprot_sprot.fasta -max_target_seqs 1 -outfmt 6 -evalue 1e-5 - num_threads 10 > blastp.outfmt6 hmmscan --cpu 8 --domtblout pfam.domtblout /path/to/pfam-a.hmm transdecoder_dir/longest_orfs.pep TransDecoder.Predict -t target_transcripts.fasta -- retain_pfam_hits pfam.domtblout --retain_blastp_hits blastp.outfmt6

HMMs are trained from a multiple sequence alignment

Hidden Markov Model (HMM) is more general than PSSM Pagni, et al. EMBnetCourse 2003

Match a sequence to a model Application: Function Prediction Pagni, et al. EMBnetCourse 2003

>unknown_protein MALLYRRMSMLLNIILAYIFLCAICVQGSVKQEWAEIGKNVSLECASENEAVAWKLGNQTINKNHTRYKI RTEPLKSNDDGSENNDSQDFIKYKNVLALLDVNIKDSGNYTCTAQTGQNHSTEFQVRPYLPSKVLQSTPD RIKRKIKQDVMLYCLIEMYPQNETTNRNLKWLKDGSQFEFLDTFSSISKLNDTHLNFTLEFTEVYKKENG TYKCTVFDDTGLEITSKEITLFVMEVPQVSIDFAKAVGANKIYLNWTVNDGNDPIQKFFITLQEAGTPTF TYHKDFINGSHTSYILDHFKPNTTYFLRIVGKNSIGNGQPTQYPQGITTLSYDPIFIPKVETTGSTASTI TIGWNPPPPDLIDYIQYYELIVSESGEVPKVIEEAIYQQNSRNLPYMFDKLKTATDYEFRVRACSDLTKT CGPWSENVNGTTMDGVATKPTNLSIQCHHDNVTRGNSIAINWDVPKTPNGKVVSYLIHLLGNPMSTVDRE MWGPKIRRIDEPHHKTLYESVSPNTNYTVTVSAITRHKKNGEPATGSCLMPVSTPDAIGRTMWSKVNLDS KYVLKLYLPKISERNGPICCYRLYLVRINNDNKELPDPEKLNIATYQEVHSDNVTRSSAYIAEMISSKYF RPEIFLGDEKRFSENNDIIRDNDEICRKCLEGTPFLRKPEIIHIPPQGSLSNSDSELPILSEKDNLIKGA NLTEHALKILESKLRDKRNAVTSDENPILSAVNPNVPLHDSSRDVFDGEIDINSNYTGFLEIIVRDRNNA LMAYSKYFDIITPATEAEPIQSLNNMDYYLSIGVKAGAVLLGVILVFIVLWVFHHKKTKNELQGEDTLTL RDSLRALFGRRNHNHSHFITSGNHKGFDAGPIHRLDLENAYKNRHKDTDYGFLREYEMLPNRFSDRTTKN SDLKENACKNRYPDIKAYDQTRVKLAVINGLQTTDYINANFVIGYKERKKFICAQGPMESTIDDFWRMIW EQHLEIIVMLTNLEEYNKAKCAKYWPEKVFDTKQFGDILVKFAQERKTGDYIERTLNVSKNKANVGEEED RRQITQYHYLTWKDFMAPEHPHGIIKFIRQINSVYSLQRGPILVHCSAGVGRTGTLVALDSLIQQLEEED SVSIYNTVCDLRHQRNFLVQSLKQYIFLYRALLDTGTFGNTDICIDTMASAIESLKRKPNEGKCKLEVEF EKLLATADEISKSCSVGENEENNMKNRSQEIIPYDRNRVILTPLPMRENSTYINASFIEGYDNSETFIIA QDPLENTIGDFWRMISEQSVTTLVMISEIGDGPRKCPRYWADDEVQYDHILVKYVHSESCPYYTRREFYV TNCKIDDTLKVTQFQYNGWPTVDGEVPEVCRGIIELVDQAYNHYKNNKNSGCRSPLTVHCSLGTDRSSIF VAMCILVQHLRLEKCVDICATTRKLRSQRTGLINSYAQYEFLHRAIINYSDLHHIAESTLD PFAM a pre-constructed HMM model database for protein function domain prediction http://pfam.sanger.ac.uk/

Other prediction tools SignalP: Predicting Signal Peptide

Other prediction tools TMHMM: trans-membrane proteins

Summary RNA -> Protein: 1. 6-frame/3-frame translation translation 2. ORF detection tool, e.g. TransDecoder BLAST: PSSM (position specific scoring matrix) PFAM (a collection of HMM models) SignalP TMHMM Others Homologs in other species Protein domain with known function Various protein features, eg. Transmembrane, signal peptide Gene ontology

Gene Ontology

Gene Ontology GRMZM2G035341 molecular_function GO:0008270 zinc ion binding GRMZM2G035341 molecular_function GO:0046872 metal ion binding GRMZM2G035341 cellular_component GO:0005622 intracellular GRMZM2G035341 cellular_component GO:0019005 SCF ubiquitin ligase complex GRMZM2G035341 biological_process GO:0009733 response to auxin GRMZM2G047813 molecular_function GO:0003677 DNA binding GRMZM2G047813 cellular_component GO:0005634 nucleus GRMZM2G047813 cellular_component GO:0005694 chromosome GRMZM2G047813 biological_process GO:0006259 DNA metabolic process cellular nitrogen compound GRMZM2G047813 biological_process GO:0034641 metabolic process

GO->GOSLIM GRMZM2G035341 molecular_function GO:0008270 zinc ion binding GRMZM2G035341 molecular_function GO:0046872 metal ion binding GRMZM2G035341 cellular_component GO:0005622 intracellular GRMZM2G035341 cellular_component GO:0019005 SCF ubiquitin ligase complex GO category distribution GRMZM2G035341 biological_process GO:0009733 response to auxin GRMZM2G047813 molecular_function GO:0003677 DNA binding GRMZM2G047813 cellular_component GO:0005634 nucleus GRMZM2G047813 cellular_component GO:0005694 chromosome GRMZM2G047813 biological_process GO:0006259 DNA metabolic process cellular nitrogen compound GRMZM2G047813 biological_process GO:0034641 metabolic process

The Necessity for GO Slim

The Necessity for GO Slim GO Slim To download premade GO Slim: http://www.geneontology.org/go.slims.shtml Create your own GO Slim: http://oboedit.org/docs/html/creating_your_own_go_slim_in_obo_edit.htm

High throughput gene function prediction BLAST2GO Function prediction based on BLAST match to known proteins. http://www.blast2go.com Interproscan Function prediction mostly based on PFAM, PSI-BLAST and other motif scaning tools. http://www.ebi.ac.uk/interpro/ Trinotate Function prediction based on BLAST, PFAM, SignalP, TMHMM, RNAmmer http://trinotate.github.io/

InterProScan

InterProScan Program Name Description Abbreviation BlastProDom Scans the families in the ProDomdatabase. ProDomis a comprehensive set of protein ProDom domain families automatically generated from the UniProtKB/Swiss-Protand UniProtKB/TrEMBLsequence databases using psi-blast. FPrintScan Scans against the fingerprints in the PRINTS database. These fingerprints are groups of PRINTS motifs that together are more potent than single motifs by making use of the biological context inherent in a multiple motif method. HMMPIR Scans the hidden markov models (HMMs) that are present in the PIR Protein Sequence PIRSF Database (PSD) of functionally annotated protein sequences, PIR-PSD. HMMPfam Scans the hidden markov models (HMMs) that are present in the PFAM Protein PfamA families database. HMMSmart Scans the hidden markov models (HMMs) that are present in the SMART SMART domain/domain families database. HMMTigr Scans the hidden markov models (HMMs) that are present in the TIGRFAMs protein families database. TIGRFAM ProfileScan Scans against PROSITE profiles. These profiles are based on weight matrices and are PrositeProfiles more sensitive for the detection of divergent protein families. HAMAP Scans against HAMAP profiles. These profiles are based on weight matrices and are more sensitive for the detection of divergent bacterial, archaeal and plastid-encoded protein families. HAMAP PatternScan PatternScan is a new version of the PROSITE pattern search software which uses new PrositePatterns code developed by the PROSITE team. SuperFamily SUPERFAMILY is a library of profile hidden Markov models that represent all proteins SuperFamily of known structure. SignalPHMM SignalP TMHMM TMHMM HMMPanther Panther Gene3D Gene3d Phobius Phobius Coils Coils

Trinotate(Trinity package) 1. Predict open-reading-frame(dna -> protein sequence) TransDecoder-t Trinity.fasta--workdirtransDecoder-S 2. Predict gene function from sequences Trinotate package BLAST Uniprot: homologs in known proteins PFAM: protein domain) SignalP: signal peptide) TMHMM: trans-membrane domain RNAMMER: rrna Output: go_annotations.txt (Gene Ontology annotation file) Step-by-step guidance: https://cbsu.tc.cornell.edu/lab/userguide.aspx?a=software&i=143#c

BLAST2GO Annotation Steps BLAST: BLAST against NCBI nr or Swissprot database; Recommended Mapping:Retrieve GO from annotated homologous genes; Annotation: Assign GO terms to query sequences. InterProScan(optional): Integrate with InterProScan results.

BLAST2GO Do each steps separately. BLAST step BLAST2GO step Run Command line BLASTX on a BioHPC computer BioHPC Lab computer through VNC

Using BRC Bioinformatics Facility Resource 1. Office hour 1pm to 3pm every Monday, 618 Rhodes Hall Signup at: http://cbsu.tc.cornell.edu/lab/office1.aspx 2. Step-by-step instruction using software on BioHPC computers. Software page: http://cbsu.tc.cornell.edu/lab/labsoftware.aspx BLAST2GO page: http://cbsu.tc.cornell.edu/lab/doc/instruction_blast2go.htm