String Matching Problem

Similar documents
Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Sequence analysis and Genomics

Algorithms in Bioinformatics

String Search. 6th September 2018

INF 4130 / /8-2017

Motivating the need for optimal sequence alignments...

Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p.

Lecture 4: September 19

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Lecture 2: Pairwise Alignment. CG Ron Shamir

INF 4130 / /8-2014

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Module 9: Tries and String Matching

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Local Alignment: Smith-Waterman algorithm

Analysis and Design of Algorithms Dynamic Programming

BLAST: Basic Local Alignment Search Tool

Pattern Matching (Exact Matching) Overview

Pairwise sequence alignment

Single alignment: Substitution Matrix. 16 march 2017

Searching Sear ( Sub- (Sub )Strings Ulf Leser

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Genomics and bioinformatics summary. Finding genes -- computer searches

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Computational Biology

CGS 5991 (2 Credits) Bioinformatics Tools

Bioinformatics for Biologists

Sequence analysis and comparison

Large-Scale Genomic Surveys

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

In-Depth Assessment of Local Sequence Alignment

Sequence Database Search Techniques I: Blast and PatternHunter tools

Practical considerations of working with sequencing data

Lecture 5: September Time Complexity Analysis of Local Alignment

Introduction to Bioinformatics

Bioinformatics and BLAST

Bio nformatics. Lecture 3. Saad Mneimneh

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

Lecture 5,6 Local sequence alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Sequence Alignment (chapter 6)

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:

Algorithms in Bioinformatics: A Practical Introduction. Sequence Similarity

15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Moreover, the circular logic

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

An Introduction to Sequence Similarity ( Homology ) Searching

Sequence Comparison. mouse human

Collected Works of Charles Dickens

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009

CSE182-L7. Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding CSE182

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Tools and Algorithms in Bioinformatics

1.5 Sequence alignment

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

EECS730: Introduction to Bioinformatics

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Tools and Algorithms in Bioinformatics

Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties

Evolution. CT Amemiya et al. Nature 496, (2013) doi: /nature12027

Copyright 2000 N. AYDIN. All rights reserved. 1

2. Exact String Matching

Introduction to Bioinformatics

Non-parametric tests, part A:

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Pairwise & Multiple sequence alignments

Pairwise sequence alignments

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17

EECS730: Introduction to Bioinformatics

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Pairwise alignment. 2.1 Introduction GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

Alignment & BLAST. By: Hadi Mozafari KUMS

Sequence Alignment Techniques and Their Uses

All three must be approved Deadlines around: 21. sept, 26. okt, and 16. nov

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

Contents. Acknowledgments. xix

Comparative genomics: Overview & Tools + MUMmer algorithm

STATISTICS ( CODE NO. 08 ) PAPER I PART - I

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1

IDL Advanced Math & Stats Module

Linear-Space Alignment

Heuristic Alignment and Searching

Rank-Based Methods. Lukas Meier

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Evaluation. Course Homepage.

On-line String Matching in Highly Similar DNA Sequences

BLAST: Target frequencies and information content Dannie Durand

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia

Whole Genome Alignments and Synteny Maps

Transcription:

String Matching Problem Pattern P Text T Set of Locations L 9/2/23 CAP/CGS 5991: Lecture 2

Computer Science Fundamentals Specify an input-output description of the problem. Design a conceptual algorithm and analyze it. Design data structures to refine the algorithm. Write the program in parts and test the parts separately. 9/2/23 CAP/CGS 5991: Lecture 2

Input: Output: Text T; Pattern P All occurrences of P in T. Methods: Naïve Method O(mn) time Rabin-Karp Method O(mn) time; Fast on average. FSA-based method O(n+mA) time Knuth-Morris-Pratt algorithm O(n+m) time Boyer-Moore O(mn) time; Very fast on average. Suffix Tree method; O(m+n) time Shift-And method; Fast on average; Bit operations. 9/2/23 CAP/CGS 5991: Lecture 2

Evolution of Data Structures Complex problems often required complex data structures. Simple data types: Lists. Applications of lists include: students roster, voters list, grocery list, list of transactions, Array implementation of a list. Advantage random access. Need for list operations arose Static vs. dynamic lists. Storing items in a list vs. Maintaining items in a list. Lot of research on Sorting and Searching resulted. Inserting in a specified location in a list caused the following evolution: Array implementation Linked list implementation. Other linear structures e.g., stacks, queues, etc. 9/2/23 CAP/CGS 5991: Lecture 2

Evolution of Data Structures (Cont d) Trees made hierarchical organization of data easy to handle. Applications of trees: administrative hierarchy in a business set up, storing an arithmetic expression, organization of the functions calls of a recursive program, etc. Search trees (e.g., BST) were designed to make search and retrieval efficient in trees. A BST may not allow fast search/retrieval, if it is very unbalanced, since the time complexities of the operations depended on the height of the tree. Hence the study of balanced trees and nearly balanced trees. Examples: AVL trees, 2-3 trees, 2-3-4 trees, RB trees, Skip lists, etc. Graphs generalize trees; model more general networks. Abstract data types. Advantages include: Encapsulation of data and operations, hiding of unnecessary details, localization and debugging of errors, ease of use since interface is clearly specified, ease of program development, etc. 9/2/23 CAP/CGS 5991: Lecture 2

Why is Statistics important in Bioinformatics? Random processes are inherent in biological processes (e.g., evolution) & in sampling (data collection). Errors are often unavoidable in the data collection process. Statistics helps in studying trends, interpolations, extrapolations, categorizations, classifications, inferences, models, predictions, assigning confidence to predictions, 9/2/23 CAP/CGS 5991: Lecture 2

Normal/Gaussian Distribution 9/2/23 CAP/CGS 5991: Lecture 2

Density & Cumulative Distribution Functions 9/2/23 CAP/CGS 5991: Lecture 2

Graphical Techniques: Scatter Plot 9/2/23 CAP/CGS 5991: Lecture 2

Common Discrete Distributions Binomial Geometric Poisson 9/2/23 CAP/CGS 5991: Lecture 2

Common Continuous Distributions Normal Exponential Gamma 9/2/23 CAP/CGS 5991: Lecture 2

Monte Carlo methods Numerical statistical simulation methods that utilize sequences of random numbers to perform the simulation. The primary components of a Monte Carlo simulation method include the following: Probability distribution functions (pdf's) --- the system described by a set of pdf's. Random number generator --- uniformly distributed on the unit interval. Sampling rule --- a prescription for sampling from the specified pdf's. Scoring (or tallying) --- the outcomes must be accumulated into overall tallies or scores for the quantities of interest. Error estimation --- an estimate of the statistical error (variance) as a function of the number of trials and other quantities must be determined. Variance reduction techniques --- methods for reducing the variance in the estimated solution to reduce the computational time for Monte Carlo simulation. 9/2/23 CAP/CGS 5991: Lecture 2

Parametric & Non-parametric equivalents Type of test 2-sample Paired sample Distribution >2 samples Correlation Crossed comparisons Multiple comparisons Parametric test t-test Paired t-test Chi-square 1-way ANOVA Pearson s correlation Factorial ANOVA Tukey, SNK, Dunnett s, Scheffe s Nonparametric test Mann-Whitney U-test Wilcoxon Kolmogorov-Smirnov Kruskal-Wallis Spearman s correlation Friedman s Quade Nonparametric version of its parametric equivalents 9/2/23 CAP/CGS 5991: Lecture 2

Selection of Statistical Tests 9/2/23 CAP/CGS 5991: Lecture 2

Selection of Multiple Comparison Tests 9/2/23 CAP/CGS 5991: Lecture 2

Why Sequence Analysis? Mutation in DNA is a natural evolutionary process. Thus sequence similarity may indicate common ancestry. In biomolecular sequences (DNA, RNA, protein), high sequence similarity implies significant structural and/or functional similarity. Errors possible in database. 9/2/23 CAP/CGS 5991: Lecture 2

V-sis Oncogene - Homologies Score E Sequences producing significant alignments: (bits) Value gi 332623 gb J2396.1 SEG_SSVPCS2 Simian sarcoma virus v-si... 4591. gi 61774 emb V121.1 RESSV1 Simian sarcoma virus proviral... 454. gi 332622 gb J2395.1 SEG_SSVPCS1 Simian sarcoma virus LTR... 1283. gi 885929 gb U2589.1 GLU2589 Gibbon leukemia virus envelo... 114. gi 45568 ref NM_268.1 Homo sapiens platelet-derived g... 954. gi 2987438 gb BC29822.1 Homo sapiens, platelet-derived g... 954. gi 33821 gb M12783.1 HUMSISPDG Human c-sis/platelet-derive... 954. 9/2/23 CAP/CGS 5991: Lecture 2

Sequence Alignment >gi 45568 ref NM_268.1 Homo sapiens platelet-derived growth factor beta polypeptide (simian sarcoma viral (v-sis) oncogene homolog) (PDGFB), transcript variant 1, mrna Length = 3373 Score = 954 bits (481), Expect =. Identities = 634/681 (93%), Gaps = 3/681 (%) Strand = Plus / Plus Query: 115 agggggaccccattcctgaggagctctataagatgctgagtggccactcgattcgctcct 174 Sbjct: 184 agggggaccccattcccgaggagctttatgagatgctgagtgaccactcgatccgctcct 1143 > 21 E G D P I P E E L Y E M L S D H S I R S Query: 175 tcgatgacctccagcgcctgctgcagggagactccggaaaagaagatggggctgagctgg 1134 Sbjct: 1144 ttgatgatctccaacgcctgctgcacggagaccccggagaggaagatggggccgagttgg 123 > 61 D L N M T R S H S G G E L E S L A R G R 9/2/23 CAP/CGS 5991: Lecture 2

Sequence Alignment Sequence 1 gi 332624 Simian sarcoma virus v-sis transforming protein p28 gene, complete cds; and 3' LTR long terminal repeat, complete sequence. Length 2984 (1.. 2984) Sequence 2 gi 45568 Homo sapiens platelet-derived growth factor beta polypeptide (simian sarcoma viral (v-sis) oncogene homolog) (PDGFB), transcript variant 1, mrna Length 3373 (1.. 3373) 9/2/23 CAP/CGS 5991: Lecture 2

Similarity vs. Homology Homologous sequences share common ancestry. Similarsequences are near to each other by some criteria. Similarity can be measured using appropriate criteria. 9/2/23 CAP/CGS 5991: Lecture 2

Types of Sequence Alignments Global Alignment: similarity over entire length Local Alignment: no overall similarity, but some segment(s) is/are similar Semi-global Alignment: end segments may not be similar Multiple Alignment: similarity between sets of sequences 9/2/23 CAP/CGS 5991: Lecture 2

Homework #1 Check out homework #1 on course homepage. 9/2/23 CAP/CGS 5991: Lecture 2

Global Sequence Alignment Needleman-Wunsch-Sellers algorithm. Dynamic Programming (DP) based. Overlapping Subproblems Recurrence Relation Table to store solutions to subproblems Ordering of subproblems to fill table Traceback to find solution 9/2/23 CAP/CGS 5991: Lecture 2

Global Alignment: An example V: G A A T T C A G T T A W: G G A T C G A 9/2/23 CAP/CGS 5991: Lecture 2

Recurrence Relation for Needleman-Wunsch-Sellers S[I, J] = MAXIMUM { S[I-1, J-1] + δ(v[i], W[J]), S[I-1, J] + δ(v[i], ), S[I, J-1] + δ(, W[J]) } 9/2/23 CAP/CGS 5991: Lecture 2

Global Alignment: An example V: G A A T T C A G T T A W: G G A T C G A 9/2/23 CAP/CGS 5991: Lecture 2

Traceback V: G A A T T C A G T T A W: G G A T C G - - A 9/2/23 CAP/CGS 5991: Lecture 2

Alternative Traceback V: G - A A T T C A G T T A W: G G - A T C G - - A 9/2/23 CAP/CGS 5991: Lecture 2

Improved Traceback G A A T T C A G T T A G 1 1 1 1 1 1 1 1 1 1 1 G 1 1 1 1 1 1 1 2 2 2 2 A 1 1 2 2 2 2 2 2 2 2 3 T 1 2 2 3 3 3 3 3 3 3 3 C 1 2 2 3 3 4 4 4 4 4 4 G 1 2 2 3 3 4 4 5 5 5 5 A 1 2 3 3 3 4 5 5 5 5 6 V: G A - A T T C A G T T A W: G - G A T C G - - A 9/2/23 CAP/CGS 5991: Lecture 2

Subproblems Optimally align V[1..I] and W[1..J] for every possible values of I and J. Having optimally aligned V[1..I-1] and W[1..J-1] V[1..I] and W[1..J-1] V[1..I-1] and W[1, J] It is possible to optimally align V[1..I] and W[1..J] 9/2/23 CAP/CGS 5991: Lecture 2

Time Complexity O(mn), where m = length of V, and n = length of W. 9/2/23 CAP/CGS 5991: Lecture 2

Generalizations of Similarity Function Mismatch Penalty = α Spaces (Insertions/Deletions, InDels) = β Affine Gap Penalties: (Gap open, Gap extension) = (γ,δ) Weighted Mismatch = Φ(a,b) Weighted Matches = Ω(a) 9/2/23 CAP/CGS 5991: Lecture 2

Alternative Scoring Schemes G A A T T C A G T T A -2-3 -4-5 -6-7 -8-9 -1-11 -12 G -2 1-1 -2-3 -4-5 -6-7 -8-9 -1 G -3-1 -1-3 -4-5 -6-7 -5-7 -8-9 A -4-2 -2-3 -4-5 -6-7 -8-7 T -5-3 -2-2 1-1 -2-3 -4-5 -6-7 C -6-4 -3-3 -1-1 -2-3 -4-5 -6 G -7-5 -4-4 -2-3 -2-2 -1-3 -4-5 A -8-6 -5-5 -3-4 -3-1 -3-3 -5-3 Match +1 Mismatch 2 Gap (-2, -1) V: G A A T T C A G T T A W: G G A T - C G - - A 9/2/23 CAP/CGS 5991: Lecture 2

Local Sequence Alignment Example:comparing long stretches of anonymous DNA; aligning proteins that share only some motifs or domains. Smith-Waterman Algorithm 9/2/23 CAP/CGS 5991: Lecture 2

Recurrence Relations (Global vs Local Alignments) S[I, J] = MAXIMUM { S[I-1, J-1] + δ(v[i], W[J]), S[I-1, J] + δ(v[i], ), S[I, J-1] + δ(, W[J]) } ---------------------------------------------------------------------- S[I, J] = MAXIMUM {, S[I-1, J-1] + δ(v[i], W[J]), S[I-1, J] + δ(v[i], ), S[I, J-1] + δ(, W[J]) } 9/2/23 CAP/CGS 5991: Lecture 2 Global Alignment Local Alignment

Local Alignment: Example G A A T T C A G T T A G 1 G 1 1 A 2 1 1 1 T 1 2 1 1 1 C 2 G 1 A 1 1 1 1 Match +1 Mismatch 1 Gap (-1, -1) V: - G A A T T C A G T T A W: G G - A T - C G - - A 9/2/23 CAP/CGS 5991: Lecture 2

Why Gaps? Example: Finding the gene site for a given (eukaryotic) cdna requires gaps. Example:HIV-virus strains Strain 1 Strain 2 Strain 3 Strain 4 9/2/23 CAP/CGS 5991: Lecture 2

What is cdna? cdna = Copy DNA DNA Transcription cdna Translation mrna Reverse Transcription Protein 9/2/23 CAP/CGS 5991: Lecture 2

Properties of Smith-Waterman Algorithm How to find all regions of high similarity? Find all entries above a threshold score and traceback. What if: Matches = 1 & Mismatches/spaces =? Longest Common Subsequence Problem What if: Matches = 1 & Mismatches/spaces = -? Longest Common Substring Problem What if the average entry is positive? Global Alignment 9/2/23 CAP/CGS 5991: Lecture 2

How to score mismatches? 9/2/23 CAP/CGS 5991: Lecture 2

BLOSUM n Substitution Matrices For each amino acid pair a, b For each BLOCK Align all proteins in the BLOCK Eliminate proteins that are more than n% identical Count F(a), F(b), F(a,b) Compute Log-odds Ratio log F( a, b) F( a) F( b) 9/2/23 CAP/CGS 5991: Lecture 2

String Matching Problem Pattern P Text T Set of Locations L 9/2/23 CAP/CGS 5991: Lecture 2

(Approximate) String Matching Input: Text T, Pattern P Question(s): Does P occur in T? Find one occurrence of P in T. Find all occurrences of P in T. Count # of occurrences of P in T. Find longest substring of P in T. Find closest substring of P in T. Locate direct repeats of P in T. Many More variants Applications: Is P already in the database T? Locate P in T. Can P be used as a primer for T? Is P homologous to anything in T? Has P been contaminated by T? Is prefix(p) = suffix(t)? Locate tandem repeats of P in T. 9/2/23 CAP/CGS 5991: Lecture 2

Input: Output: Text T; Pattern P All occurrences of P in T. Methods: Naïve Method Rabin-Karp Method FSA-based method Knuth-Morris-Pratt algorithm Boyer-Moore Suffix Tree method Shift-And method 9/2/23 CAP/CGS 5991: Lecture 2

Naive Strategy ATAQAANANASPVANAGVERANANESISITALVDANANANANAS ANANAS ANANAS ANANAS ANANAS 9/2/23 CAP/CGS 5991: Lecture 2

Finite State Automaton ANANAS A A A N A 1 2 3 4 Finite State Automaton 7 A S A 6 A N 5 N ATAQAANANASPVANAGVERANANESISITALVDANANANANAS 9/2/23 CAP/CGS 5991: Lecture 2

State Transition Diagram - A AN ANA ANAN ANANA ANANAS A N S * 1 1 1 2 2 3 3 1 4 4 5 5 1 4 6 6 1 9/2/23 CAP/CGS 5991: Lecture 2

Input: Output: Text T; Pattern P All occurrences of P in T. Sliding Window Strategy: Initialize window on T; While (window within T) do Scan: if (window = P) then report it; Shift: shift window to right (by?? positions) endwhile; 9/2/23 CAP/CGS 5991: Lecture 2

Tries Storing: BIG BIGGER BILL GOOD GOSH 9/2/23 CAP/CGS 5991: Lecture 2

Suffix Tries & Compact Suffix Tries Store all suffixes of GOOGOL$ Suffix Trie 9/2/23 CAP/CGS 5991: Lecture 2 Compact Suffix Trie

Suffix Tries to Suffix Trees Compact Suffix Trie Suffix Tree 9/2/23 CAP/CGS 5991: Lecture 2

Suffix Trees Linear-time construction! String Matching, Substring matching, substring common to k of n strings All-pairs prefix-suffix problem Repeats & Tandem repeats Approximate string matching 9/2/23 CAP/CGS 5991: Lecture 2