Pattern matching in highly similar sequences

1 Pattern matching in highly similar sequences
Thierry Lecroq, joint work with N. Ben Nsira and É. Prieur-Gaston
Laboratoire d'Informatique, du Traitement de l'Information et des Systèmes (LITIS EA4108), Université de Rouen Normandie, France
Mathematics / Computer Science Day, 12 October 2018, Rouen, France

2 Outline 1 Bioinformatics 2 Examples of projects 3 Search in highly similar sequences Thierry Lecroq (LITIS, URN) Matching in similar sequences 12th October / 50

4 Bioinformaticians

5 Bioinformatics (in silico biology)
Excerpt from Wikipedia, May 2015: Bioinformatics comprises all the concepts and techniques needed for the computational interpretation of biological information. Several fields of application, or sub-disciplines, of bioinformatics have emerged: sequence bioinformatics [...]; structural bioinformatics [...]; network bioinformatics [...]; statistical bioinformatics and population bioinformatics.

6 Genetics

7 Size of genomes (Mb)
Escherichia coli (bacterium): 4.6
Saccharomyces cerevisiae (yeast): 13
C. elegans (worm): 100
Arabidopsis thaliana (plant): 125
Drosophila melanogaster (fly): 180
Rice: 400
Homo sapiens: 3300
Fern
Amoeba dubia (amoeba)

8 From Biology to Computer Science

9 Sequencing
before 2005: slow and expensive
since 2005: NGS, faster and faster, cheaper and cheaper

10 Data deluge

11 Next Generation Sequencing (NGS)
A sequencer produces millions of short fragments (length 150) called reads.
2 types of projects: alignment (mapping) on a reference genome; assembly for building a genome.

12 Mapping

13 Resequencing: reference genome

14 Resequencing: sequenced genome, reads

15 Single Nucleotide Polymorphism (SNP) or Single Nucleotide Variant (SNV)
[Figure: reference genome, sequenced genome and reads; several positions are marked x]

16 Sequencing errors
[Figure: reference genome, sequenced genome and reads; one position is marked x]

18 Outline 1 Bioinformatics 2 Examples of projects 3 Search in highly similar sequences

19 Book of exercises (with solutions)

20 Example: Zimin
$Z_1 = x_1$ and, for $k > 1$, $Z_k = Z_{k-1}\, x_k\, Z_{k-1}$
Example: abacabadabacaba
Question: give a linear-time algorithm for computing all Zimin-type prefixes of a given string.
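The recurrence can be tried out with a short Python sketch. It only builds the words from the definition and is not an answer to the linear-time question; the function name `zimin` and the choice of a, b, c, ... for the letters x_1, x_2, x_3, ... are assumptions made for illustration.

```python
def zimin(k, letters="abcdefghijklmnopqrstuvwxyz"):
    """Return the Zimin word Z_k, taking x_1, x_2, ... to be 'a', 'b', ..."""
    if k < 1 or k > len(letters):
        raise ValueError("k must be between 1 and the number of letters")
    word = letters[0]                      # Z_1 = x_1
    for i in range(1, k):
        word = word + letters[i] + word    # Z_k = Z_{k-1} x_k Z_{k-1}
    return word


print(zimin(4))  # abacabadabacaba
```

Each step doubles the length and adds one letter, so Z_k has length 2^k - 1.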

21 New location

22 miRabel
bioinfo.univ-rouen.fr/mirabel
Aggregation of software for prediction of microRNA targets
Collaboration with INSERM U982 DC2N and LMRS UMR 6085 CNRS
A. Quillet, C. Saad, G. Ferry, Y. Anouar, N. Vergne, T. Lecroq and C. Dubessy. Improving bioinformatics prediction of microRNA targets by ranks aggregation. bioRxiv, 2017

23 Phaeodactylum tricornutum
Differential transcriptomic analysis (RNA-seq) of an alga with 3 phenotypes
Collaboration with GlycoMEV EA 4358 and LMRS UMR 6085
C. Ovide, M.-C. Kiefer-Meyer, C. Bérard, N. Vergne, T. Lecroq, C. Plasson, C. Burel, S. Bernard, A. Driouich, P. Lerouge, I. Tournier, H. Dauchel and M. Bardor. Comparative in depth RNA sequencing of P. tricornutum's morphotypes reveals specific features of the oval morphotype. Scientific Reports, 8, Article number: 14340, 2018

24 Third generation read correction
Hybrid correction; self-correction; tool for evaluating correctors
PhD Thesis of Pierre Morisse (grant from Univ. Rouen Normandie 2016)
P. Morisse, T. Lecroq and A. Lefebvre. Hybrid correction of highly noisy long reads using a variable-order de Bruijn graph. Bioinformatics, 2018, accepted

25 UMI (Unique Marker Identifier)
Design of new algorithms for processing UMIs from NGS data
PhD Thesis of Ahmad Abdel Sater (grant from Normandie Region 2018)
Collaboration with INSERM U1245 at CRLCC Henri Becquerel
P.-J. Viailly, E. Bohers, M. Viennot, A. Abdel Sater, V. Marchand, P. Ruminy, H. Dauchel, T. Lecroq, M. Becker, P. Etancelin, H. Tilly, P. Vera and F. Jardin. I-LowVarFreq: improving low-frequency variant detection using a new UMI-based variant calling approach for paired-end sequencing NGS libraries. Proceedings of JOBIM, Marseilles, France, 2018

26 Mass spectrometry data
Study of Pemphigus
Collaboration with INSERM U1234
Reconstruction of peptides
[Figure: overlapping peptide fragments H T W Y Q K K P N A A P R Q Q K P N A A P R L L L Y]
M. Petit, M.-L. Walet-Balieu, P. Chan Tchi Song, L. Drouot, C. Burel, M. Maho-Vaillant, T. Lecroq, P. Cosette, D. Vaudry, O. Boyer, M. Bardor, P. Joly, S. Calbo. Longitudinal study of anti-dsg3 IgG repertoire by proteomics in Pemphigus following Rituximab treatment. Poster, JNRb, Rouen, France, 2018

27 Outline 1 Bioinformatics 2 Examples of projects 3 Search in highly similar sequences

28 Pattern matching
Find one (or all) position(s) of a pattern of length m in a sequence of length n:
with an index: O(m)
without an index: O(n)

29 Suffix Trie
y = atatgat$
Suffixes by starting position: 0 atatgat$; 1 tatgat$; 2 atgat$; 3 tgat$; 4 gat$; 5 at$; 6 t$; 7 $
[Figure: suffix trie of y, one leaf per suffix, labelled with the starting position of the suffix]

32 Suffix Tree
[Figure: suffix tree of y = atatgat$, with edge labels at, t, $, gat$ and atgat$ and leaves labelled by suffix positions]
Each edge label is stored as a (position, length) pair into y: at = (0,2), t = (1,1), $ = (7,1), gat$ = (4,4), atgat$ = (2,6)
[Weiner 73, McCreight 76, Ukkonen 92, Farach 97]

38 Suffix Tree
For y = atatgat$:
ta is a factor of y; tt is not
at occurs 3 times, at positions 0, 2 and 5
t occurs 3 times, at positions 1, 3 and 6
the length of the longest common prefix of the suffixes starting at positions 2 and 5 is 2
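The queries listed on this slide can be checked with a small Python sketch. It swaps the suffix tree for a naively built suffix array, which answers the same questions but is not constructed in linear time; the function names `suffix_array`, `occurrences` and `lcp` are hypothetical.

```python
from bisect import bisect_left


def suffix_array(y):
    """Start positions of the suffixes of y in lexicographic order (naive build)."""
    return sorted(range(len(y)), key=lambda i: y[i:])


def occurrences(y, sa, w):
    """Sorted start positions of w in y, located by binary search on the suffix array."""
    # Truncating sorted suffixes to |w| letters keeps them in sorted order.
    prefixes = [y[i:i + len(w)] for i in sa]
    lo = bisect_left(prefixes, w)
    hits = []
    while lo < len(sa) and prefixes[lo] == w:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


def lcp(y, i, j):
    """Length of the longest common prefix of the suffixes starting at i and j."""
    n = 0
    while i + n < len(y) and j + n < len(y) and y[i + n] == y[j + n]:
        n += 1
    return n


y = "atatgat$"
sa = suffix_array(y)
print(occurrences(y, sa, "ta"))  # [1]: ta is a factor of y
print(occurrences(y, sa, "tt"))  # []: tt is not
print(occurrences(y, sa, "at"))  # [0, 2, 5]
print(occurrences(y, sa, "t"))   # [1, 3, 6]
print(lcp(y, 2, 5))              # 2
```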

44 Complexities
Algorithms for building suffix trees are: on-line; linear time; linear space

45 Highly similar sequences
r sequences y_0, y_1, y_2, y_3:
y_0 = ATGCTAGCAAGATACAG
y_1 = ATGCTAGCAACATACAG
y_2 = ATGCGAGCAAGATACAG
y_3 = ATGCTAGCAACATACAT
Degenerate representation: ỹ = ATGC{G,T}AGCAA{C,G}ATACA{G,T}
The pattern GAGCAAC matches ỹ at position 4, although it occurs in none of y_0, ..., y_3.
Representation by a reference and its variations: y_0 and Z = (({2}, 4, G), ({1, 3}, 10, C), ({3}, 16, T)), where a triple (G, p, c) states that the sequences whose indices are in G carry the letter c at position p.
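As an illustration of the representation by y_0 and Z, the following Python sketch rebuilds the four sequences from the reference and the list of variations. The function name `expand` and the encoding of Z as a list of (set, position, letter) tuples are assumptions made for this example.

```python
def expand(y0, Z, r):
    """Rebuild y_0 .. y_{r-1} from the reference y0 and the variations Z.

    Each element of Z is (G, position, letter): every sequence whose index
    is in G carries `letter` at `position` instead of y0[position].
    """
    seqs = [list(y0) for _ in range(r)]
    for G, pos, letter in Z:
        for h in G:
            seqs[h][pos] = letter
    return ["".join(s) for s in seqs]


y0 = "ATGCTAGCAAGATACAG"
Z = [({2}, 4, "G"), ({1, 3}, 10, "C"), ({3}, 16, "T")]
for h, seq in enumerate(expand(y0, Z, 4)):
    print(h, seq)
```

The space used is |y_0| plus one triple per variation, instead of r full copies of nearly identical text.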

49 Sliding window
[Figure: a window of length m, the length of the pattern x, slides along the sequence y of length n]

50 Knuth-Morris-Pratt algorithm (1977)
[Figure: shift of x along y after a mismatch at position j of y]

51 Boyer-Moore algorithm
[Figure: comparisons between x and y, followed by a shift of x]

52 For highly similar sequences
Hamming distance. For $u, v \in A^*$ such that $|u| = |v|$: $\mathrm{Ham}(u, v) = |\{i \mid u[i] \neq v[i]\}|$
Longest Common Extension. For $x \in A^*$ and $0 \le i \le j \le |x| - 1$: $\mathrm{LCE}_k^x(i, j) = \max\{\ell \mid \mathrm{Ham}(x[i\,..\,i+\ell-1],\, x[j\,..\,j+\ell-1]) \le k\}$
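Both definitions translate directly into a short Python sketch. The names `ham` and `lce_k_naive` are hypothetical, and the scan below simply follows the definition letter by letter, with no preprocessing.

```python
def ham(u, v):
    """Hamming distance between two strings of equal length."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for a, b in zip(u, v) if a != b)


def lce_k_naive(x, i, j, k):
    """Largest l with Ham(x[i..i+l-1], x[j..j+l-1]) <= k, by a direct scan."""
    mismatches = 0
    l = 0
    while i + l < len(x) and j + l < len(x):
        if x[i + l] != x[j + l]:
            if mismatches == k:      # one more mismatch would exceed k
                break
            mismatches += 1
        l += 1
    return l


print(ham("CGACTA", "CTACTT"))           # 2
print(lce_k_naive("atatgat$", 2, 5, 0))  # 2
print(lce_k_naive("atatgat$", 2, 5, 1))  # 3
```

Because the Hamming distance of the two extensions never decreases as l grows, the first l at which the budget k would be exceeded gives the maximum.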

53 Kangaroo jumps
[Figure: starting from positions i and j, successive jumps over matching stretches, numbered 1, 2, ...]
$\mathrm{LCE}_k^x(i, j)$ can be computed in O(k) time after O(n) preprocessing time.
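The jumping scheme can be sketched in Python as follows. The helper `lce_exact` compares letters directly and is only a stand-in: the O(k) bound assumes that an exact longest-common-extension query costs constant time, which is typically obtained from a suffix tree with constant-time lowest-common-ancestor queries. Both function names are hypothetical.

```python
def lce_exact(x, i, j):
    """Exact longest common extension by direct comparison (stand-in for an
    O(1) query on a preprocessed index)."""
    l = 0
    while i + l < len(x) and j + l < len(x) and x[i + l] == x[j + l]:
        l += 1
    return l


def lce_k(x, i, j, k):
    """Longest common extension with at most k mismatches, using at most
    k + 1 exact queries (the kangaroo jumps)."""
    l = 0
    for jump in range(k + 1):
        l += lce_exact(x, i + l, j + l)           # jump over a matching stretch
        if i + l >= len(x) or j + l >= len(x):    # reached the end of x
            break
        if jump < k:
            l += 1                                # step over one mismatch
    return l


print(lce_k("atatgat$", 2, 5, 0))  # 2
print(lce_k("atatgat$", 2, 5, 1))  # 3
```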

60 References
Restriction: 1 variation on a window of size m
N. Ben Nsira, T. Lecroq and M. Elloumi. A fast Boyer-Moore type pattern matching algorithm for highly similar sequences. International Journal of Data Mining and Bioinformatics 13(3) (2015)
N. Ben Nsira, T. Lecroq and M. Elloumi. On-line String Matching in Highly Similar DNA Sequences. Mathematics in Computer Science 11(2) (2017)

61 2 variants
searching for a finite set of patterns
relaxing the restriction from 1 to k variations on a window of size m

62 Single pattern with at most k variations
Applying the Landau-Vishkin algorithm as a filter: searching with k mismatches in O(kn).
When $\mathrm{Ham}(x, y_0[j\,..\,j+m-1]) = \ell \le k$:
$\ell = 0$: an exact occurrence of the pattern has been found in $y_0$ and in all the other sequences that have no variation with respect to $y_0$ between position $j$ and position $j+m-1$, both included.
$\ell > 0$: let $W = \{i_0, \ldots, i_{\ell-1}\}$ be the set of the $\ell$ positions such that $y_0[j+i_p] \neq x[i_p]$ with $0 \le p < \ell$. Then x occurs exactly in $y_h$ if:
$(G, j+i_p, x[i_p]) \in Z$ with $h \in G$ for all $0 \le p < \ell$;
there is no $(G, q, c) \in Z$ with $h \in G$ and $j \le q \le j+m-1$ such that $q - j \notin W$.

63 Single pattern with at most k variations
r = 2 and k = [value missing]
y_0 = ACCTACGACTA
y_1 = ACCTACTACTT
x = CTACTT
[Figure: x aligned against the sequences at two window positions]
Our solution runs in time O(knr).
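The filtering idea can be sketched in Python on this example. Several things here are assumptions rather than the authors' implementation: the function name `occurrences_with_variations`, the value k = 2, the list Z (derived by comparing the y_0 and y_1 shown on the slide), and the naive window scan that stands in for the Landau-Vishkin filter, so the sketch does not achieve the O(knr) bound. It also assumes that every variation letter differs from the corresponding letter of y_0.

```python
def occurrences_with_variations(x, y0, Z, r, k):
    """Sorted (h, j) pairs such that x occurs exactly in y_h at position j.

    Z is a list of (G, position, letter) triples relative to the reference y0.
    """
    m = len(x)
    # var[h] maps a position to the letter that y_h carries there.
    var = [dict() for _ in range(r)]
    for G, pos, letter in Z:
        for h in G:
            var[h][pos] = letter
    found = []
    for j in range(len(y0) - m + 1):
        # Filter: offsets where x disagrees with the window of y0 (naive scan).
        W = [i for i in range(m) if y0[j + i] != x[i]]
        if len(W) > k:
            continue
        for h in range(r):
            # Variations of y_h falling inside the window, as offset -> letter.
            inside = {p - j: c for p, c in var[h].items() if j <= p < j + m}
            # Every mismatch must be repaired by a variation of y_h, and y_h
            # must have no other variation inside the window.
            if set(inside) == set(W) and all(inside[i] == x[i] for i in W):
                found.append((h, j))
    return sorted(found)


y0 = "ACCTACGACTA"
Z = [({1}, 6, "T"), ({1}, 10, "T")]   # y_1 = ACCTACTACTT
print(occurrences_with_variations("CTACTT", y0, Z, r=2, k=2))  # [(1, 5)]
```

At window position 2 the pattern also has two mismatches against y_0, but y_1 repairs only one of them, so that window is rejected; at position 5 both mismatches are repaired by variations of y_1.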

65 Multiple patterns with at most 1 variation
Build a classical trie of the patterns
Scan the highly similar sequences with at most 2 active states

66 Multiple patterns with at most 1 variation
X = {ACGA, ACTA, CTA} and r = 2 sequences
[Figure: automaton built on the trie of X, with states 0 to 9; state 4 outputs {ACGA}, state 6 outputs {ACTA, CTA}, state 9 outputs {CTA}; a transition is labelled Σ \ {A, C}]
[Figure: scan of y_0 = ACCTACGACTA and of the variations T, T of y_1, showing the active states]

67 Multiple patterns with at most 1 variation
Our solution runs in time O(n) for the searching phase and in time O(s) for the preprocessing phase, where $s = \sum_{x \in X} |x|$.
Experiments on similar sequences of different lengths with patterns of length [value missing]
[Figure: running time in seconds against sequence length (×10^6) for EDSM, LVsim and ACsim]

68 References
N. Ben Nsira, T. Lecroq and É. Prieur-Gaston. Practical fast exact pattern matching algorithm for highly similar sequences. International Conference on Bioinformatics and Biomedicine (BIBM), 2018, Madrid, submitted

69 Perspectives
Adapt other pattern matching techniques
Relax the restrictions
Adaptive analysis

71 Thank you for your attention!


More information

String Regularities and Degenerate Strings

String Regularities and Degenerate Strings M. Sc. Thesis Defense Md. Faizul Bari (100705050P) Supervisor: Dr. M. Sohel Rahman String Regularities and Degenerate Strings Department of Computer Science and Engineering Bangladesh University of Engineering

More information

Compror: On-line lossless data compression with a factor oracle

Compror: On-line lossless data compression with a factor oracle Information Processing Letters 83 (2002) 1 6 Compror: On-line lossless data compression with a factor oracle Arnaud Lefebvre a,, Thierry Lecroq b a UMR CNRS 6037 ABISS, Faculté des Sciences et Techniques,

More information

Algorithms Design & Analysis. String matching

Algorithms Design & Analysis. String matching Algorithms Design & Analysis String matching Greedy algorithm Recap 2 Today s topics KM algorithm Suffix tree Approximate string matching 3 String Matching roblem Given a text string T of length n and

More information

BSC 4934: QʼBIC Capstone Workshop" Giri Narasimhan. ECS 254A; Phone: x3748

BSC 4934: QʼBIC Capstone Workshop Giri Narasimhan. ECS 254A; Phone: x3748 BSC 4934: QʼBIC Capstone Workshop" Giri Narasimhan ECS 254A; Phone: x3748 giri@cs.fiu.edu http://www.cs.fiu.edu/~giri/teach/bsc4934_su10.html July 2010 7/12/10 Q'BIC Bioinformatics 1 Overview of Course"

More information

Introduction to Bioinformatics

Introduction to Bioinformatics CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics

More information

Algorithms in Computational Biology (236522) spring 2008 Lecture #1

Algorithms in Computational Biology (236522) spring 2008 Lecture #1 Algorithms in Computational Biology (236522) spring 2008 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: 15:30-16:30/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office hours:??

More information

Algorithmics and Bioinformatics

Algorithmics and Bioinformatics Algorithmics and Bioinformatics Gregory Kucherov and Philippe Gambette LIGM/CNRS Université Paris-Est Marne-la-Vallée, France Schedule Course webpage: https://wikimpri.dptinfo.ens-cachan.fr/doku.php?id=cours:c-1-32

More information

CSE182-L7. Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding CSE182

CSE182-L7. Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding CSE182 CSE182-L7 Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding 10-07 CSE182 Bell Labs Honors Pattern matching 10-07 CSE182 Just the Facts Consider the set of all substrings

More information

Least Random Suffix/Prefix Matches in Output-Sensitive Time

Least Random Suffix/Prefix Matches in Output-Sensitive Time Least Random Suffix/Prefix Matches in Output-Sensitive Time Niko Välimäki Department of Computer Science University of Helsinki nvalimak@cs.helsinki.fi 23rd Annual Symposium on Combinatorial Pattern Matching

More information

List of Code Challenges. About the Textbook Meet the Authors... xix Meet the Development Team... xx Acknowledgments... xxi

List of Code Challenges. About the Textbook Meet the Authors... xix Meet the Development Team... xx Acknowledgments... xxi Contents List of Code Challenges xvii About the Textbook xix Meet the Authors................................... xix Meet the Development Team............................ xx Acknowledgments..................................

More information

G4120: Introduction to Computational Biology

G4120: Introduction to Computational Biology ICB Fall 2009 G4120: Introduction to Computational Biology Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology & Immunology Copyright 2008 Oliver Jovanovic, All Rights Reserved. Genome

More information

Università della Calabria

Università della Calabria Università della Calabria Facoltà di Ingegneria BIOINFORMATICS TECHNIQUES AND METHODOLOGIES Research group coordinated by Prof. Luigi Palopoli Lecturer: Simona Rombo OUTLINE 1. Introduction to Bioinformatics

More information

Linear-Time Computation of Local Periods

Linear-Time Computation of Local Periods Linear-Time Computation of Local Periods Jean-Pierre Duval 1, Roman Kolpakov 2,, Gregory Kucherov 3, Thierry Lecroq 4, and Arnaud Lefebvre 4 1 LIFAR, Université de Rouen, France Jean-Pierre.Duval@univ-rouen.fr

More information

Data Structure for Dynamic Patterns

Data Structure for Dynamic Patterns Data Structure for Dynamic Patterns Chouvalit Khancome and Veera Booning Member IAENG Abstract String matching and dynamic dictionary matching are significant principles in computer science. These principles

More information

Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics

Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics 582746 Modelling and Analysis in Bioinformatics Lecture 1: Genomic k-mer Statistics Juha Kärkkäinen 06.09.2016 Outline Course introduction Genomic k-mers 1-Mers 2-Mers 3-Mers k-mers for Larger k Outline

More information

Ensembl Genomes (non-chordates): Quick tour. This quick tour provides a brief introduction to Ensembl Genomes [2], the non-chordate genome browser.

Ensembl Genomes (non-chordates): Quick tour. This quick tour provides a brief introduction to Ensembl Genomes [2], the non-chordate genome browser. Paul Kersey [1] DNA & RNA Beginner 0.5 hour This quick tour provides a brief introduction to Ensembl Genomes [2], the non-chordate genome browser. Learning objectives: Basic understanding of Ensembl Genomes

More information

Computational Biology From The Perspective Of A Physical Scientist

Computational Biology From The Perspective Of A Physical Scientist Computational Biology From The Perspective Of A Physical Scientist Dr. Arthur Dong PP1@TUM 26 November 2013 Bioinformatics Education Curriculum Math, Physics, Computer Science (Statistics and Programming)

More information

Introduction to Bioinformatics Integrated Science, 11/9/05

Introduction to Bioinformatics Integrated Science, 11/9/05 1 Introduction to Bioinformatics Integrated Science, 11/9/05 Morris Levy Biological Sciences Research: Evolutionary Ecology, Plant- Fungal Pathogen Interactions Coordinator: BIOL 495S/CS490B/STAT490B Introduction

More information

All three must be approved Deadlines around: 21. sept, 26. okt, and 16. nov

All three must be approved Deadlines around: 21. sept, 26. okt, and 16. nov INF 4130 / 9135 29/8-2012 Today s slides are produced mainly by Petter Kristiansen Lecturer Stein Krogdahl Mandatory assignments («Oblig1», «-2», and «-3»): All three must be approved Deadlines around:

More information

Genomes and Their Evolution

Genomes and Their Evolution Chapter 21 Genomes and Their Evolution PowerPoint Lecture Presentations for Biology Eighth Edition Neil Campbell and Jane Reece Lectures by Chris Romero, updated by Erin Barley with contributions from

More information

Paired-End Read Length Lower Bounds for Genome Re-sequencing

Paired-End Read Length Lower Bounds for Genome Re-sequencing 1/11 Paired-End Read Length Lower Bounds for Genome Re-sequencing Rayan Chikhi ENS Cachan Brittany PhD student in the Symbiose team, Irisa, France 2/11 NEXT-GENERATION SEQUENCING Next-gen vs. traditional

More information

Drosophila melanogaster and D. simulans, two fruit fly species that are nearly

Drosophila melanogaster and D. simulans, two fruit fly species that are nearly Comparative Genomics: Human versus chimpanzee 1. Introduction The chimpanzee is the closest living relative to humans. The two species are nearly identical in DNA sequence (>98% identity), yet vastly different

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Theoretical distribution of PSSM scores

Theoretical distribution of PSSM scores Regulatory Sequence Analysis Theoretical distribution of PSSM scores Jacques van Helden Jacques.van-Helden@univ-amu.fr Aix-Marseille Université, France Technological Advances for Genomics and Clinics (TAGC,

More information

ECOL/MCB 320 and 320H Genetics

ECOL/MCB 320 and 320H Genetics ECOL/MCB 320 and 320H Genetics Instructors Dr. C. William Birky, Jr. Dept. of Ecology and Evolutionary Biology Lecturing on Molecular genetics Transmission genetics Population and evolutionary genetics

More information

Statistical mass spectrometry-based proteomics

Statistical mass spectrometry-based proteomics 1 Statistical mass spectrometry-based proteomics Olga Vitek www.stat.purdue.edu Outline What is proteomics? Biological questions and technologies Protein quantification in label-free workflows Joint analysis

More information

Proteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it?

Proteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it? Proteomics What is it? Reveal protein interactions Protein profiling in a sample Yeast two hybrid screening High throughput 2D PAGE Automatic analysis of 2D Page Yeast two hybrid Use two mating strains

More information

Copyright Mark Brandt, Ph.D A third method, cryogenic electron microscopy has seen increasing use over the past few years.

Copyright Mark Brandt, Ph.D A third method, cryogenic electron microscopy has seen increasing use over the past few years. Structure Determination and Sequence Analysis The vast majority of the experimentally determined three-dimensional protein structures have been solved by one of two methods: X-ray diffraction and Nuclear

More information

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB I519 Introduction to Bioinformatics, 2015 Genome Comparison Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Whole genome comparison/alignment Build better phylogenies Identify polymorphism

More information

Bayesian Clustering of Multi-Omics

Bayesian Clustering of Multi-Omics Bayesian Clustering of Multi-Omics for Cardiovascular Diseases Nils Strelow 22./23.01.2019 Final Presentation Trends in Bioinformatics WS18/19 Recap Intermediate presentation Precision Medicine Multi-Omics

More information

Implementing Approximate Regularities

Implementing Approximate Regularities Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National

More information

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Introduction Bioinformatics is a powerful tool which can be used to determine evolutionary relationships and

More information

Protein function studies: history, current status and future trends

Protein function studies: history, current status and future trends 19 3 2007 6 Chinese Bulletin of Life Sciences Vol. 19, No. 3 Jun., 2007 1004-0374(2007)03-0294-07 ( 100871) Q51A Protein function studies: history, current status and future trends MA Jing, GE Xi, CHANG

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

GENOME-WIDE ANALYSIS OF CORE PROMOTER REGIONS IN EMILIANIA HUXLEYI

GENOME-WIDE ANALYSIS OF CORE PROMOTER REGIONS IN EMILIANIA HUXLEYI 1 GENOME-WIDE ANALYSIS OF CORE PROMOTER REGIONS IN EMILIANIA HUXLEYI Justin Dailey and Xiaoyu Zhang Department of Computer Science, California State University San Marcos San Marcos, CA 92096 Email: daile005@csusm.edu,

More information

Genomes Comparision via de Bruijn graphs

Genomes Comparision via de Bruijn graphs Genomes Comparision via de Bruijn graphs Student: Ilya Minkin Advisor: Son Pham St. Petersburg Academic University June 4, 2012 1 / 19 Synteny Blocks: Algorithmic challenge Suppose that we are given two

More information

Linear-Space Alignment

Linear-Space Alignment Linear-Space Alignment Subsequences and Substrings Definition A string x is a substring of a string x, if x = ux v for some prefix string u and suffix string v (similarly, x = x i x j, for some 1 i j x

More information

Proteomics. 2 nd semester, Department of Biotechnology and Bioinformatics Laboratory of Nano-Biotechnology and Artificial Bioengineering

Proteomics. 2 nd semester, Department of Biotechnology and Bioinformatics Laboratory of Nano-Biotechnology and Artificial Bioengineering Proteomics 2 nd semester, 2013 1 Text book Principles of Proteomics by R. M. Twyman, BIOS Scientific Publications Other Reference books 1) Proteomics by C. David O Connor and B. David Hames, Scion Publishing

More information

Small RNA in rice genome

Small RNA in rice genome Vol. 45 No. 5 SCIENCE IN CHINA (Series C) October 2002 Small RNA in rice genome WANG Kai ( 1, ZHU Xiaopeng ( 2, ZHONG Lan ( 1,3 & CHEN Runsheng ( 1,2 1. Beijing Genomics Institute/Center of Genomics and

More information

On the Sound Covering Cycle Problem in Paired de Bruijn Graphs

On the Sound Covering Cycle Problem in Paired de Bruijn Graphs On the Sound Covering Cycle Problem in Paired de Bruijn Graphs Christian Komusiewicz 1 and Andreea Radulescu 2 1 Institut für Softwaretechnik und Theoretische Informatik, TU Berlin, Germany christian.komusiewicz@tu-berlin.de

More information

Analysis of Algorithms Prof. Karen Daniels

Analysis of Algorithms Prof. Karen Daniels UMass Lowell Computer Science 91.503 Analysis of Algorithms Prof. Karen Daniels Spring, 2012 Tuesday, 4/24/2012 String Matching Algorithms Chapter 32* * Pseudocode uses 2 nd edition conventions 1 Chapter

More information

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models)

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models) Regulatory Sequence Analysis Sequence models (Bernoulli and Markov models) 1 Why do we need random models? Any pattern discovery relies on an underlying model to estimate the random expectation. This model

More information

Advanced Algorithms and Models for Computational Biology

Advanced Algorithms and Models for Computational Biology Advanced Algorithms and Models for Computational Biology Introduction to cell biology genomics development and probability Eric ing Lecture January 3 006 Reading: Chap. DTM book Introduction to cell biology

More information

Genome Sequencing & DNA Sequence Analysis

Genome Sequencing & DNA Sequence Analysis 7.91 / 7.36 / BE.490 Lecture #1 Feb. 24, 2004 Genome Sequencing & DNA Sequence Analysis Chris Burge What is a Genome? A genome is NOT a bag of proteins What s in the Human Genome? Outline of Unit II: DNA/RNA

More information

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools CAP 5510: to : Tools ECS 254A / EC 2474; Phone x3748; Email: giri@cis.fiu.edu My Homepage: http://www.cs.fiu.edu/~giri http://www.cs.fiu.edu/~giri/teach/bioinfs15.html Office ECS 254 (and EC 2474); Phone:

More information

BIOINFORMATICS LAB AP BIOLOGY

BIOINFORMATICS LAB AP BIOLOGY BIOINFORMATICS LAB AP BIOLOGY Bioinformatics is the science of collecting and analyzing complex biological data. Bioinformatics combines computer science, statistics and biology to allow scientists to

More information

Consensus Optimizing Both Distance Sum and Radius

Consensus Optimizing Both Distance Sum and Radius Consensus Optimizing Both Distance Sum and Radius Amihood Amir 1, Gad M. Landau 2, Joong Chae Na 3, Heejin Park 4, Kunsoo Park 5, and Jeong Seop Sim 6 1 Bar-Ilan University, 52900 Ramat-Gan, Israel 2 University

More information

2. Exact String Matching

2. Exact String Matching 2. Exact String Matching Let T = T [0..n) be the text and P = P [0..m) the pattern. We say that P occurs in T at position j if T [j..j + m) = P. Example: P = aine occurs at position 6 in T = karjalainen.

More information

arxiv: v2 [cs.ds] 16 Mar 2015

arxiv: v2 [cs.ds] 16 Mar 2015 Longest common substrings with k mismatches Tomas Flouri 1, Emanuele Giaquinta 2, Kassian Kobert 1, and Esko Ukkonen 3 arxiv:1409.1694v2 [cs.ds] 16 Mar 2015 1 Heidelberg Institute for Theoretical Studies,

More information

Faster Algorithms for String Matching with k Mismatches

Faster Algorithms for String Matching with k Mismatches 794 Faster Algorithms for String Matching with k Mismatches Amihood Amir *t Bar-Ilan University and Georgia Tech Moshe Lewenstein* * Bar-Ilan University Ely Porat* Bar Ilan University and Weizmann Institute

More information

Network alignment and querying

Network alignment and querying Network biology minicourse (part 4) Algorithmic challenges in genomics Network alignment and querying Roded Sharan School of Computer Science, Tel Aviv University Multiple Species PPI Data Rapid growth

More information

Quasi-Linear Time Computation of the Abelian Periods of a Word

Quasi-Linear Time Computation of the Abelian Periods of a Word Quasi-Linear Time Computation of the Abelian Periods of a Word G. Fici 1, T. Lecroq 2, A. Lefebvre 2, É. Prieur-Gaston2, and W. F. Smyth 3 1 Gabriele.Fici@unice.fr, I3S, CNRS & Université Nice Sophia Antipolis,

More information