Collected Works of Charles Dickens
A Random Dickens Quote If there were no bad people, there would be no good lawyers.
Original Sentence It was a dark and stormy night; the night was dark except at sunny intervals, when it was checked by a stormy gust of wind which made the night darker in the streets, fiercely agitating the scanty flame of the lamps that struggled against the darkness.
Problem
  Are similar phrases present in the sentence?
  Where in the sentence are these similar phrases?
  Very important: How will you help the user visualize this similarity?
  Not important: How similar are they, exactly? What is the extent of similarity?
Dark and Stormy Night It was a dark and stormy night; the night was dark except at sunny intervals, when it was checked by a stormy gust of wind which made the night darker in the streets, fiercely agitating the scanty flame of the lamps that struggled against the darkness.
Visualizing Similarities
  [Figure: two dot plots comparing the phrase "dark and stormy night" (axis label: Phrase) with the sentence "It was a dark and stormy night; the night was dark except at sunny intervals" (axis label: Sentence). Panels: Window = 1; Window = 4, Threshold = 4.]
Dot Plots To visualize similarity between sequences Window = 200bp
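The window and threshold idea can be sketched in a few lines. This is a minimal illustration, not the tool used to draw the plots on the slides: the function names `dot_plot` and `render` are hypothetical, and the sentence is lower-cased with spaces and punctuation removed by hand. A dot is placed at (i, j) when the windows starting at position i of the first sequence and position j of the second share at least `threshold` identical characters.

```python
def dot_plot(seq1, seq2, window=1, threshold=1):
    """Return the set of (i, j) cells whose windows share >= threshold identical positions."""
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(a == b for a, b in zip(seq1[i:i + window], seq2[j:j + window]))
            if matches >= threshold:
                dots.add((i, j))
    return dots


def render(seq1, seq2, dots):
    """Draw the plot as text: seq2 across the top, seq1 down the side."""
    lines = ["  " + seq2]
    for i, ch in enumerate(seq1):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq2)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    phrase = "darkandstormynight"
    sentence = "itwasadarkandstormynightthenightwasdark"
    # Window = 4, Threshold = 4: only exact 4-letter matches leave a dot.
    print(render(phrase, sentence, dot_plot(phrase, sentence, window=4, threshold=4)))
```

With Window = 1 every shared letter produces a dot, so the plot is noisy; raising the window and threshold leaves mostly the diagonal runs that mark repeated phrases.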
Unit Outline
  Dot Plots
  Simple Alignments
  Gaps
  Scoring Matrices
  Needleman and Wunsch Algorithm
  Database Searches
Simple Alignment
  Pairwise match: Match score (1) and Mismatch score (0)
  Seq 1: AAGATA, Seq 2: AATCTATA
  Alignments:
    AGTCTCTA      AGTCTCTA      AGTCTCTA
    AGGCTA         AGGCTA         AGGCTA
  Scores?
  $$\text{score} = \sum_{i=1}^{n} \begin{cases} \text{match score}, & \text{seq1}_i = \text{seq2}_i \\ \text{mismatch score}, & \text{seq1}_i \neq \text{seq2}_i \end{cases}$$
  Substring Problem in rosalind.info: SUBS
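The scoring rule for an ungapped alignment can be sketched as follows. The function names are illustrative assumptions, not part of the slides. `all_offsets` slides the shorter sequence along the longer one and scores each placement, and `find_motif` is one possible approach to the SUBS problem (1-based positions, overlapping occurrences included).

```python
def score_at_offset(seq1, seq2, offset, match=1, mismatch=0):
    """Score seq2 laid under seq1 starting at `offset`, without gaps."""
    window = seq1[offset:offset + len(seq2)]
    return sum(match if a == b else mismatch for a, b in zip(window, seq2))


def all_offsets(seq1, seq2, match=1, mismatch=0):
    """Scores for every offset at which seq2 fits entirely under seq1."""
    return [score_at_offset(seq1, seq2, k, match, mismatch)
            for k in range(len(seq1) - len(seq2) + 1)]


def find_motif(s, t):
    """1-based start positions of t in s, overlaps included (SUBS-style)."""
    return [i + 1 for i in range(len(s) - len(t) + 1) if s[i:i + len(t)] == t]


if __name__ == "__main__":
    print(all_offsets("AGTCTCTA", "AGGCTA"))        # [4, 0, 3]
    print(find_motif("GATATATGCATATACTT", "ATAT"))  # [2, 4, 10]
```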
Gaps
  All possible alignments with 2 consecutive gaps:
    AGTCTCTA
    --AGGCTA
    A--GGCTA
    AG--GCTA
    AGG--CTA
    AGGC--TA
    AGGCT--A
    AGGCTA--
  $$\text{score} = \sum_{i=1}^{n} \begin{cases} \text{gap penalty}, & \text{if } \text{seq1}_i = \text{'-'} \text{ or } \text{seq2}_i = \text{'-'} \\ \text{match score}, & \text{if no gaps and } \text{seq1}_i = \text{seq2}_i \\ \text{mismatch score}, & \text{if no gaps and } \text{seq1}_i \neq \text{seq2}_i \end{cases}$$
  Match = 1, Mismatch = 0, Gap penalty = -1
  AGG CT A, AG GCT A, AG GCTA
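As a sketch of the gapped scoring rule, the snippet below enumerates every placement of two consecutive gaps in AGGCTA and scores each against AGTCTCTA with Match = 1, Mismatch = 0, Gap penalty = -1. The helper names are assumptions made for illustration.

```python
def score_gapped(row1, row2, match=1, mismatch=0, gap=-1):
    """Score two equal-length alignment rows; '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def two_gap_placements(short):
    """All ways to insert one run of two gaps into `short`."""
    return [short[:k] + "--" + short[k:] for k in range(len(short) + 1)]


if __name__ == "__main__":
    for row in two_gap_placements("AGGCTA"):
        print(row, score_gapped("AGTCTCTA", row))
```

Under these scores the four placements AG--GCTA, AGG--CTA, AGGC--TA and AGGCT--A tie for the best score of 3.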
Homologs
  Terms
    Sequences that share a common ancestor
    Point mutations
    Indel events
  Contiguous indels of nucleotides are more likely: AGG CTA vs. AG G CTA
  Origination Penalty (-2) and Length Penalty (-1)
  Calculate scores now.
  Counting Point Mutations Problem: HAMM
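A possible way to apply the two penalties is sketched below. It rests on an assumption the slide does not spell out: a run of L consecutive gap columns costs origination + L × length, so a single gap costs -3 and a run of two costs -4. The gapped rows in the example are chosen for illustration only, and `hamming` is one straightforward approach to the HAMM problem.

```python
def score_affine(row1, row2, match=1, mismatch=0, origination=-2, length=-1):
    """Score two alignment rows, charging `origination` once per gap run
    and `length` for every gap column (assumed convention)."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    in_gap = False
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            if not in_gap:
                total += origination  # a new gap run starts here
            total += length
            in_gap = True
        else:
            total += match if a == b else mismatch
            in_gap = False
    return total


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))


if __name__ == "__main__":
    print(score_affine("AGTCTCTA", "AG--GCTA"))  # one run of two gaps: 5 - 4 = 1
    print(score_affine("AGTCTCTA", "AG-G-CTA"))  # two separate gaps:   5 - 6 = -1
    print(hamming("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT"))  # 7
```

Both example alignments contain five matches and two gap columns; the contiguous version scores higher because the origination penalty is paid only once.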
Likely Substitutions
  In a nucleotide mismatch, which substitutions are more likely to occur?
    AGTCTC      AGTCTC
    AGGCTC      AGCCTC
  Transitions and Transversions Problem: TRAN
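The distinction can be checked mechanically: a transition swaps two purines (A, G) or two pyrimidines (C, T), while a transversion swaps a purine for a pyrimidine or the reverse. The sketch below classifies each mismatch in the two example pairs; the function names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a, b):
    """Classify an aligned pair of nucleotides."""
    if a == b:
        return "match"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def ts_tv_counts(s, t):
    """Return (transitions, transversions) between two equal-length DNA strings."""
    kinds = [classify(a, b) for a, b in zip(s, t)]
    return kinds.count("transition"), kinds.count("transversion")


if __name__ == "__main__":
    print(ts_tv_counts("AGTCTC", "AGGCTC"))  # T -> G is a transversion: (0, 1)
    print(ts_tv_counts("AGTCTC", "AGCCTC"))  # T -> C is a transition:   (1, 0)
```

The TRAN problem asks for the ratio of the two counts, which follows directly from `ts_tv_counts` when the transversion count is non-zero.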
For DNA Sequences: Scoring Matrices
          A   T   C   G
      A   5  -4  -4  -4
      T  -4   5  -4  -4
      C  -4  -4   5  -4
      G  -4  -4  -4   5
  BLAST Matrix
          A   T   C   G
      A   5  -4  -4  -4
      T  -4   5  -4  -4
      C  -4  -4   5  -4
      G  -4  -4  -4   5
  Transition-Transversion Matrix
  Amino Acids: Polar, Non-polar, Acidic, Basic Residues
  Hydrophobicity, Charge, Electronegativity, and size
  Based on observations
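A scoring matrix is just a lookup table indexed by a pair of residues. The sketch below builds the +5 / -4 matrix shown above as a dictionary and uses it to score an ungapped alignment. The names `uniform_matrix` and `score_with_matrix` are assumptions for illustration, not functions from BLAST.

```python
def uniform_matrix(match=5, mismatch=-4, alphabet="ATCG"):
    """Build a substitution matrix with one match and one mismatch value."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def score_with_matrix(seq1, seq2, matrix):
    """Sum matrix entries over the aligned columns of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    blast_like = uniform_matrix()
    # Five matches and one mismatch: 5 * 5 + (-4) = 21
    print(score_with_matrix("AGTCTC", "AGGCTC", blast_like))
```

A matrix that treats transitions and transversions differently would use the same lookup, with different values stored for the two kinds of mismatch.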
Needleman and Wunsch Algorithm
  Gap Penalty = -1, Match Score = +1, Mismatch Score = 0

  Initialize the first row and column with cumulative gap penalties:
            A   C   T   C   G
        0  -1  -2  -3  -4  -5
    A  -1
    C  -2
    A  -3
    G  -4
    T  -5
    A  -6
    G  -7

  Fill the first row (A):   -1   1   0  -1  -2  -3
  Fill the second row (C):  -2   0   2   1   0  -1

  Completed matrix:
            A   C   T   C   G
        0  -1  -2  -3  -4  -5
    A  -1   1   0  -1  -2  -3
    C  -2   0   2   1   0  -1
    A  -3  -1   1   2   1   0
    G  -4  -2   0   1   2   2
    T  -5  -3  -1   1   1   2
    A  -6  -4  -2   0   1   1
    G  -7  -5  -3  -1   0   2

  Sequences aligned by traceback:
    A C A G T A G
    A C T C G
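The fill and traceback steps can be sketched as below, using the same scores (Gap Penalty = -1, Match Score = +1, Mismatch Score = 0). This is a minimal illustration under those assumptions. When several optimal alignments exist, the traceback prefers the diagonal move, then a gap in the second sequence, which is an arbitrary tie-breaking choice and may differ from the path drawn on the slide.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=0, gap=-1):
    """Global alignment. Returns (matrix, aligned_seq1, aligned_seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: cumulative gap penalties
        F[i][0] = i * gap
    for j in range(1, cols):          # first row: cumulative gap penalties
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align the two residues
                          F[i - 1][j] + gap,     # gap in seq2
                          F[i][j - 1] + gap)     # gap in seq1
    # Traceback from the bottom-right corner to the top-left corner.
    a1, a2 = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            a1.append(seq1[i - 1]); a2.append(seq2[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a1.append(seq1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(seq2[j - 1]); j -= 1
    return F, "".join(reversed(a1)), "".join(reversed(a2))


if __name__ == "__main__":
    F, top, bottom = needleman_wunsch("ACAGTAG", "ACTCG")
    for row in F:
        print(row)
    print(top)
    print(bottom)
```

For ACAGTAG against ACTCG the bottom-right cell, and therefore the global alignment score, is 2.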
Semi-Global Alignment
  Terminal gaps are not penalized
  Gap Penalty = -1, Match Score = +1, Mismatch Score = 0
  No Gap Penalty in the last row and column
  Sequences:
    C A G T A G C A
    T A G

  Initialize the first row and column with zeros:
            T   A   G
        0   0   0   0
    C   0
    A   0
    G   0
    T   0
    A   0
    G   0
    C   0
    A   0

  Completed matrix:
            T   A   G
        0   0   0   0
    C   0   0   0   0
    A   0   0   1   0
    G   0   0   0   2
    T   0   1   0   1
    A   0   0   2   1
    G   0   0   1   3
    C   0   0   0   3
    A   0   0   0   3
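One common way to implement free terminal gaps is sketched below. It is a different bookkeeping from the slide's matrix: leading gaps are free because the first row and column stay at zero, the interior is filled with the usual recurrence, and trailing gaps are made free by taking the best score anywhere in the last row or last column instead of carrying it down to the corner. Intermediate cells can therefore differ from the matrix above, while the best score is the same.

```python
def semi_global(seq1, seq2, match=1, mismatch=0, gap=-1):
    """Alignment with unpenalized terminal gaps.
    Returns (score, aligned_part_of_seq1, aligned_part_of_seq2)."""
    n, m = len(seq1), len(seq2)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # zeros: free leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Free trailing gaps: the alignment may end anywhere on the last row or column.
    ends = [(F[i][m], i, m) for i in range(n + 1)] + [(F[n][j], n, j) for j in range(m + 1)]
    score, i, j = max(ends)
    # Trace back until the first row or column is reached.
    a1, a2 = [], []
    while i > 0 and j > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            a1.append(seq1[i - 1]); a2.append(seq2[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            a1.append(seq1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(seq2[j - 1]); j -= 1
    return score, "".join(reversed(a1)), "".join(reversed(a2))


if __name__ == "__main__":
    print(semi_global("CAGTAGCA", "TAG"))  # (3, 'TAG', 'TAG')
```

For CAGTAGCA and TAG the short sequence lines up with the TAG inside the long one, giving a score of 3 with no penalty for the unaligned ends.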
Use the Semi-Global Alignment
  AACCTATAGCT and GCGATATA
    A A C C T A T A G C T
    G C G A T A T A
  Modify the previous method: Replace negative values with zero
Smith and Waterman Algorithm
          A  A  C  C  T  A  T  A  G  C  T
       0  0  0  0  0  0  0  0  0  0  0  0
    G  0  0  0  0  0  0  0  0  0  1  0  0
    C  0  0  0  1  1  0  0  0  0  0  2  1
    G  0  0  0  2  0  0  0  0  0  1  0  1
    A  0  1  1  1  0  0  1  0  1  0  0  0
    T  0  0  0  0  0  1  0  2  1  0  0  1
    A  0  0  1  3  0  0  2  0  3  2  1  0
    T  0  0  0  3  0  0  1  3  2  2  1  2
    A  0  0  0  3  0  0  2  2  4  3  2  1
Databases and Multiple Sequences
  BLAST: BLASTP, BLASTN, BLASTX, PSI-BLAST
  FASTA: FASTX
  Multiple Sequence Alignments: CLUSTAL Algorithm