Multiple sequence alignment (MSA)

From pairwise to multiple
A T _ A T C A...
A _ C A T _ A...
A T _ G C G _...
A _ C G T _ A...
A T C A C _ A...
_ T C G A G A...
Relationship of sequences (Tree)

NODE: a node represents a taxonomic unit. This can be a taxon (an existing species) or an ancestor (an unknown species that represents the ancestor of two or more species).
BRANCH: defines the relationship between the taxa in terms of descent and ancestry.
TOPOLOGY: the branching pattern.
BRANCH LENGTH: often represents the number of changes that have occurred in that branch.
ROOT: the common ancestor of all taxa.
DISTANCE SCALE: a scale that represents the number of differences between sequences (e.g. 0.1 means 10% difference between two sequences).
Source: https://users.ugent.be/~avierstr/principles/phylogeny.html
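The distance scale can be illustrated with a short Python sketch. It assumes the simplest possible measure, the proportion of differing sites between two already aligned sequences; the function name `p_distance`, the example sequences, and the decision to skip columns containing a gap are illustrative assumptions, not something taken from the slides.

```python
def p_distance(seq_a: str, seq_b: str, gap: str = "_") -> float:
    """Proportion of differing sites between two aligned sequences.

    Assumption: columns where either sequence has a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # ignore gapped columns
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


# Hypothetical aligned sequences: 1 difference over 10 compared sites.
print(p_distance("ATCGATCGAT", "ATCGATCGAA"))  # 0.1, i.e. 10% difference
```

Real tools usually correct this raw proportion with a substitution model, so the number above is only the uncorrected starting point.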

MSA is useful in bioinformatics for:
Phylogenetic trees
Motifs
Structure prediction (RNA, protein)
Gene logos
Conserved sequence elements

ClustalW procedure
The progressive method (e.g. ClustalW):
Step 1. Pairwise alignments
Step 2. Build guide tree
Step 3. Progressive alignment guided by the tree

http://ai.stanford.edu/~chuongdo/papers/alignment_review.pdf

Step 1. Pairwise alignments
Pairwise sequence alignments
Scoring matrix (e.g. the BLOSUM62 matrix)
Gap penalties
Global/local alignments
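To make the roles of the scoring matrix and the gap penalty concrete, here is a minimal Python sketch of a global (Needleman-Wunsch) alignment score. It is only an illustration: instead of BLOSUM62 it assumes a flat match/mismatch score and a linear gap penalty, and the function name and default values are hypothetical.

```python
def global_alignment_score(s: str, t: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    Assumption: a flat match/mismatch score stands in for a real
    substitution matrix such as BLOSUM62.
    """
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: aligning a prefix against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # (mis)match
                              score[i - 1][j] + gap,      # gap in t
                              score[i][j - 1] + gap)      # gap in s
    return score[-1][-1]


print(global_alignment_score("ACGT", "ACGT"))  # 4: four matches
print(global_alignment_score("ACGT", "AGT"))   # 1: three matches, one gap
```

A local (Smith-Waterman) variant would additionally clamp every cell at 0 and report the best cell anywhere in the table rather than the last one.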

Step 2. Build guide tree
Neighbor-Joining algorithm
UPGMA
[Figure: guide tree with leaves 1, 3, 2, 4]

Neighbor-Joining Algorithm Step 1: prepare three matrices: P, T, Q

P (pairwise distances):
     A   B   C   D
A    0   8   4   6
B    8   0   8   8
C    4   8   0   6
D    6   8   6   0

T (total distance, the row sums of P), e.g. T(A) = 0 + 8 + 4 + 6 = 18:
A   18
B   24
C   18
D   20

Q(i,j) = (n-2) * P(i,j) - T(i) - T(j), e.g. Q(A,D) = (4-2)*6 - 18 - 20 = -26:
     A    B    C    D
A    0  -26  -28  -26
B  -26    0  -26  -28
C  -28  -26    0  -26
D  -26  -28  -26    0

https://www.youtube.com/watch?v=agsudxq7gp8
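The T and Q calculations of Step 1 can be checked with a few lines of Python. This is a sketch that uses the slide's 4x4 example matrix; the function name `nj_totals_and_q` is hypothetical.

```python
def nj_totals_and_q(P):
    """Return (T, Q) for a symmetric distance matrix P (list of lists)."""
    n = len(P)
    T = [sum(row) for row in P]  # total distance of each taxon
    Q = [[0 if i == j else (n - 2) * P[i][j] - T[i] - T[j]
          for j in range(n)]
         for i in range(n)]
    return T, Q


# Example matrix from the slide, taxa in the order A, B, C, D.
P = [[0, 8, 4, 6],
     [8, 0, 8, 8],
     [4, 8, 0, 6],
     [6, 8, 6, 0]]
T, Q = nj_totals_and_q(P)
print(T)     # [18, 24, 18, 20]
print(Q[0])  # [0, -26, -28, -26]
```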

Neighbor-Joining Algorithm Step 2: find the smallest value in Q and pick one of the pairs with that value to merge.
In the Q matrix of Step 1 the minimum, -28, occurs for two pairs: A,C and B,D.
https://www.youtube.com/watch?v=agsudxq7gp8

Neighbor-Joining Algorithm Step 3: merge A,C and produce the new P, T, Q

1) From the previous P and T, compute the branch length from A to the new node joining A and C, with n = 4 (because P is a 4x4 matrix):
P(A,C)/2 + (T(A) - T(C)) / (2*(n-2)) = 4/2 + (18-18)/(2*(4-2)) = 2
2) Subtract 2 from the entries of the previous P that involve A,C (B: 8 - 2 = 6, D: 6 - 2 = 4); the entries that are not affected stay the same. This gives the new P, from which the new T and Q are computed.

New P:
      A,C   B   D
A,C    0    6   4
B      6    0   8
D      4    8   0

New T:
A,C   10
B     14
D     12

New Q (now n = 3):
      A,C    B    D
A,C    0   -18  -18
B    -18     0  -18
D    -18   -18    0

https://www.youtube.com/watch?v=agsudxq7gp8
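The merge step can also be sketched in Python. Note one deliberate substitution: the slide updates P by subtracting the branch length (2) from the affected entries, whereas the sketch below uses the usual Neighbor-Joining update, (P(A,k) + P(C,k) - P(A,C)) / 2. On this example both give the same new P. The function name `nj_merge` and its return layout are hypothetical.

```python
def nj_merge(P, labels, i, j):
    """Join taxa i and j of distance matrix P (requires len(P) > 2).

    Returns (branch_i, branch_j, new_labels, new_P), where the merged
    cluster is placed first in new_labels / new_P.
    """
    n = len(P)
    T = [sum(row) for row in P]
    # Branch lengths from i and j to the new internal node.
    branch_i = P[i][j] / 2 + (T[i] - T[j]) / (2 * (n - 2))
    branch_j = P[i][j] - branch_i
    keep = [k for k in range(n) if k not in (i, j)]
    # Assumption: standard NJ update for the distance to each remaining taxon.
    to_new = [(P[i][k] + P[j][k] - P[i][j]) / 2 for k in keep]
    new_labels = [labels[i] + "," + labels[j]] + [labels[k] for k in keep]
    new_P = [[0.0] + to_new]
    for pos, k in enumerate(keep):
        new_P.append([to_new[pos]] + [float(P[k][m]) for m in keep])
    return branch_i, branch_j, new_labels, new_P


P = [[0, 8, 4, 6],
     [8, 0, 8, 8],
     [4, 8, 0, 6],
     [6, 8, 6, 0]]
branch_a, branch_c, new_labels, new_P = nj_merge(P, ["A", "B", "C", "D"], 0, 2)
print(branch_a, branch_c)  # 2.0 2.0
print(new_labels)          # ['A,C', 'B', 'D']
print(new_P)               # [[0.0, 6.0, 4.0], [6.0, 0.0, 8.0], [4.0, 8.0, 0.0]]
```

Summing the rows of `new_P` gives 10, 14 and 12, which matches the new T on the slide.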

[Figure: tree after merging A,C. Branch length of A = 2; branch length of C = 4 - 2 = 2. B and D remain to be joined.]

Repeat Step 2 and merge A,C with B. Exercise for students: work out the new P, T, Q yourselves.

UPGMA Algorithm

     A   B   C   D   E
A    0   8   4   6   8
B    8   0   8   8   4
C    4   8   0   6   8
D    6   8   6   0   8
E    8   4   8   8   0

[Figure: leaves A, B, C, D, E, not yet joined]
https://www.youtube.com/watch?v=c2y9s_e2184

UPGMA Algorithm

Find the smallest value and pick one pair with that value to merge.
Merge A,C (the distance is simply split in half: 4/2 = 2).

      A,C   B   D   E
A,C    0    8   6   8
B      8    0   8   4
D      6    8   0   8
E      8    4   8   0

Again pick the smallest value to merge.
Merge B and E (the distance is simply split in half: 4/2 = 2).

      A,C  B,E   D
A,C    0    8    6
B,E    8    0    8
D      6    8    0

[Figure: tree so far. A and C joined with branch lengths 2 and 2; B and E joined with branch lengths 2 and 2; D not yet joined.]
https://www.youtube.com/watch?v=c2y9s_e2184
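The UPGMA merges can be reproduced with a short Python sketch. It assumes the usual UPGMA rule that the distance between clusters is the size-weighted average of the member distances, and it breaks ties by taking the first minimum it meets, which is an arbitrary choice. The function name `upgma` and the output format are hypothetical.

```python
def upgma(D, labels):
    """Return the merges as (cluster_a, cluster_b, height) tuples."""
    n = len(labels)
    dist = {frozenset((labels[i], labels[j])): float(D[i][j])
            for i in range(n) for j in range(i + 1, n)}
    size = {name: 1 for name in labels}
    merges = []
    while len(size) > 1:
        # Closest pair; ties are resolved by dictionary order (arbitrary).
        pair, d = min(dist.items(), key=lambda item: item[1])
        a, b = sorted(pair)
        new = a + "," + b
        merges.append((a, b, d / 2))  # the distance is split in half
        # Keep the distances that do not involve a or b.
        new_dist = {p: v for p, v in dist.items()
                    if a not in p and b not in p}
        for k in size:
            if k in (a, b):
                continue
            # Size-weighted average of the distances to the merged members.
            avg = (size[a] * dist[frozenset((a, k))]
                   + size[b] * dist[frozenset((b, k))]) / (size[a] + size[b])
            new_dist[frozenset((new, k))] = avg
        size[new] = size[a] + size[b]
        del size[a], size[b]
        dist = new_dist
    return merges


D = [[0, 8, 4, 6, 8],
     [8, 0, 8, 8, 4],
     [4, 8, 0, 6, 8],
     [6, 8, 6, 0, 8],
     [8, 4, 8, 8, 0]]
for merge in upgma(D, ["A", "B", "C", "D", "E"]):
    print(merge)
# ('A', 'C', 2.0)
# ('B', 'E', 2.0)
# ('A,C', 'D', 3.0)
# ('A,C,D', 'B,E', 4.0)
```

The first two merges match the slides; the last two continue the same procedure until a single cluster remains.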

UPGMA: sometimes two or more pairs share the smallest value.
Image source: https://www.youtube.com/watch?v=c2y9s_e2184

UPGMA: sometimes the branch lengths cannot be assigned perfectly.
Image source: https://www.youtube.com/watch?v=c2y9s_e2184

Step 3. Progressive alignment
[Figure: pairwise alignments 1-2, 1-3, 1-4, 2-3, 2-4, 3-4 and the guide tree with leaves 1, 3, 2, 4]

Problems with progressive alignments
Dependence on the initial pairwise sequence alignments.
Errors from the initial alignments propagate into the final alignment.

Example
This figure and the following ones are from the T-Coffee paper: Notredame, Higgins, Heringa, JMB 2000, 302: 205-217

MUSCLE
Robert C. Edgar, Nucleic Acids Research, 2004, Vol. 32, No. 5, 1792-1797
There are three main stages:
Stage 1. Draft progressive
Stage 2. Improved progressive
Stage 3. Refinement



https://www.ebi.ac.uk/tools/msa/

MUSCLE https://www.ebi.ac.uk/tools/msa/muscle/

MAFFT https://www.ebi.ac.uk/tools/msa/mafft/

Newick format
Topology only: (B,(A,C,E),D);
With branch lengths: (B:2,(A:1,C:1,E:1),D:3);
More details: http://evolution.genetics.washington.edu/phylip/newicktree.html
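To show how the nested parentheses map onto a tree, here is a small Python sketch that writes a tree out as Newick text. The in-memory representation is an assumption made for this example: a leaf is a string, an internal node is a list of children, and a `(subtree, length)` tuple attaches a branch length. The function name `newick` is hypothetical.

```python
def newick(node) -> str:
    """Return the Newick text of a subtree (without the trailing ';')."""
    if isinstance(node, tuple):   # (subtree, branch_length)
        subtree, length = node
        return f"{newick(subtree)}:{length}"
    if isinstance(node, list):    # internal node: list of children
        return "(" + ",".join(newick(child) for child in node) + ")"
    return str(node)              # leaf name


topology = ["B", ["A", "C", "E"], "D"]
with_lengths = [("B", 2), [("A", 1), ("C", 1), ("E", 1)], ("D", 3)]
print(newick(topology) + ";")      # (B,(A,C,E),D);
print(newick(with_lengths) + ";")  # (B:2,(A:1,C:1,E:1),D:3);
```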

MEGA https://www.megasoftware.net/

Homework
APOBEC ("apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like") is a family of evolutionarily conserved cytidine deaminases. mRNA editing is one mechanism for generating protein diversity. Members of this family are C-to-U editing enzymes. The N-terminal domain of APOBEC-like proteins is the catalytic domain, while the C-terminal domain is a pseudocatalytic domain. More specifically, the catalytic domain is a zinc-dependent cytidine deaminase domain and is essential for cytidine deamination. RNA editing by APOBEC-1 requires homodimerisation, and this complex interacts with RNA-binding proteins to form the editosome. In humans and other mammals these enzymes help protect against viral infections. When misregulated, they are a major source of mutation in numerous cancer types. (from Wikipedia)
https://en.wikipedia.org/wiki/apobec
http://cgmmrc.cgu.edu.tw/files/14-1064-53824,r34-1.php?lang=zh-tw

Chen et al., APOBEC3A is an oral cancer prognostic biomarker in Taiwanese carriers of an APOBEC deletion polymorphism. Nature Communications 8:465, 2017

Please prepare the assignment in Microsoft Word or PDF format. Name the file StudentID_Name, e.g. u9934123_Name. Deadline: upload to ilms before 3/28 15:30.
(1) How many human APOBEC family members are deposited in NCBI?
(2) Perform MSA and tree visualization analysis using MAFFT, T-Coffee, and MUSCLE (DNA sequences).
(3) Are the MSA results the same? If not, why?