Introduction to Sequence Analysis

[Slides 1-4]

Introduction to Sequence Analysis
Utah State University, Spring 2012
STAT 5570: Statistical Bioinformatics Notes

References
- Chapters 2 & 7 of Biological Sequence Analysis (Durbin et al., 2001)

Review
Genes are:
- sequences of DNA that "do something"
- can be expressed as a string of nucleic acids: A, C, G, T (4-letter alphabet)

Central Dogma of Molecular Biology
DNA -> mRNA -> protein -> biological action
Proteins can be expressed as a string of amino acids (20-letter alphabet; sometimes 24 due to similarities).

Why look at protein sequence?
Levels of protein structure:
- Primary structure: order of amino acids
- Secondary structure: repeating structures (beta-sheets and alpha-helices) in backbone
- Tertiary structure: full three-dimensional folded structure
- Quaternary structure: interaction of multiple backbones
Sequence -> shape -> function
Similar sequence -> similar function?

[Slides 5-8]

Consider simple pairwise alignment
Sequence 1: HEAGAWGHEE
Sequence 2: AWHEAE
How similar are these two sequences?
- Match up exactly?
- Subsequences similar?
- Which positions could possibly be matched without severe penalty?

Possible alignments (Sequence 1 / Sequence 2)
- Alignment 1: HEAGAWGHEE / AWHEAE
- Alignment 2: HEAGAWGHEE / AW-HE-AE
- Alignment 3: HEA-GAWGHEE / AWHEAE
- Alignment 4: HEAGAWGHE-E / AW-HEAE
To find the best alignment, we need some way to rate alignments.
Think of gaps in an alignment as mutational insertions or deletions.

Basic idea of scoring potential alignments
- + score: identities and conservative substitutions
- - score: non-conservative changes (not expected in real alignments)
Add the score at each position. This is equivalent to assuming mutations are independent, a reasonable assumption for DNA and proteins but not for structural RNAs.

Some Notation
Let x be sequence 1 and y be sequence 2.
- $q_a$ = frequency of letter a in a sequence
- $p_{ab}$ = P{a, b from common ancestor}
Random Model R: assume independence of the sequences:
$P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$
Matched Model M: assume residues a & b are aligned as a pair with probability $p_{ab}$:
$P(x, y \mid M) = \prod_i p_{x_i y_i}$

[Slides 9-12]

Compare these two models
Odds Ratio:
$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$
Need: $p_{ab}$
Log Odds Ratio:
$S = \sum_i s(x_i, y_i)$, where $s(a, b) = \log \frac{p_{ab}}{q_a q_b}$
s(a,b) is the log likelihood ratio of the pair (a,b) occurring as an aligned pair, as opposed to an unaligned pair.

Score Matrix (or substitution matrix)
[Table: portion of the BLOSUM50 substitution matrix; rows and columns indexed by amino acids A, R, N, D, ..., Y, V; entries are s(a,b)]
This is a portion of the BLOSUM50 substitution matrix; others exist.
These are scaled and rounded log-odds values (for computational efficiency).

How to get these substitution values?
Basic idea: look at existing, known alignments. Compare sequences of aligned proteins and look at substitution frequencies.
This is a chicken-or-the-egg problem: alignment <-> scoring scheme.
Maybe better to base the alignment on tertiary structures (or some other alignment).

Some substitution matrix types
BLOSUM (Henikoff)
- BLOCK substitution matrix
- derived from the BLOCKS database: a set of aligned ungapped protein families, clustered according to a threshold percentage (L) of identical residues
- compare residue frequencies between clusters
- L=50 -> BLOSUM50
PAM (Dayhoff)
- percentage of acceptable point mutations per $10^8$ years
- derived from a general model for protein evolution, based on the number L of PAMs (evolutionary distance)
- PAM1 from comparing sequences with <1% divergence
- L=250 -> PAM250 = PAM1^250
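As a rough illustration of the log-odds score s(a,b) = log(p_ab / (q_a q_b)), here is a minimal R sketch. The probabilities are made-up values chosen only to show how the sign of the score arises; they are not BLOSUM or PAM estimates, and the names `q`, `p_AA`, `p_AW`, and `log_odds` are hypothetical.

```r
# Hypothetical background frequencies q_a (illustrative values only)
q <- c(A = 0.08, W = 0.01)

# Hypothetical aligned-pair probabilities p_ab (illustrative values only)
p_AA <- 0.02
p_AW <- 0.0004

# Log-odds score of residues a and b occurring as an aligned pair
log_odds <- function(p_ab, q_a, q_b) log(p_ab / (q_a * q_b))

s_AA <- unname(log_odds(p_AA, q["A"], q["A"]))  # ratio > 1, so score > 0
s_AW <- unname(log_odds(p_AW, q["A"], q["W"]))  # ratio < 1, so score < 0

# Score of an ungapped alignment = sum of the per-position scores
S <- s_AA + s_AW
```

A pair that occurs more often in true alignments than chance would predict gets a positive score; a pair that occurs less often gets a negative one. Published matrices additionally scale and round these values.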

[Slides 13-16]

Which substitution matrix to use?
No universal best way. In general:
- low PAM: find short alignments of similar sequences
- high PAM: find longer, weaker local alignments
- BLOSUM standards: BLOSUM50 for alignment with gaps, BLOSUM62 for ungapped alignments
- higher PAM, lower BLOSUM: more divergent (looking for more distantly related proteins)
A reasonable strategy: BLOSUM62 complemented with PAM250.

Which matrix for aligning DNA sequences?
The BLOSUM and PAM matrices are based on similarities between amino acids. No such similarity is assumed for nucleic acids; residues either match or they don't.
Unitary matrix (identity matrix):
- +1 for identical match (or +3 or ...)
- 0 for non-match (or -2 or ...)

How to score gaps?
One way: affine gap penalty (a linear transformation followed by a translation).
For a gap of length g:
$\gamma(g) = -d - (g - 1)e$
- d: gap opening penalty
- e: gap extension penalty (e < d)
Think of gaps in an alignment as mutational insertions or deletions.

Tabular representation of alignment
[Table: dynamic-programming grid indexed by the residues of AWHEAE and HEAGAWGHEE, with 0 in the starting cell; other cell values not shown]
- start with 0
- begin (or continue) gap: -d (or -e)
- match letters (residues): + s(a,b)
Fill in the table to give the max. of the possible values at each successive element, and keep track of which direction generated the max. Then use the path that gives the highest final score (lower right corner).
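The affine gap penalty and the table-filling idea can be sketched in base R. This is only an illustrative sketch under stated simplifications: `gap_penalty` and `nw_score` are hypothetical helper names, the defaults d = 4 and e = 1 are assumed values, and the table fill uses a linear gap cost with simple match/mismatch scoring rather than an affine penalty with a BLOSUM or PAM matrix.

```r
# Affine gap penalty: gamma(g) = -d - (g - 1) * e for a gap of length g >= 1
gap_penalty <- function(g, d = 4, e = 1) {
  stopifnot(all(g >= 1))
  -d - (g - 1) * e
}

# Global alignment score by filling the table (Needleman-Wunsch style),
# simplified to a linear gap cost and match/mismatch scoring.
nw_score <- function(x, y, match = 1, mismatch = 0, gap = 2) {
  a <- strsplit(x, "")[[1]]
  b <- strsplit(y, "")[[1]]
  n <- length(a)
  m <- length(b)
  tab <- matrix(0, nrow = n + 1, ncol = m + 1)
  tab[, 1] <- -gap * (0:n)   # first column: x aligned against gaps only
  tab[1, ] <- -gap * (0:m)   # first row: y aligned against gaps only
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (a[i] == b[j]) match else mismatch
      tab[i + 1, j + 1] <- max(tab[i, j] + s,        # align a[i] with b[j]
                               tab[i, j + 1] - gap,  # a[i] against a gap
                               tab[i + 1, j] - gap)  # b[j] against a gap
    }
  }
  tab[n + 1, m + 1]          # final score sits in the lower right corner
}

gap_penalty(1:3)             # penalties for gaps of length 1, 2, 3
nw_score("HEAGAWGHEE", "AWHEAE")
```

Recovering the alignment itself, not just its score, would also require recording which of the three directions produced each maximum and tracing back from the lower right corner.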

[Slides 17-20]

Alignment algorithms
- Global: Needleman-Wunsch - find optimal alignment for entire sequences (prev. slide)
- Local: Smith-Waterman - find optimal alignment for subsequences
- Repeated matches - allow for starting over sequences (find motifs in long sequences)
- Overlap matches - allow for one sequence to contain or overlap the other (for comparing fragments)
- Heuristic: BLAST, FASTA - for comparing a single sequence against a large database of sequences

Compare global and local alignments
Sequence 1: HEAGAWGHEE
Sequence 2: AWHEAE

    Global Pairwise Alignment (1 of 1)
    pattern: [1] HEAGAWGHE-E
    subject: [1] ---AW-HEAE
    score: 23

    Local Pairwise Alignment (1 of 1)
    pattern: [5] AWGHE-E
    subject: [2] AW-HEAE
    score:

Simple pairwise alignment in R

    library(Biostrings)
    # Define sequences
    seq1 <- "HEAGAWGHEE"
    seq2 <- "AWHEAE"
    # perform global alignment
    g.align <- pairwiseAlignment(seq1, seq2,
                 substitutionMatrix='BLOSUM50', gapOpening=-4,
                 gapExtension=-1, type='global')
    g.align
    # perform local alignment
    l.align <- pairwiseAlignment(seq1, seq2,
                 substitutionMatrix='BLOSUM50', gapOpening=-4,
                 gapExtension=-1, type='local')
    l.align

Look at a bigger example
The pairseqsim package (not in current Bioconductor) has a companion file (ex.fasta) with sequence data for 67 protein sequences in FASTA format:

    >At1g01010 NAC domain protein, putative
    MEDQVGFGFRNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDWNLRFQSKYKSRD...
    VISWIILVG
    >At1g01020 unknown protein
    MAASEHRCVGCGFRVKSLFIQYSGNIRLMKCGNCKEVADEYIECERMIIFIDLILHRK
    VYRHVLYNAINATVNIQHLLWKLVFAYLLLDCYRSLLLRKSDEESSFSDSVLLSIKVR
    SFLFNGLN
    >At1g01030 DNA-binding protein, putative
    MDLSLATTTTSSDQEQDRDQELTSNIGASSSSGSGNNNNLMMMIEKEHMFDKVV...
    EESWLVRGEIGASSSSSSALRLNLSTDHDDDNDDGDDGDDDQFAKKGKSSLSLNFN
    >At1g01040 CAF protein
    MVMEDEREATIKSYWLDACEDISCDLIDDLVSEFDSSVAVNESTDENGVINDFFGGI...
    DKDRKRARVCSYQSERSNLSGRGHVNNSREGDRFMNRKRTRNWDEAGNNKKKRECNNYRR

http://

[Slides 21-24]

Bigger example
For a given sequence (subject), "At1g01010 NAC domain protein, putative", find the most similar sequence in a list (pattern): "At1g01190 cytochrome P450, putative".

    Global Pairwise Alignment (1 of 1)
    pattern: [1] MRTEIESLWVF-----ALASKFNIYMQQHFASLL---VAIAITWFTITI...
    subject: [1] MEDQVG--FGFRNDEELVGH---YLRNKIEGNTSRDVEVAIS EVNIC...
    score: 313

    # read in data in FASTA format
    f1 <- "C://folder//ex.fasta" # file saved from website (slide 20)
    # (newer Biostrings releases name this function readAAStringSet)
    ff <- read.AAStringSet(f1, "fasta")
    # compare first sequence (subject) with the others (pattern)
    sub <- ff[1]
    names(sub) # "At1g01010 NAC domain protein, putative"
    pat <- ff[2:length(ff)]
    # get scores of all global alignments
    s <- pairwiseAlignment(pat, sub,
           substitutionMatrix='PAM250', gapOpening=-4,
           gapExtension=-1, type='global', scoreOnly=TRUE)
    hist(s, main=c('global alignment scores with', names(sub)))
    # look at best alignment
    k <- which.max(s)
    names(pat[k]) # "At1g01190 cytochrome P450, putative"
    pairwiseAlignment(pat[k], sub,
      substitutionMatrix='PAM250', gapOpening=-4,
      gapExtension=-1, type='global')

(names refer to gene name or locus)

Phylogenetic trees - intro & motivation
- Phylogeny: relationship among species
- Phylogenetic tree: visualization of phylogeny (usually a dendrogram)
How can we do this here?
- Consider multiple sequences (maybe from different species)
- Similar sequences are called homologues: descended from a common ancestor sequence? similar function?
- Want to visualize these relationships

Quick review of agglomerative clustering
- define distance between points
- each point (sequence here) starts as its own cluster
- find closest clusters and merge them
- Linkage: how to define distance between the new cluster and existing clusters
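The agglomerative steps listed above can be tried on a toy example in base R. This is a minimal sketch with a made-up distance matrix among four hypothetical sequences labeled A to D (the values are for illustration only and are not derived from ex.fasta); `hclust()` and `as.dist()` come from the stats package that ships with R.

```r
# Hypothetical pairwise distances among four sequences A-D (illustrative only)
m <- matrix(c(0,    0.10, 0.80, 0.90,
              0.10, 0,    0.70, 0.85,
              0.80, 0.70, 0,    0.20,
              0.90, 0.85, 0.20, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# Each point starts as its own cluster; the closest clusters are merged in turn
hc <- hclust(as.dist(m), method = "average")

hc$merge   # each row: the two clusters joined at that step (negative = single point)
hc$height  # distance at which each merge happened
plot(hc, main = "Toy dendrogram")
```

With these values, A and B merge first, then C and D, and the two pairs are joined last; changing `method` changes only how the distance to a newly formed cluster is computed.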

[Slides 25-28]

Recall linkage methods (a few)
Let i, j, k be clusters, $d_{ij}$ be the distance between i and j, $d_{k,ij}$ be the distance between k and the new cluster ij, and $n_i$ be the number of points in cluster i.
- Single (nearest neighbor): $d_{k,ij} = \min(d_{ki}, d_{kj})$
- Average: $d_{k,ij} = (d_{ki} + d_{kj}) / 2$
- Ward: $d_{k,ij} = \frac{(n_i + n_k) d_{ki} + (n_j + n_k) d_{kj} - n_k d_{ij}}{n_i + n_j + n_k}$
- UPGMA: $d_{k,ij} = \frac{n_i d_{ki} + n_j d_{kj}}{n_i + n_j}$

Defining distance between sequences i & j
Why not Euclidean, Pearson, etc.? Sequences are not points in space.
Could use (after pairwise alignment):
- 1 - normalized score {score (or 0) divided by smaller self-score}
- 1 - %identity (based on length of shorter sequence)
- 1 - %similarity
Making use of models for residue substitution (for DNA):
Let f = fraction of sites in the pairwise alignment where residues differ = 1 - %identity.
Jukes-Cantor distance: $d_{ij} = -\frac{3}{4} \log\left(1 - \frac{4f}{3}\right)$

Visualize relationships among 11 sequences from ex.fasta file

    # Function to get phylogenetic distance matrix for multiple sequences
    # -- don't worry about syntax here; just see next slide for usage
    get.phylo.dist <- function(seqs,subm='BLOSUM62',open=-4,ext=-1,type='local')
    {
      # Get matrix of pairwise local alignment scores
      num.seq <- length(seqs)
      s.mat <- matrix(ncol=num.seq, nrow=num.seq)
      for(i in 1:num.seq)
      {
        for(j in i:num.seq)
        {
          s.mat[i,j] <- s.mat[j,i] <- pairwiseAlignment(seqs[i], seqs[j],
                          substitutionMatrix=subm, gapOpening=open,
                          gapExtension=ext, type=type, scoreOnly=TRUE)
        }
      }
      # Convert scores to normalized scores
      norm.mat <- matrix(ncol=num.seq, nrow=num.seq)
      for(i in 1:num.seq)
      {
        for(j in i:num.seq)
        {
          min.self <- min(s.mat[i,i],s.mat[j,j])
          norm.mat[i,j] <- norm.mat[j,i] <- s.mat[i,j]/min.self
        }
        norm.mat[i,i] <- 0
      }
      # Return distance matrix
      colnames(norm.mat) <- rownames(norm.mat) <- substr(names(seqs),1,9)
      return(as.dist(1-norm.mat))
    }
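As a small worked example of the Jukes-Cantor distance, here is a minimal base R sketch. The helper names `frac_diff` and `jukes_cantor` are hypothetical, the two DNA strings are made up, and the sketch assumes the sequences are already aligned, equal in length, and free of gaps.

```r
# Fraction of sites where two aligned, gap-free DNA strings differ (f)
frac_diff <- function(x, y) {
  a <- strsplit(x, "")[[1]]
  b <- strsplit(y, "")[[1]]
  stopifnot(length(a) == length(b))
  mean(a != b)
}

# Jukes-Cantor distance: d = -(3/4) * log(1 - 4f/3), defined for 0 <= f < 3/4
jukes_cantor <- function(f) {
  stopifnot(all(f >= 0), all(f < 0.75))
  -(3 / 4) * log(1 - 4 * f / 3)
}

f <- frac_diff("ACGTACGT", "ACGAACGA")  # 2 of 8 sites differ
d <- jukes_cantor(f)
```

The distance is slightly larger than f itself because the correction allows for multiple substitutions at the same site; it grows without bound as f approaches 3/4.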

[Slides 29-32]

R code for phylogenetic trees from pairwise distances

    # Choose sequences
    seqs <- ff[50:60] # recall ff object from slide 22
    # phylogenetic tree
    dmat <- get.phylo.dist(seqs,subm='BLOSUM62',type='local')
    plot(hclust(dmat,method="average"),main='Phylogenetic Tree',
         xlab='Normalized Score')
    # heatmap representation
    library(cluster)
    library(RColorBrewer)
    hmcol <- colorRampPalette(brewer.pal(10,"PuOr"))(256)
    hclust.ave <- function(d){hclust(d,method="average")}
    heatmap(as.matrix(dmat),sym=TRUE,col=hmcol,
            cexRow=4,cexCol=1,hclustfun=hclust.ave)

Aside: visualizing sequence content

    tab <- table(strsplit(as.character(ff[1]),""))
    use.col <- rep('yellow',length(tab))
    t <- names(tab)=='S'
    use.col[t] <- 'blue'
    barplot(tab,col=use.col,main=names(ff[1]))

Probably more useful for: assessing C-G counts in DNA sequences

    library(affy); library(hgu95av2.db); library(annotate)
    GI <- as.list(hgu95av2ACCNUM)
    n.GI <- names(GI)
    t <- n.GI=="1950_s_at"
    seq <- getSEQ(GI[t])
    tab <- table(strsplit(seq,""))
    use.col <- rep('yellow', length(tab))
    t <- names(tab)=='G'
    use.col[t] <- 'blue'
    barplot(tab,col=use.col,
            main="Sequence content of 1950_s_at on hgu95av2")

Summary
Look at sequence similarity to find functional similarity?
Pairwise alignment basics:
- Scoring matrix: BLOSUM, PAM, etc.
- Alignment algorithm: global, local, etc.
Coming up:
- searching online databases (BLAST)
- multiple alignments
- pattern (motif) finding
- using sequencing to measure gene expression


More information

GenCB 511 Coarse Notes Population Genetics NONRANDOM MATING & GENETIC DRIFT

GenCB 511 Coarse Notes Population Genetics NONRANDOM MATING & GENETIC DRIFT NONRANDOM MATING & GENETIC DRIFT NONRANDOM MATING/INBREEDING READING: Hartl & Clark,. 111-159 Wll dstngush two tyes of nonrandom matng: (1) Assortatve matng: matng between ndvduals wth smlar henotyes or

More information

Digital PI Controller Equations

Digital PI Controller Equations Ver. 4, 9 th March 7 Dgtal PI Controller Equatons Probably the most common tye of controller n ndustral ower electroncs s the PI (Proortonal - Integral) controller. In feld orented motor control, PI controllers

More information

Naïve Bayes Classifier

Naïve Bayes Classifier 9/8/07 MIST.6060 Busness Intellgence and Data Mnng Naïve Bayes Classfer Termnology Predctors: the attrbutes (varables) whose values are used for redcton and classfcaton. Predctors are also called nut varables,

More information

MEM 255 Introduction to Control Systems Review: Basics of Linear Algebra

MEM 255 Introduction to Control Systems Review: Basics of Linear Algebra MEM 255 Introducton to Control Systems Revew: Bascs of Lnear Algebra Harry G. Kwatny Department of Mechancal Engneerng & Mechancs Drexel Unversty Outlne Vectors Matrces MATLAB Advanced Topcs Vectors A

More information

PHYS 705: Classical Mechanics. Newtonian Mechanics

PHYS 705: Classical Mechanics. Newtonian Mechanics 1 PHYS 705: Classcal Mechancs Newtonan Mechancs Quck Revew of Newtonan Mechancs Basc Descrpton: -An dealzed pont partcle or a system of pont partcles n an nertal reference frame [Rgd bodes (ch. 5 later)]

More information

Lecture 6 More on Complete Randomized Block Design (RBD)

Lecture 6 More on Complete Randomized Block Design (RBD) Lecture 6 More on Complete Randomzed Block Desgn (RBD) Multple test Multple test The multple comparsons or multple testng problem occurs when one consders a set of statstcal nferences smultaneously. For

More information

Sequence Analysis. Example of nucleotide sequence database entry for Genbank

Sequence Analysis. Example of nucleotide sequence database entry for Genbank //8 E N T R E F O R I N T E G R T I V E B I O I N F O R M T I S V U [] Substtuton matrces Seuence analyss 6 [] Substtuton matrces Seuence analyss 6 Seuence nalyss Fndng relatonshps between genes and gene

More information

2-Adic Complexity of a Sequence Obtained from a Periodic Binary Sequence by Either Inserting or Deleting k Symbols within One Period

2-Adic Complexity of a Sequence Obtained from a Periodic Binary Sequence by Either Inserting or Deleting k Symbols within One Period -Adc Comlexty of a Seuence Obtaned from a Perodc Bnary Seuence by Ether Insertng or Deletng Symbols wthn One Perod ZHAO Lu, WEN Qao-yan (State Key Laboratory of Networng and Swtchng echnology, Bejng Unversty

More information

Grover s Algorithm + Quantum Zeno Effect + Vaidman

Grover s Algorithm + Quantum Zeno Effect + Vaidman Grover s Algorthm + Quantum Zeno Effect + Vadman CS 294-2 Bomb 10/12/04 Fall 2004 Lecture 11 Grover s algorthm Recall that Grover s algorthm for searchng over a space of sze wors as follows: consder the

More information

Lecture 10 Support Vector Machines II

Lecture 10 Support Vector Machines II Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed

More information

Singular Value Decomposition: Theory and Applications

Singular Value Decomposition: Theory and Applications Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real

More information

Richard Socher, Henning Peters Elements of Statistical Learning I E[X] = arg min. E[(X b) 2 ]

Richard Socher, Henning Peters Elements of Statistical Learning I E[X] = arg min. E[(X b) 2 ] 1 Prolem (10P) Show that f X s a random varale, then E[X] = arg mn E[(X ) 2 ] Thus a good predcton for X s E[X] f the squared dfference s used as the metrc. The followng rules are used n the proof: 1.

More information

Message modification, neutral bits and boomerangs

Message modification, neutral bits and boomerangs Message modfcaton, neutral bts and boomerangs From whch round should we start countng n SHA? Antone Joux DGA and Unversty of Versalles St-Quentn-en-Yvelnes France Jont work wth Thomas Peyrn 1 Dfferental

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

C-wave event automated registration using a nonlinear global search method

C-wave event automated registration using a nonlinear global search method C-wave event automated regstraton usng a nonlnear global search method Shuangquan Chen*,1, Xang-Yang L 1,2 and Xaomng L 1 1 CNPC Keylab of Geophyscal Prospectng, Chna Unversty of Petroleum, Bejng, 102249,

More information

Lecture Nov

Lecture Nov Lecture 18 Nov 07 2008 Revew Clusterng Groupng smlar obects nto clusters Herarchcal clusterng Agglomeratve approach (HAC: teratvely merge smlar clusters Dfferent lnkage algorthms for computng dstances

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

I529: Machine Learning in Bioinformatics (Spring 2017) Markov Models

I529: Machine Learning in Bioinformatics (Spring 2017) Markov Models I529: Machne Learnng n Bonformatcs (Sprng 217) Markov Models Yuzhen Ye School of Informatcs and Computng Indana Unversty, Bloomngton Sprng 217 Outlne Smple model (frequency & profle) revew Markov chan

More information

The Second Anti-Mathima on Game Theory

The Second Anti-Mathima on Game Theory The Second Ant-Mathma on Game Theory Ath. Kehagas December 1 2006 1 Introducton In ths note we wll examne the noton of game equlbrum for three types of games 1. 2-player 2-acton zero-sum games 2. 2-player

More information

Supplement: Proofs and Technical Details for The Solution Path of the Generalized Lasso

Supplement: Proofs and Technical Details for The Solution Path of the Generalized Lasso Supplement: Proofs and Techncal Detals for The Soluton Path of the Generalzed Lasso Ryan J. Tbshran Jonathan Taylor In ths document we gve supplementary detals to the paper The Soluton Path of the Generalzed

More information

Logistic regression with one predictor. STK4900/ Lecture 7. Program

Logistic regression with one predictor. STK4900/ Lecture 7. Program Logstc regresson wth one redctor STK49/99 - Lecture 7 Program. Logstc regresson wth one redctor 2. Maxmum lkelhood estmaton 3. Logstc regresson wth several redctors 4. Devance and lkelhood rato tests 5.

More information

Foundations of Arithmetic

Foundations of Arithmetic Foundatons of Arthmetc Notaton We shall denote the sum and product of numbers n the usual notaton as a 2 + a 2 + a 3 + + a = a, a 1 a 2 a 3 a = a The notaton a b means a dvdes b,.e. ac = b where c s an

More information

Confidence intervals for weighted polynomial calibrations

Confidence intervals for weighted polynomial calibrations Confdence ntervals for weghted olynomal calbratons Sergey Maltsev, Amersand Ltd., Moscow, Russa; ur Kalambet, Amersand Internatonal, Inc., Beachwood, OH e-mal: kalambet@amersand-ntl.com htt://www.chromandsec.com

More information

Negative Binomial Regression

Negative Binomial Regression STATGRAPHICS Rev. 9/16/2013 Negatve Bnomal Regresson Summary... 1 Data Input... 3 Statstcal Model... 3 Analyss Summary... 4 Analyss Optons... 7 Plot of Ftted Model... 8 Observed Versus Predcted... 10 Predctons...

More information

An Experiment/Some Intuition (Fall 2006): Lecture 18 The EM Algorithm heads coin 1 tails coin 2 Overview Maximum Likelihood Estimation

An Experiment/Some Intuition (Fall 2006): Lecture 18 The EM Algorithm heads coin 1 tails coin 2 Overview Maximum Likelihood Estimation An Experment/Some Intuton I have three cons n my pocket, 6.864 (Fall 2006): Lecture 18 The EM Algorthm Con 0 has probablty λ of heads; Con 1 has probablty p 1 of heads; Con 2 has probablty p 2 of heads

More information

Supplementary Material for Spectral Clustering based on the graph p-laplacian

Supplementary Material for Spectral Clustering based on the graph p-laplacian Sulementary Materal for Sectral Clusterng based on the grah -Lalacan Thomas Bühler and Matthas Hen Saarland Unversty, Saarbrücken, Germany {tb,hen}@csun-sbde May 009 Corrected verson, June 00 Abstract

More information

An Introduction to Morita Theory

An Introduction to Morita Theory An Introducton to Morta Theory Matt Booth October 2015 Nov. 2017: made a few revsons. Thanks to Nng Shan for catchng a typo. My man reference for these notes was Chapter II of Bass s book Algebrac K-Theory

More information

Support Vector Machines. Vibhav Gogate The University of Texas at dallas

Support Vector Machines. Vibhav Gogate The University of Texas at dallas Support Vector Machnes Vbhav Gogate he Unversty of exas at dallas What We have Learned So Far? 1. Decson rees. Naïve Bayes 3. Lnear Regresson 4. Logstc Regresson 5. Perceptron 6. Neural networks 7. K-Nearest

More information

Section 8.3 Polar Form of Complex Numbers

Section 8.3 Polar Form of Complex Numbers 80 Chapter 8 Secton 8 Polar Form of Complex Numbers From prevous classes, you may have encountered magnary numbers the square roots of negatve numbers and, more generally, complex numbers whch are the

More information

Advanced Topics in Optimization. Piecewise Linear Approximation of a Nonlinear Function

Advanced Topics in Optimization. Piecewise Linear Approximation of a Nonlinear Function Advanced Tocs n Otmzaton Pecewse Lnear Aroxmaton of a Nonlnear Functon Otmzaton Methods: M8L Introducton and Objectves Introducton There exsts no general algorthm for nonlnear rogrammng due to ts rregular

More information

MODELING TRAFFIC LIGHTS IN INTERSECTION USING PETRI NETS

MODELING TRAFFIC LIGHTS IN INTERSECTION USING PETRI NETS The 3 rd Internatonal Conference on Mathematcs and Statstcs (ICoMS-3) Insttut Pertanan Bogor, Indonesa, 5-6 August 28 MODELING TRAFFIC LIGHTS IN INTERSECTION USING PETRI NETS 1 Deky Adzkya and 2 Subono

More information