Microarray technology. Supervised learning and analysis of microarray data. Microarrays. Affymetrix arrays. Two computational tasks


Eleanore Rodgers


Supervised learning and analysis of microarray data
Devika Subramanian, Comp 47

Microarray technology: quick recap
Proteins: state of the cell.
Gene: codes for a protein.
mRNA: helps assemble a protein.
mRNA levels ~ gene expression level ~ protein levels.
Microarrays measure the expression levels of thousands of genes at a time.
Typical experiment: measure the expression of genes under different conditions and ask what is different at a molecular level, and why.

Microarrays
[Figure: from biological sample to mRNA to raw image, for oligonucleotide-synthesis arrays and for cDNA arrays (clone library, PCR product; test and reference samples labeled with Cy3 and Cy5). Source: Ramaswamy and Golub, Journal of Clinical Oncology.]

Affymetrix arrays
~ 7 oligonucleotides: half perfectly match the mRNA (PM), half have one mismatch (MM).
Raw gene expression is the intensity difference: PM - MM.

Microarray applications
Biological discovery:
- new and better molecular diagnostics
- new molecular targets for therapy
- finding and refining biological pathways
Recent examples:
- molecular diagnosis of leukemia, breast cancer
- appropriate treatment for a genetic signature
- potential new drug targets

Two computational tasks
Classifying gene expressions (this week): What can be learnt about a cell from the set of all mRNA expressed in a cell? Classifying diseases: does a patient have benign prostate cancer or metastatic prostate cancer?
Inferring regulatory networks (next week): What is the circuitry of the cell? What are the genetic pathways of cancer?
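The PM - MM rule on the Affymetrix slide can be illustrated with a minimal Python sketch. The slide only states that raw expression is the intensity difference PM - MM; averaging the differences over the probe pairs of a gene is an assumption made here for illustration, and the function name `raw_expression` and the intensity values are hypothetical.

```python
from statistics import mean


def raw_expression(pm, mm):
    """Average PM - MM intensity difference over the probe pairs of one gene.

    Averaging over probe pairs is an illustrative assumption; the slides
    only state that raw expression is the difference PM - MM.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return mean(p - m for p, m in zip(pm, mm))


# Hypothetical intensities for four probe pairs of a single gene.
pm_intensities = [1200.0, 980.0, 1500.0, 1100.0]
mm_intensities = [300.0, 410.0, 350.0, 280.0]
print(raw_expression(pm_intensities, mm_intensities))
```

Because the value is a difference, it can be negative whenever the mismatch probes are brighter than the perfect-match probes.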


Common approaches
Comparing two measurements at a time: compare the expression of gene G in person 1 with its expression in person 2. Greater than 3-fold change: flag this gene.
Comparing one measurement with a population of measurements: is it unlikely that the new measurement was drawn from the same distribution?

Classification
Use our knowledge of class values (e.g. myeloma vs. normal) to gain added insight.
Find genes that are the best predictors of class.
Can provide useful tests, e.g. for choosing treatment.
If the predictor is comprehensible, it may provide novel insight, e.g. point to a new therapeutic target.

Classifying gene expression data
Microarray chips, then images scanned by laser, then a table of values:

Gene            Value
D658_at         93
D656_cds_at     -7
D656_cds_at     44
D656_cds3_at    33
D6579_at        38
D6598_at        764
D6599_at        537
D66_at          4
D84_at          77

The data
A matrix with one row per sample i and one column per gene j. Each entry is the expression level of gene j for sample i, and each sample has a class.
A new sample is passed to the prediction function, which outputs a prediction: AML or ALL.

Datasets
[Table: one row per sample, with columns Class, Sno, and one column per gene (D658, D63874, D6388, ...); the rows shown are four ALL samples followed by three AML samples.]
Samples are labeled. The red line denotes the vector of expression levels for one sample.

Heat maps

Challenges
Microarray data inherit large experimental and biological variances:
- experimental bias
- tissue heterogeneity
- cross-hybridization
- bad design: confounding effects
Microarray data are sparse:
- high dimensionality of genes
- low number of samples/arrays
- curse of dimensionality
Microarray data are highly redundant: many genes are co-expressed, thus their expression is strongly correlated.
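The two "common approaches" above can be sketched in a few lines of Python. The 3-fold threshold comes from the slide; using a z-score with a cutoff of 3 as the test for "unlikely to be drawn from the same distribution" is an assumption chosen for illustration, and all function names and values are hypothetical.

```python
from statistics import mean, stdev


def fold_change(a, b):
    """Ratio of the larger to the smaller of two positive expression values."""
    if a <= 0 or b <= 0:
        raise ValueError("fold change needs positive expression values")
    return max(a, b) / min(a, b)


def flag_by_fold_change(a, b, threshold=3.0):
    """Flag a gene whose expression differs by more than `threshold`-fold."""
    return fold_change(a, b) > threshold


def z_score(new_value, population):
    """Distance of a new measurement from the population mean, in standard deviations."""
    return (new_value - mean(population)) / stdev(population)


def flag_by_population(new_value, population, z_threshold=3.0):
    """Assumed test: flag the measurement if |z| exceeds the threshold."""
    return abs(z_score(new_value, population)) > z_threshold


print(flag_by_fold_change(100.0, 350.0))                        # 3.5-fold change
print(flag_by_population(20.0, [10.0, 12.0, 11.0, 9.0, 13.0]))  # far from the population
```

Neither rule uses class labels, which is the gap the classification approach on the same slide is meant to fill.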

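Classification

The classification problem
Given examples drawn from two classes, learn to classify new examples into the correct class.
Each point represents a vector of gene expression levels.
[Figure: points labeled Class +1 and Class -1, with new points marked "?".]
Given training data $\{(x_1, y_1), \dots, (x_m, y_m)\}$ with $x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$.
Estimate a function $h: \mathbb{R}^n \to \{+1, -1\}$ such that $h$ will correctly classify new unseen examples from the same underlying probability distribution as the training data.

Classification as optimization
Set S of training data points.
Class H of hypotheses/models.
Optimization problem: find the hypothesis/model h in H that best fits all data.
[Figure: training data and hypothesis space, with the selected hypothesis h.]

Objective function
Minimizing training set error does not imply minimizing true error!
Empirical risk:
$$R_{\text{train}}[h] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left| h(x_i) - y_i \right|$$
True error:
$$R[h] = \int \frac{1}{2} \left| h(x) - y \right| \, dP(x, y)$$

Statistical machine learning theory
Non-asymptotic theory based on finite samples, which bounds true error in terms of training set error.
Gives a tradeoff between the complexity of a model and the amount of data needed to learn it.

A bound on true error
VC dimension theory allows us to relate train and test error for particular function classes.
The key intuition is that the error of a function is not an absolute, but relative to the class of functions it is drawn from.
With probability at least $1 - \delta$:
$$R[h] \le R_{\text{train}}[h] + \sqrt{\frac{VC(h) \left( \log \frac{2m}{VC(h)} + 1 \right) - \log \frac{\delta}{4}}{m}}$$
$VC(h)$ is the VC dimension of the class from which $h$ is drawn, $\delta$ is the probability bound, and $m$ is the size of the training set (Vapnik).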
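To get a feel for the VC bound above, the following sketch evaluates its square-root term for a few values of the training set size and the VC dimension. It assumes natural logarithms and restricts itself to the regime m >= VC(h) so that the quantity under the root is positive; the function names are hypothetical.

```python
import math


def vc_confidence(vc_dim, m, delta):
    """Square-root term of the VC bound (natural logarithms assumed)."""
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    if vc_dim <= 0 or m < vc_dim:
        raise ValueError("this sketch assumes 0 < vc_dim <= m")
    numerator = vc_dim * (math.log(2 * m / vc_dim) + 1) - math.log(delta / 4)
    return math.sqrt(numerator / m)


def true_error_bound(train_error, vc_dim, m, delta=0.05):
    """Upper bound on the true error that holds with probability at least 1 - delta."""
    return train_error + vc_confidence(vc_dim, m, delta)


for m in (1000, 10000, 100000):
    print(m, round(true_error_bound(0.05, vc_dim=10, m=m), 4))
```

The term shrinks as m grows and grows with the VC dimension, which is the tradeoff between model complexity and amount of data that the slide describes.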

Tradeoffs
Simple hypotheses will underfit.
With only a small amount of data, we can only discriminate between a small number of different hypotheses.
As we get more data, we have more evidence, so we can consider more alternative hypotheses.
Complex hypotheses give a better fit to the data.
Best least squares line: cannot take advantage of more data!
Complex hypotheses will overfit.

Adaptive hypothesis space selection
Find the hypothesis h that minimizes error(h) + λ complexity(h).
Regularization.

Support vector machines
A new generation of learning algorithms based on:
- non-linear optimization
- statistics
- functional analysis
They come with theoretical guarantees on performance because the learning problem can be reduced to convex optimization.

Applications
SVMs have been used in a wide variety of tasks and are reputed to be the best for:
- text categorization
- handwriting recognition
- classification of gene expression data
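The regularized objective error(h) + λ complexity(h) can be made concrete with a toy selection among three candidate hypotheses. The error and complexity numbers below are invented for illustration, and `select_hypothesis` is a hypothetical helper, not something from the slides.

```python
def select_hypothesis(candidates, lam):
    """Return the candidate that minimizes error + lam * complexity."""
    return min(candidates, key=lambda c: c["error"] + lam * c["complexity"])


# Hypothetical training errors and complexity scores (e.g. parameter counts).
candidates = [
    {"name": "line", "error": 0.30, "complexity": 2},
    {"name": "cubic", "error": 0.10, "complexity": 4},
    {"name": "degree-9 polynomial", "error": 0.02, "complexity": 10},
]

for lam in (0.0, 0.02, 0.2):
    print(lam, select_hypothesis(candidates, lam)["name"])
```

With λ = 0 the most complex hypothesis wins on training error alone; increasing λ shifts the choice toward simpler hypotheses.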

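History
Introduced in 1992 (Boser, Guyon and Vapnik, COLT 1992).
Very rapid growth since then.
Excellent textbooks and lots of new work, both in theory and applications.
www.kernel-machines.org is a great resource for learning about SVMs.

The problem
Given training data $\{(x_1, y_1), \dots, (x_m, y_m)\}$ with $x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$.
Estimate a function $h: \mathbb{R}^n \to \{+1, -1\}$ such that $h$ will correctly classify new unseen examples from the same underlying probability distribution as the training data.

Linear support vector machines
Consider the class of oriented hyperplanes in $\mathbb{R}^n$:
$$h(x) = \operatorname{sgn}(w \cdot x + b)$$
If the data is linearly separable, then there is a function from this class that separates the $+1$ points from the $-1$ points.

Linear separating hyperplanes
Unfortunately, there are an infinite number of linear hyperplanes that separate the data!

Geometric margin
Let $A$ be a data point $x$ at distance $d$ from the hyperplane, and let $B$ be its projection onto the hyperplane.
Coordinates of $B$:
$$x - d \, \frac{w}{\|w\|}$$
$B$ lies on the line defined by $w \cdot x + b = 0$:
$$w \cdot \left( x - d \, \frac{w}{\|w\|} \right) + b = 0$$
Solving for $d$:
$$d = \frac{w \cdot x + b}{\|w\|}$$

Geometric interpretation of $\lambda$
The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes, and intersects it halfway between them.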
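The geometric-margin derivation above can be checked numerically. The sketch below computes the signed distance d = (w·x + b)/‖w‖ and the projection B of a point onto the hyperplane, then verifies that B satisfies w·x + b = 0. The helper names are hypothetical, and treating sgn(0) as +1 is an assumption.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(w):
    return math.sqrt(dot(w, w))


def decision(w, b, x):
    """h(x) = sgn(w . x + b); sgn(0) is taken as +1 here (an assumption)."""
    return 1 if dot(w, x) + b >= 0 else -1


def geometric_distance(w, b, x):
    """Signed distance d = (w . x + b) / ||w|| from x to the hyperplane."""
    return (dot(w, x) + b) / norm(w)


def project_onto_hyperplane(w, b, x):
    """Point B = x - d * w / ||w||, which lies on the hyperplane."""
    d = geometric_distance(w, b, x)
    n = norm(w)
    return [xi - d * wi / n for xi, wi in zip(x, w)]


w, b, x = [3.0, 4.0], -5.0, [3.0, 4.0]
point_b = project_onto_hyperplane(w, b, x)
print(geometric_distance(w, b, x), point_b, dot(w, point_b) + b)
```

The last printed value is zero up to floating-point rounding, confirming that the projected point lies on the hyperplane.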

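Margin maximization
Let $x^+$ and $x^-$ be the two points on the convex hulls of the positive and negative data which are closest to the maximal margin hyperplane.
1. $w \cdot x^+ + b = +1$
2. $w \cdot x^- + b = -1$
3. $x^+ - x^- = \lambda \, \frac{w}{\|w\|}$
From 1 and 2: $w \cdot (x^+ - x^-) = 2$.
From 3 and the line above: $\lambda = \frac{2}{\|w\|}$.
$\lambda$ is the margin width. It is inversely proportional to $\|w\|$, so to maximize the margin we minimize $\|w\|$.

Optimal separating hyperplane
Among all separating hyperplanes there is one with the maximum margin.
A hyperplane separating the data satisfies
$$w \cdot x_i + b \ge +1 \ \text{ if } y_i = +1, \qquad w \cdot x_i + b \le -1 \ \text{ if } y_i = -1$$
or, in short,
$$y_i \, [w \cdot x_i + b] \ge 1 \quad \text{for } i = 1, \dots, m$$
The optimal hyperplane satisfies the above conditions and has the minimal norm $\|w\|$.

Learning the maximum margin classifier
Find $w$ and $b$ that minimize
$$\tau(w) = \frac{1}{2} \|w\|^2$$
subject to $y_i (w \cdot x_i + b) \ge 1$ for $i = 1, \dots, m$.
Quadratic programming!

Solving the quadratic program
$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$$
$L$ must be minimized with respect to $w$ and $b$, and maximized with respect to the Lagrange multipliers $\alpha_i \ge 0$.
The first derivatives with respect to $w$ and $b$ must vanish at the saddle point.

Solving the quadratic program (continued)
$$\frac{\partial L}{\partial b} = 0, \qquad \frac{\partial L}{\partial w} = 0$$
which yields
$$\sum_{i=1}^{m} \alpha_i y_i = 0, \qquad w = \sum_{i=1}^{m} \alpha_i y_i x_i$$
This means $w$ has an expansion in terms of a subset of the training data, namely those points for which $\alpha_i > 0$. These data points are called support vectors. None of the other data points matter.
The maximal margin hyperplane is completely determined by the support vectors.

Solving the quadratic program (continued)
By the KKT complementarity condition,
$$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0, \quad i = 1, \dots, m$$
Support vectors lie on the margin, because when $\alpha_i > 0$ then $y_i (w \cdot x_i + b) = 1$.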
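The conditions derived above can be verified on a toy problem small enough to solve by hand. The three points and the multipliers below are an illustrative assumption: for this data the maximum margin hyperplane is x1 = 0, so α = (0.5, 0.5, 0) and b = 0 were worked out manually rather than produced by a QP solver. The helper names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weights_from_alphas(alphas, ys, xs):
    """w = sum_i alpha_i * y_i * x_i."""
    n = len(xs[0])
    return [sum(a * y * x[j] for a, y, x in zip(alphas, ys, xs)) for j in range(n)]


def support_vectors(alphas, tol=1e-12):
    """Indices of training points with alpha_i > 0."""
    return [i for i, a in enumerate(alphas) if a > tol]


def margin_width(w):
    """lambda = 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))


# Toy data: two positive points and one negative point in the plane.
xs = [[1.0, 0.0], [-1.0, 0.0], [3.0, 1.0]]
ys = [1, -1, 1]
alphas = [0.5, 0.5, 0.0]  # hand-derived multipliers for this toy problem
b = 0.0

w = weights_from_alphas(alphas, ys, xs)
margins = [y * (dot(w, x) + b) for x, y in zip(xs, ys)]
print("w =", w, "support vectors:", support_vectors(alphas))
print("margin width:", margin_width(w))
print("sum alpha_i y_i:", sum(a * y for a, y in zip(alphas, ys)))
print("complementarity:", [a * (mg - 1) for a, mg in zip(alphas, margins)])
```

Only the first two points carry non-zero multipliers; the third point satisfies its constraint with room to spare and does not affect w.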

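Geometric interpretation of $\lambda$
The optimal hyperplane is determined by the 3 support vectors.

Solution
$$h(x) = \operatorname{sgn}(w \cdot x + b) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i \, (x_i \cdot x) + b \right)$$
The hyperplane decision function uses the support vectors alone, and takes the dot product of the support vectors with $x$.
Note: $b$ is calculated from the KKT complementarity condition.

Cancer classification
38 examples of myeloid and lymphoblastic leukemias (Golub et al., 1999).
Affymetrix human genes, including control genes.
34 examples to test the classifier.
[Figure: test data plotted by d, the distance from the hyperplane.]

Extension to non-separable data
Idea #1: soft margin hyperplane.
[Figure: points on the wrong side of their margin, with slack variables $\xi_i$.]

Soft margin hyperplanes
Minimize
$$\frac{1}{2} \|w\|^2 + c \sum_{i=1}^{m} \xi_i^{\delta}$$
subject to
$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$

Solving the optimization problem
$$L = \frac{1}{2} \|w\|^2 + c \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{m} \mu_i \xi_i$$
For $\delta = 1$ this is a convex optimization problem. We can set up the Lagrangian and solve for $w$, $b$ and the $\xi_i$ using the KKT conditions.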
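For the case δ = 1, the slack variables and the soft-margin objective above can be evaluated directly for a given hyperplane. In the sketch below each slack is set to its smallest feasible value, ξ_i = max(0, 1 - y_i(w·x_i + b)); the data, the choice c = 10, and the function names are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, xs, ys):
    """Smallest feasible slack for each point: max(0, 1 - y_i (w . x_i + b))."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]


def soft_margin_objective(w, b, xs, ys, c):
    """0.5 * ||w||^2 + c * sum of slacks (the delta = 1 case)."""
    return 0.5 * dot(w, w) + c * sum(slacks(w, b, xs, ys))


# Hypothetical data: the second point is inside the margin,
# the fourth is on the wrong side of the hyperplane.
xs = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 1.0]]
ys = [1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0

print(slacks(w, b, xs, ys))
print(soft_margin_objective(w, b, xs, ys, c=10.0))
```

A slack between 0 and 1 marks a point that is correctly classified but inside the margin; a slack above 1 marks a misclassified point.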

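The KKT conditions
For the soft margin Lagrangian $L$ with multipliers $\alpha_i \ge 0$ and $\mu_i \ge 0$:
$$\frac{\partial L}{\partial w} = 0 \quad \text{which yields} \quad w = \sum_{i=1}^{m} \alpha_i y_i x_i$$
$$\frac{\partial L}{\partial b} = 0 \quad \text{which yields} \quad \sum_{i=1}^{m} \alpha_i y_i = 0$$
$$\frac{\partial L}{\partial \xi_i} = 0 \quad \text{which yields} \quad c - \alpha_i - \mu_i = 0$$
$$\mu_i \ge 0 \quad \text{which yields} \quad 0 \le \alpha_i \le c$$
KKT complementarity conditions:
$$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] = 0, \qquad \mu_i \xi_i = 0$$

The solution
From the KKT complementarity conditions we get: support vectors are the training data points for which $\alpha_i > 0$, so that $y_i (w \cdot x_i + b) = 1 - \xi_i$. For $0 < \alpha_i < c$ we have $\mu_i > 0$, hence $\xi_i = 0$ and $y_i (w \cdot x_i + b) = 1$. That is, these support vectors lie on the margin!

Non-linear support vector machines
A generalization to handle the case when the decision function f is known to be not a linear function of the input x.
Central idea: feature spaces.
Map the $x_i$ onto a higher dimensional feature space $\phi(x_i)$.
Then use linear support vector machines to obtain the optimal separating hyperplane in this high dimensional feature space.

Example
$$\phi: \mathbb{R}^2 \to \mathbb{R}^3, \qquad \phi(x_1, x_2) = (z_1, z_2, z_3) = (x_1^2, \ \sqrt{2} \, x_1 x_2, \ x_2^2)$$
A circular region $x_1^2 + x_2^2 < R$ in the input space becomes the linear condition $z_1 + z_3 < R$ in the feature space.

Direct mapping
Direct mapping to a high dimensional space suffers from the curse of dimensionality.
To consider all d-th order products of an n-dimensional vector, we have to consider
$$\frac{(n + d - 1)!}{d! \, (n - 1)!}$$
terms.
For $n = 16 \times 16$ and $d = 5$, we have a feature space of about $10^{10}$ dimensions.

A closer look at the decision function
Note that the decision function is of the form
$$h(x) = \operatorname{sgn}(w \cdot x + b) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i \, (x_i \cdot x) + b \right)$$
We only use dot products of the input vectors for determining the optimal separating hyperplane.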
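The term count on the direct-mapping slide is a binomial coefficient, so it is easy to evaluate exactly. This sketch uses `math.comb` from the Python standard library (available from Python 3.8); the function name `num_monomials` is hypothetical.

```python
import math


def num_monomials(n, d):
    """Number of distinct d-th order products of n variables: (n + d - 1)! / (d! (n - 1)!)."""
    return math.comb(n + d - 1, d)


print(num_monomials(2, 2))        # x1^2, x1*x2, x2^2
print(num_monomials(16 * 16, 5))  # 9525431552, roughly 10^10
```

For two inputs and second-order products the count is 3, which matches the dimension of a feature map such as (x1^2, √2 x1 x2, x2^2).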

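Kernels to the rescue
If we want to find a separating hyperplane in the feature space, we need to compute the dot product of $\phi(x_i)$ and $\phi(x)$.
Define a kernel function $K$ which returns the dot product of the images of its two arguments:
$$K(x, y) = \phi(x) \cdot \phi(y)$$

Non-linear support vector machines
The decision function is of the form
$$h(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i \, \phi(x_i) \cdot \phi(x) + b \right) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i \, K(x_i, x) + b \right)$$
We only use dot products of the input vectors for determining the optimal separating hyperplane.

Examples of kernels
Polynomial kernel:
$$K(x, y) = (x \cdot y)^d$$
Second degree polynomial kernel:
$$\phi(x) = (x_1^2, \ \sqrt{2} \, x_1 x_2, \ x_2^2), \qquad K(x, y) = (x \cdot y)^2 = \phi(x) \cdot \phi(y)$$
Generalized polynomial kernel:
$$K(x, y) = (x \cdot y + c)^d$$

More kernels
Exponential kernel (Gaussian RBF):
$$K(x, y) = e^{-\|x - y\|^2 / (2 \sigma^2)}$$
Tanh kernel:
$$K(x, y) = \tanh(k \, (x \cdot y) - \delta)$$

Wolfe dual
Maximize
$$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)$$
subject to
$$\alpha_i \ge 0, \ i = 1, \dots, m; \qquad \sum_{i=1}^{m} \alpha_i y_i = 0$$
Derived by substituting for $w$ and $b$ into $L(w, b, \alpha)$.
Advantage: the maximization is expressed in terms of dot products of the $x_i$. Used for learning non-linear SVMs.

Mercer condition
Identifies the class of functions for which $K(x, y)$ is the dot product of $\phi(x)$ and $\phi(y)$.
See the excellent tutorial by C. Burges, available from www.kernel-machines.org, for a discussion of this condition.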
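The kernels listed above translate directly into code, and the identity K(x, y) = φ(x)·φ(y) for the second degree polynomial kernel can be checked numerically. This is a minimal sketch: the function names, default parameters, and the multipliers passed to `kernel_decision` are illustrative assumptions, not the output of a trained SVM.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y, d=2, c=0.0):
    """(x . y + c)^d; c = 0 gives the plain polynomial kernel."""
    return (dot(x, y) + c) ** d


def phi2(x):
    """Explicit feature map whose dot product equals (x . y)^2 in two dimensions."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def rbf_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def tanh_kernel(x, y, k=1.0, delta=0.0):
    return math.tanh(k * dot(x, y) - delta)


def kernel_decision(alphas, ys, xs, b, x, kernel):
    """h(x) = sgn(sum_i alpha_i y_i K(x_i, x) + b); sgn(0) is taken as +1."""
    score = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alphas, ys, xs)) + b
    return 1 if score >= 0 else -1


x, y = [1.0, 2.0], [3.0, -1.0]
print(poly_kernel(x, y, d=2), dot(phi2(x), phi2(y)))  # equal up to rounding

# Illustrative multipliers for two training points, one per class.
print(kernel_decision([0.5, 0.5], [1, -1], [[1.0, 0.0], [-1.0, 0.0]], 0.0, [2.0, 1.0], rbf_kernel))
```

The kernel is evaluated in the input space, so the cost does not depend on the dimension of the feature space, which is the point of the kernel trick.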

General support vector machines
We will substitute $\phi(x)$ for $x$ in our previous formulation.
Solutions are of the form:
$$h(x) = \operatorname{sgn}(w \cdot \phi(x) + b) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i \, \phi(x_i) \cdot \phi(x) + b \right) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i \, K(x_i, x) + b \right)$$

SVM demo

Feature selection
SVMs as stated use all genes/features.
Molecular biologists/oncologists seem to be convinced that only a small subset of genes are responsible for particular biological properties, so they want the relevant genes.

Results with feature selection
AML vs ALL: 40 genes, 34/34 correct, 0 rejects. 5 genes, 31/31 correct, 3 rejects of which 1 is an error.
[Figure: test data plotted by d, the distance from the hyperplane, for the two gene sets. Source: Mukherjee et al.]

Two feature selection techniques
Recursive feature elimination (RFE): based upon perturbation analysis, eliminate genes that perturb the margin the least.
Optimize leave one out (LOO): based on the optimized leave-one-out error of an SVM.

Recursive feature elimination
1. Solve the SVM problem for vector w.
2. Rank order the elements of vector w by absolute value.
3. Discard the input features/genes corresponding to those vector elements with small absolute magnitude (for example, the smallest 10%).
4. Retrain the SVM on the reduced gene set and go to step 2.
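The four RFE steps above can be sketched in plain Python. The trainer below is a stand-in chosen to keep the example self-contained: it fits a linear soft-margin SVM by full-batch subgradient descent on the hinge loss rather than by solving the quadratic program. The function names, hyperparameters, and toy data are all illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(xs, ys, c=1.0, lr=0.01, epochs=200):
    """Stand-in trainer: subgradient descent on 0.5*||w||^2 + c * sum of hinge losses."""
    n = len(xs[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0        # gradient of the 0.5*||w||^2 term
        for x, y in zip(xs, ys):
            if y * (dot(w, x) + b) < 1:      # point violates the margin
                for j in range(n):
                    grad_w[j] -= c * y * x[j]
                grad_b -= c * y
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rfe(xs, ys, n_keep, drop_fraction=0.1):
    """Return the indices of the features that survive recursive elimination."""
    active = list(range(len(xs[0])))
    while len(active) > n_keep:
        reduced = [[x[j] for j in active] for x in xs]
        w, _ = train_linear_svm(reduced, ys)                          # step 1
        order = sorted(range(len(active)), key=lambda k: abs(w[k]))   # step 2
        n_drop = max(1, int(drop_fraction * len(active)))
        n_drop = min(n_drop, len(active) - n_keep)
        dropped = set(order[:n_drop])                                 # step 3
        active = [f for k, f in enumerate(active) if k not in dropped]
        # step 4: the loop retrains on the reduced feature set
    return active


# Toy data: feature 0 separates the classes, features 1 to 3 are small noise.
xs = [[2.0, 0.1, -0.2, 0.0], [2.5, -0.1, 0.2, 0.1], [1.8, 0.0, 0.1, -0.1],
      [-2.0, 0.2, 0.1, 0.1], [-2.4, -0.2, -0.1, 0.0], [-1.9, 0.0, 0.0, -0.1]]
ys = [1, 1, 1, -1, -1, -1]
print(rfe(xs, ys, n_keep=1))
```

Dropping a fraction of the features per round, rather than one at a time, keeps the number of retraining rounds manageable when the starting point is thousands of genes.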

Leave one out estimator
Leave one point out, train on the others, test on the left out point.
Repeat this for every point in the training data.
The leave-one-out estimate is almost unbiased.

Leave-one-out feature selection
Use the LOO estimator as an objective function in the search for subsets of features.
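The leave-one-out estimator and its use as a feature selection objective can be sketched as follows. Two assumptions are made to keep the example short and self-contained: a nearest-centroid classifier stands in for the SVM, and the search over feature subsets is exhaustive, which is only feasible for very small feature counts. All names and data are hypothetical.

```python
from itertools import combinations


def loo_error(xs, ys, train, predict):
    """Fraction of points misclassified when each is held out in turn."""
    errors = 0
    for i in range(len(xs)):
        model = train(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        if predict(model, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)


def train_centroids(xs, ys):
    """Stand-in classifier (not an SVM): one centroid per class."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict_centroid(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids,
               key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])))


def best_feature_subset(xs, ys, size, train, predict):
    """Exhaustively pick the feature subset of the given size with the lowest LOO error."""
    def subset_error(subset):
        reduced = [[x[j] for j in subset] for x in xs]
        return loo_error(reduced, ys, train, predict)
    return min(combinations(range(len(xs[0])), size), key=subset_error)


# Toy data: feature 1 separates the classes, feature 0 does not.
xs = [[0.5, 1.0], [0.1, 1.2], [0.9, 0.8], [0.4, 3.0], [0.2, 3.2], [0.8, 2.8]]
ys = [1, 1, 1, -1, -1, -1]
print(loo_error(xs, ys, train_centroids, predict_centroid))
print(best_feature_subset(xs, ys, 1, train_centroids, predict_centroid))
```

Each LOO evaluation retrains the classifier once per sample, so the cost of the search grows quickly with both the number of samples and the number of candidate subsets.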

Machine Learning. Support Vector Machines. Eric Xing , Fall Lecture 9, October 6, 2015

Machine Learning 10-701, Fall 2015. Support Vector Machines. Eric Xing. Lecture 9, October 6, 2015. Reading: Chap. 6&7, C.B. book, and listed papers. © Eric Xing @ CMU, 2006-2015.

What is a good decision boundary? Consider a binary classification