Lecture 8: Linear Regression
May 4, GENOME 560, Spring
Su-In Lee, CSE & GS
suinlee@uw.edu

Goals
- Develop basic concepts of linear regression from a probabilistic framework
- Estimating parameters and hypothesis testing with linear models
- Linear regression in R

Regression
- Technique used for the modeling and analysis of numerical data
- Exploits the relationship between two or more variables so that we can gain information about one of them through knowing values of the other
- Regression can be used for prediction, estimation, hypothesis testing, and modeling causal relationships

Why Linear Regression?
- Suppose we want to model the outcome variable Y in terms of three predictors, X1, X2, X3: $Y = f(X_1, X_2, X_3)$
- Typically we will not have enough data to estimate f directly
- Therefore, we usually have to assume that it has some restricted form, such as linear: $Y = X_1 + X_2 + X_3$
Regression Terminology
$Y = X_1 + X_2 + X_3$
- Y: dependent variable, outcome variable, response variable
- X1, X2, X3: independent variable, predictor variable, explanatory variable
- Example: lung cancer risk (Y) explained by genetic factor, smoking, diet, etc. (X's)
- Example: expression level of gene X (Y) explained by expression levels of X's TFs A, B and C (X's)

Linear Regression is a Probabilistic Model
- Much of mathematics is devoted to studying variables that are deterministically related to one another: $Y = \beta_0 + \beta_1 X$, with intercept term $\beta_0$ and slope $\beta_1$
- But we're interested in understanding the relationship between variables related in a nondeterministic fashion

A Linear Probabilistic Model
- Definition: There exist parameters $\beta_0$, $\beta_1$ and $\sigma^2$ such that for any fixed value of the predictor variable X, the outcome variable Y is related to X through the model equation $Y = \beta_0 + \beta_1 X + \varepsilon$, where $\varepsilon$ is a RV assumed to be $N(0, \sigma^2)$
- Implications: The expected value of Y is a linear function of X, but for a fixed value x, the variable Y differs from its expected value by a random amount
[Figure: lung cancer risk (Y) plotted against smoking (X)]

Variables & Symbols: How is x different from X?
- Capital letter X: a random variable
- Lower case letter x: corresponding values (i.e. the real numbers the RV X maps to)
- For example, X: genotype of a certain locus; x: 0, 1 or 2 (meaning AA, AG and GG, respectively)
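To make the definition concrete, here is a minimal simulation sketch in R. The parameter values (`beta0`, `beta1`, `sigma`) and the sample size are illustrative assumptions, not values from the lecture. It draws Y at three fixed values of x (as with a 0/1/2 genotype code) and checks that the mean of Y moves linearly with x while the spread stays near sigma.

```r
# Simulate from Y = beta0 + beta1 * X + eps, with eps ~ N(0, sigma^2).
# All numeric values below are made up for illustration.
set.seed(1)
beta0 <- 2
beta1 <- 0.5
sigma <- 1

x   <- rep(c(0, 1, 2), each = 2000)          # fixed predictor values
eps <- rnorm(length(x), mean = 0, sd = sigma) # random deviation from the mean
y   <- beta0 + beta1 * x + eps

cond_mean <- tapply(y, x, mean)  # sample mean of Y at each x
cond_sd   <- tapply(y, x, sd)    # sample sd of Y at each x

print(round(cond_mean, 2))  # expected to be close to beta0 + beta1 * x
print(round(cond_sd, 2))    # expected to be close to sigma at every x
```

The conditional means trace out the line, while individual observations scatter around it by a random amount.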
Implications
Formally, let x* denote a particular value of the predictor variable X. Then our linear probabilistic model says:
- $E(Y \mid x^*) = \mu_{Y \mid x^*}$: mean value of Y when X is x*
- $V(Y \mid x^*) = \sigma^2_{Y \mid x^*}$: variance of Y when X is x*

Graphical Interpretation
- Say that X = height and Y = weight
- Then $\mu_{Y \mid x=60}$ is the average weight for all individuals 60 inches tall in the population
[Figure: weight (Y) plotted against height (X)]

One More Example
Suppose the relationship between the predictor variable height (X) and outcome variable weight (Y) is described by a simple linear regression model with true regression line $Y = 7.5 + 0.5X$, with $\varepsilon \sim N(0, \sigma^2)$ and $\sigma = 3$.
- Q: What is the interpretation of $\beta_1 = 0.5$? The expected change in weight (Y) associated with a 1-unit increase in height (X).
- Q: If x = 20, what is the expected value of Y? $\mu_{Y \mid x=20} = 7.5 + 0.5(20) = 17.5$
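The conditional distribution of Y at a fixed x is normal, so both its mean and any tail probability can be checked numerically. The sketch below assumes the example's values are intercept 7.5, slope 0.5, sigma 3 and x = 20, with 22 as the threshold of interest; these are reconstructed from the example and should be treated as assumptions.

```r
# Conditional mean and a tail probability under Y | x ~ N(beta0 + beta1 * x, sigma^2).
# Parameter values are assumed from the height/weight example.
beta0  <- 7.5
beta1  <- 0.5
sigma  <- 3
x_star <- 20

mu <- beta0 + beta1 * x_star   # E(Y | x = 20)

# P(Y > 22 | x = 20) = 1 - Phi((22 - mu) / sigma), via the upper tail of pnorm
p_tail <- pnorm(22, mean = mu, sd = sigma, lower.tail = FALSE)

print(mu)
print(round(p_tail, 3))
```

Using `lower.tail = FALSE` avoids computing `1 - pnorm(...)` by hand.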
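One More Example
- Q: If x = 20, what is P(Y > 22)?
- Given $Y \mid x=20 \sim N(\mu = 17.5, \sigma = 3)$:
  $P(Y > 22 \mid x = 20) = 1 - \Phi\left(\frac{22 - 17.5}{3}\right) = 1 - \Phi(1.5) = 0.067$
  where $\Phi$ means the CDF of the normal distribution N(0, 1)
[Figure: normal density with μ = 17.5 and σ = 3, with the region Y > 22 marked]

Estimating Model Parameters
Where do the parameters $\beta_0$ and $\beta_1$ come from?
- Predicted, or fitted, values are values of y predicted by plugging $x_1, x_2, \ldots, x_n$ into the estimated regression line: $\hat{y}_i = \beta_0 + \beta_1 x_i$
- Residuals are the deviations between observed values (red dots) and predicted values (red line): $e_i = y_i - \hat{y}_i$
[Figure: data points, the line y = β0 + β1 x, and the residuals e1, e2, e3]

Residuals Are Useful!
- The error sum of squares (SSE) can tell us how well the line fits the data: $SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
- Find $\beta_0$ and $\beta_1$ that minimize SSE
- Denote the solutions by $\hat{\beta}_0$ and $\hat{\beta}_1$

Least Squares
Find $\beta_0$ and $\beta_1$ that minimize SSE:
$f(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2$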
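For a single predictor, setting the partial derivatives of the SSE to zero gives the standard closed-form solution: slope = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) and intercept = mean(y) - slope * mean(x). The sketch below applies these formulas to a small made-up data set and compares the result with `lsfit`. The data values and the helper `sse()` are hypothetical, introduced only for illustration.

```r
# Toy data, made up for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1)

# Closed-form least squares estimates for one predictor
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# SSE as a function of the two parameters (hypothetical helper)
sse <- function(beta0, beta1) sum((y - (beta0 + beta1 * x))^2)

fit <- lsfit(x, y)            # least squares fit from base R
print(c(b0, b1))              # closed-form estimates
print(fit$coefficients)       # should agree with the line above

# Moving away from the solution can only increase the SSE
print(sse(b0, b1))
print(sse(b0 + 0.1, b1))
```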
Coefficient of Determination
Important statistic referred to as the coefficient of determination ($R^2$):
$R^2 = 1 - \frac{SSE}{SST}$
- $SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$: error sum of squares for the fitted line y = β0 + β1 x
- $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$, where $\bar{y}$ = average y: the error sum of squares when β0 = avg(y) and β1 = 0
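As a small numerical check, the sketch below computes SSE, SST and the coefficient of determination from an `lsfit` result. The data are made up for illustration. For a single predictor, the value should match the squared sample correlation between x and y.

```r
# Toy data, made up for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1)

fit <- lsfit(x, y)

sse <- sum(fit$residuals^2)     # error sum of squares around the fitted line
sst <- sum((y - mean(y))^2)     # error sum of squares around the flat line y = mean(y)
r2  <- 1 - sse / sst            # coefficient of determination

print(r2)
print(cor(x, y)^2)              # equals r2 in simple linear regression
```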
Multiple Linear Regression
- Extension of the simple linear regression model to two or more independent variables: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
- Example: Expression = Baseline + Age + Tissue + Sex + Error
- Partial regression coefficients: $\beta_i$ is the effect on the outcome variable when increasing the ith predictor variable by 1 unit, holding all other predictors constant

Categorical Independent Variables
- Qualitative variables are easily incorporated in the regression framework through dummy variables
- Simple example: sex can be coded as 0/1
- What if my categorical variable contains three levels? Coding X = 0 if AA, 1 if AG, 2 if GG: NO!
- Previous coding would result in collinearity
- Solution is to set up a series of dummy variables. In general, for k levels you need (k − 1) dummy variables:
  X1 = 1 if AA, 0 otherwise
  X2 = 1 if AG, 0 otherwise

        X1   X2
  AA    1    0
  AG    0    1
  GG    0    0

Hypothesis Testing: Model Utility Test
- The first thing we want to know after fitting a model is whether any of the predictor variables (X's) are significantly related to the outcome variable (Y):
  $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
  $H_A$: at least one $\beta_i \neq 0$
- Let's frame this in our ANOVA framework
- In ANOVA, we partitioned total variance (SST) into two components: SSE (unexplained variation) and SSR (variation explained by linear model)
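The dummy-variable coding and the SST = SSE + SSR partition can be sketched together in R. The genotypes and trait values below are made up for illustration; GG is the baseline level, so the intercept should equal the mean trait value of the GG individuals.

```r
# Made-up genotypes and trait values, for illustration only
genotype <- c("AA", "AG", "GG", "AA", "AG", "GG", "AA", "GG")
y        <- c(5.1, 6.0, 7.2, 4.8, 6.3, 6.9, 5.3, 7.0)

# Three levels need two dummy variables; GG is the baseline (x1 = x2 = 0)
x1 <- as.numeric(genotype == "AA")
x2 <- as.numeric(genotype == "AG")

fit   <- lsfit(cbind(x1, x2), y)
y_hat <- y - fit$residuals            # fitted values

sse <- sum(fit$residuals^2)           # unexplained variation
ssr <- sum((y_hat - mean(y))^2)       # variation explained by the linear model
sst <- sum((y - mean(y))^2)           # total variation

print(fit$coefficients)               # intercept = mean of GG; x1, x2 = offsets for AA, AG
print(c(SST = sst, SSE = sse, SSR = ssr))
```

With an intercept in the model, SSE and SSR should add up to SST, up to floating-point error.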
Model Utility Test
- Partition total variance (SST) into two components: SSE (unexplained variation) and SSR (variation explained by linear model)
- Consider n data points and a model with k predictors (k = 1 for the line y = β0 + β1 x):
  $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$, where $\bar{y}$ = average y
  $SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, with n − (k+1) degrees of freedom: # data points − (# parameters in the model)
  $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$, with k degrees of freedom

ANOVA Formulation of Model Utility Test
- Test statistic: $F = \frac{MSR}{MSE} = \frac{SSR / k}{SSE / [n - (k+1)]}$
- Rejection region: $F \geq F_{\alpha,\, k,\, n-(k+1)}$
- Pick the F distribution function, based on k and n − (k+1)
- Choose the critical value based on α: $F_{\alpha,\, k,\, n-(k+1)}$
- Say that α = 0.05: $\text{Prob}(F > F_{\alpha,\, k,\, n-(k+1)}) = 0.05$

Test for Subsets of Independent Variables
- A powerful tool in multiple regression analysis is the ability to compare two models
- For instance, say we want to compare
  Full Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$
  Reduced Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
- Again, another example of ANOVA:
  $SSE_R$ = error sum of squares for reduced model with l predictors
  $SSE_F$ = error sum of squares for full model with k predictors
  $f = \frac{(SSE_R - SSE_F) / (k - l)}{SSE_F / [n - (k+1)]}$
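The subset test can be sketched in R by computing the statistic by hand and comparing it with the table from `anova()` on two nested `lm` fits. Everything about the data here is an assumption for illustration: two hypothetical predictors `m1` and `m2` coded 0/1/2, a simulated trait, and n = 100. Note that `lm` and `anova` are used instead of `lsfit` because they give the nested-model comparison directly.

```r
# Simulated data, for illustration only
set.seed(560)
n  <- 100
m1 <- sample(0:2, n, replace = TRUE)   # hypothetical predictor 1
m2 <- sample(0:2, n, replace = TRUE)   # hypothetical predictor 2
trait <- 1 + 0.5 * m1 + 0.3 * m2 + rnorm(n)

full    <- lm(trait ~ m1 + m2 + m1:m2)  # k = 3 predictors
reduced <- lm(trait ~ m1 + m2)          # l = 2 predictors
k <- 3
l <- 2

sse_f <- sum(resid(full)^2)      # SSE of the full model
sse_r <- sum(resid(reduced)^2)   # SSE of the reduced model

# f = [(SSE_R - SSE_F) / (k - l)] / [SSE_F / (n - (k + 1))]
f_stat  <- ((sse_r - sse_f) / (k - l)) / (sse_f / (n - (k + 1)))
f_crit  <- qf(0.95, df1 = k - l, df2 = n - (k + 1))   # critical value at alpha = 0.05
p_value <- pf(f_stat, df1 = k - l, df2 = n - (k + 1), lower.tail = FALSE)

tab <- anova(reduced, full)      # same comparison done by R
print(c(f = f_stat, critical = f_crit, p = p_value))
print(tab)
```

The null hypothesis that the extra predictors have zero coefficients is rejected when `f_stat` exceeds `f_crit`, or equivalently when `p_value` is below 0.05.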
Example of Model Comparison
- We have a quantitative trait and want to test the effects at two markers, M1 and M2
  Full Model: Trait = Mean + M1 + M2 + (M1 × M2)
  Reduced Model: Trait = Mean + M1 + M2
- $f = \frac{(SSE_R - SSE_F) / (k - l)}{SSE_F / [n - (k+1)]} = \frac{(SSE_R - SSE_F) / (3 - 2)}{SSE_F / [100 - (3+1)]} = \frac{SSE_R - SSE_F}{SSE_F / 96}$
- Rejection region: $F_{\alpha,\, 1,\, 96}$

How To Do In R
- You can fit a least squares regression using the function mm <- lsfit(x,y)
- The coefficients of the fit are then given by mm$coef
- The residuals are mm$residuals
- And to print out the tests for zero slope just do ls.print(mm)

Input Data
http://www.cs.washington.edu/homes/suinlee/genome560/data/cats.txt
- Data on fluctuating proportions of marked cells in marrow from heterozygous Safari cats
- Proportions of cells of one cell type in samples from cats (taken in our department many years ago)
- 1st column: ID number of the particular cat. You will want to plot the data from one cat. Each cat occupies a contiguous block of rows; for example, one cat's measurements are rows 48:65.
- 2nd column: Time, in weeks from the start of monitoring, that the measurement from marrow is recorded
- 3rd column: Percent of domestic-type progenitor cells observed in a sample of cells at that time
- 4th column: Sample size at that time, i.e. the number of progenitor cells analyzed
[Figures: data plotted for two individual cats]
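The `lsfit` workflow can be tried without downloading anything, as in the sketch below. The data frame is made up and only mimics the four-column layout described for cats.txt; the column names (`id`, `weeks`, `percent`, `n_cells`) and all values are assumptions, not the real data.

```r
# Made-up rows in the four-column layout (NOT the real cats.txt data)
cats <- data.frame(
  id      = rep("catA", 6),              # hypothetical cat ID
  weeks   = c(0, 5, 10, 15, 20, 25),     # time since start of monitoring
  percent = c(52, 48, 55, 47, 50, 44),   # percent of domestic-type cells
  n_cells = c(60, 55, 70, 65, 58, 62)    # sample size at that time
)

one_cat <- cats[1:6, ]                   # select one cat's block of rows

# Regress percent on time for that cat
mm <- lsfit(one_cat$weeks, one_cat$percent)

print(mm$coef)        # intercept and slope ($coef partially matches $coefficients)
print(mm$residuals)   # observed minus fitted values
ls.print(mm)          # summary, including the test for zero slope
```

With the real file, the same calls would apply after reading it in and subsetting the rows for one cat. How the file is delimited and whether it has a header line are not stated here, so they would need to be checked first.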