Lecture INF4350, October 1, 2008


Biostatistics
Lecture INF4350, October 1, 2008
Anja Bråthen Kristoffersen
Biomedical Research Group, Department of Informatics, UiO

Goal
- Presentation of data: descriptive tables and graphs
- Sensitivity, specificity, ROC curve
- Hypothesis testing: type I and type II error
- Multiple testing: false positives

Variable types
Categorical variables
- Ordinal, e.g. "Are you smoking?": 1 daily, 2 now and then, 3 stopped last year, 4 stopped earlier, 5 never
- Nominal (discrete variables), e.g. civil state: 1 not married, 2 married, 3 have partner, 4 divorced, 5 widow; DNA (A, T, C, G)
- Binary variables (0, 1)
Continuous variables
- Numbers

Variables can be
- Independent: not influenced by other variables. Not influenced by the event, but could influence the event.
- Dependent: variables influence each other. For instance, the information that a gene is on/off could influence another gene. Which variable depends on, or influences, the other often cannot be determined.

Average (mean)
Properties:
- All observations must be known.
- The observations do not need to be ordered.
- Sensitive to outliers (extreme, untypical values).
Equation:
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Adjusted mean
- Mean based on the central observations: 90-95 % of the observations; the tails (5-10 %) of the data are not included in the calculation.
- Less sensitive to extreme observations.

Combining means
Equation:
$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_m \bar{x}_m}{n_1 + n_2 + \cdots + n_m}$
where $n_i$ is the number of observations behind the mean $\bar{x}_i$. Note that adjusted means cannot be combined like this.

Median
Synonyms: 50th percentile, empirical median.
Properties:
- The observations are ordered.
- The median is the value that divides the observations into two parts.
- Not sensitive to extreme observations.
- Mathematically inconvenient, since the medians of more than one set of observations cannot be combined.
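The location measures above can be tried out in R. The following is a minimal sketch on made-up numbers; the data vectors and variable names are assumptions for illustration only. Note that R's `trim` argument is the fraction removed from each end, so `trim = 0.1` here drops the lowest and the highest 10 % of the observations.

```r
# Made-up observations; 95 is a deliberate outlier
x <- c(4, 5, 5, 6, 7, 7, 8, 9, 10, 95)

mean_x    <- mean(x)              # arithmetic mean, pulled up by the outlier
trimmed_x <- mean(x, trim = 0.1)  # adjusted mean: drops one value in each tail here
median_x  <- median(x)            # 50th percentile, unaffected by the outlier

# Combining group means, weighting each mean by its number of observations
g1 <- c(2, 4, 6)
g2 <- c(10, 20, 30, 40, 50)
n  <- c(length(g1), length(g2))
m  <- c(mean(g1), mean(g2))
combined <- sum(n * m) / sum(n)   # equals the mean of all eight values pooled
```

The combined mean agrees with `mean(c(g1, g2))`, which is exactly the property that trimmed means and medians lack.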

Mode
- The observation that occurs most often.
- Mathematically inconvenient, since the modes of more than one set of observations cannot be combined.

Dispersal measures
Range: $x_{(n)} - x_{(1)}$
- Same unit as the observations
- Sensitive to extreme observations
Quantiles, percentiles
- The number $V_p$ that has a proportion $p$ of the ordered observations below it ($0 < p < 1$)
- Same unit as the observations
Standard deviation: $sd = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
- Never negative
- Outlying observations contribute most
- Same unit as the observations

Standard deviation
- If the data are close to Gaussian distributed, approximately 95% of the population is within $\bar{x} \pm 1.96 \cdot sd$, which approximately corresponds to the 2.5 and 97.5 percentiles.
- A consequence of the properties of the Gaussian distribution.
- Depends on approximate symmetry and unimodality.
- Quick and dirty: $sd \approx \text{range} / 4$. Handy as a first guess of the sd when calculating the necessary number of observations.

Descriptive statistics - tables
- A scalar variable: calculate mean, median and standard deviation
- A categorical variable: descriptive statistics, frequencies
- Two categorical variables: descriptive statistics, cross table
- A scalar variable and a categorical variable: compare mean/median for each category
- Two scalar variables: categorise one of the variables, or use linear regression
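As a rough illustration of the dispersal measures, the sketch below computes the range, a few percentiles and the standard deviation in R, and checks the sd formula against the built-in `sd()`. The data are invented for the example.

```r
# Made-up observations (assumption for illustration)
x <- c(12, 15, 11, 14, 13, 18, 16, 12, 15, 14)
n <- length(x)

rng <- max(x) - min(x)                           # range, same unit as x
q   <- quantile(x, probs = c(0.025, 0.5, 0.975)) # 2.5, 50 and 97.5 percentiles

# Standard deviation written out from the formula, then the built-in version
sd_manual  <- sqrt(sum((x - mean(x))^2) / (n - 1))
sd_builtin <- sd(x)

# "Quick and dirty" first guess of the sd from the range
sd_quick <- rng / 4
```

With so few observations the quick estimate is only a coarse guess; it is meant for planning, not for reporting.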

Always plot your data
A plot tells more than 1000 tests.
- One scalar variable: histogram, box plot. Compare the data with the Gaussian distribution: a Q-Q plot is easier to read and explain than a Gaussian curve drawn upon a histogram.
  [Figure: Q-Q plot and histogram]
- Two scalar variables: scatter plot
- A scalar and a categorical variable: box plot, scatter plot
- Two scalar variables and a categorical variable: scatter plot
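The plot types listed above are all available in base R. The sketch below uses simulated data (the variables `x`, `y` and `g` are assumptions, not lecture data) and draws one plot of each kind.

```r
set.seed(1)
# Simulated data: a scalar x, a second scalar y, and a two-level category g
x <- rnorm(100, mean = 50, sd = 10)
y <- 2 * x + rnorm(100, sd = 5)
g <- factor(rep(c("A", "B"), each = 50))

h <- hist(x, main = "Histogram of x")   # one scalar variable
boxplot(x ~ g, main = "x by group")     # scalar against categorical
qqnorm(x)                               # compare with the Gaussian distribution
qqline(x)
plot(x, y, col = as.integer(g),         # two scalars, coloured by category
     main = "y against x, coloured by group")
```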

Example: probability of getting a boy

Number of babies born   Number of boys   Proportion of boys
10                      8                0.8
100                     55               0.55
1000                    525              0.525
10000                   5139             0.5139
100000                  51127            0.51127
376058 [sic]            1927054          0.51247
17989361                9219202          0.51248
34832051                17857857         0.51268

Relative risk
A = {positive mammogram}
B = {breast cancer within two years}
$\Pr(B \mid A) = 0.1$
$\Pr(B \mid \bar{A}) = 0.0002$
$RR = \frac{\Pr(B \mid A)}{\Pr(B \mid \bar{A})} = \frac{0.1}{0.0002} = 500$

Prevalence, sensitivity, specificity, and more
A = {symptom or positive diagnostic test}
B = {ill}
- $P(B)$ = prevalence of the illness
- $P(A \mid B)$ = sensitivity
- $P(A \mid \bar{B})$ = false positive rate
- $P(\bar{A} \mid \bar{B})$ = specificity
- $P(A \mid \bar{B}) = 1 - P(\bar{A} \mid \bar{B}) = 1 - \text{specificity}$
Then we have
- $P(B \mid A)$ = PPV = PV+ = positive predictive value
- $P(\bar{B} \mid \bar{A})$ = NPV = PV- = negative predictive value

Example: breast cancer diagnostic
A = {positive mammogram}
B = {breast cancer within two years}
$\Pr(B \mid \bar{A}) = 0.0002$, then
$NPV = \Pr(\bar{B} \mid \bar{A}) = 1 - 0.0002 = 0.9998$
$PPV = \Pr(B \mid A) = 0.1$
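The arithmetic of the mammogram example is small enough to check directly. The sketch below reuses the two conditional probabilities from the lecture; the variable names are assumptions.

```r
# Conditional probabilities from the mammogram example
p_b_given_a     <- 0.1     # Pr(B | A): cancer within two years given a positive mammogram
p_b_given_not_a <- 0.0002  # Pr(B | not A): cancer given a negative mammogram

rr  <- p_b_given_a / p_b_given_not_a  # relative risk
ppv <- p_b_given_a                    # positive predictive value
npv <- 1 - p_b_given_not_a            # negative predictive value
```

A positive mammogram raises the risk 500-fold, yet nine out of ten positives are still not followed by cancer within two years, which is why the relative risk and the PPV should be read together.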

Example: breast cancer in different groups
Breast cancer among women 45 to 54 years old:
- Group A: gave first birth before 20 years old
- Group B: gave first birth after 30 years old
Assume that 40 of 10000 in group A and 50 of 10000 in group B get cancer. Coincidence or different risk? What if the numbers were 400 of 100000 and 500 of 100000? Still coincidence?

Traditional 2 × 2 table

Test result   Ill: +     Ill: -     Total
+             a [TP]     b [FP]     a + b
-             c [FN]     d [TN]     c + d
Total         a + c      b + d      a + b + c + d

TP = true positive, FP = false positive, FN = false negative, TN = true negative

Analysis of a 2 × 2 table
Fisher showed that the probability of obtaining any such set of values is given by the hypergeometric distribution:
$p = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{a+b+c+d}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{(a+b+c+d)!\,a!\,b!\,c!\,d!}$
If the p-value is less than a cut-off (e.g. p < 0.05), we reject the null hypothesis and conclude that the true odds ratio is not equal to 1; hence the test result differentiates between ill and not ill.

Example: breast cancer
Table: a = 40, b = 9960, c = 50, d = 9950

    > fisher.test(matrix(c(40, 9960, 50, 9950), ncol = 2, byrow = TRUE))

        Fisher's Exact Test for Count Data

    data:  matrix(c(40, 9960, 50, 9950), ncol = 2, byrow = TRUE)
    p-value = 0.3417
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.5133146 1.2371891
    sample estimates:
    odds ratio
     0.7992074

Table: a = 400, b = 99600, c = 500, d = 99500

    > fisher.test(matrix(c(400, 99600, 500, 99500), ncol = 2, byrow = TRUE))

        Fisher's Exact Test for Count Data

    data:  matrix(c(400, 99600, 500, 99500), ncol = 2, byrow = TRUE)
    p-value = 0.0009314
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.6987355 0.9135934
    sample estimates:
    odds ratio
     0.7991994
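To see how the hypergeometric formula relates to what `fisher.test` reports, the sketch below evaluates the probability of one table in three equivalent ways. The small table and the cell names `n11`, `n12`, `n21`, `n22` (for a, b, c, d) are assumptions chosen so that the factorials stay small; they are not lecture data.

```r
# Hypothetical small 2 x 2 table: n11 = a, n12 = b, n21 = c, n22 = d
n11 <- 3; n12 <- 7
n21 <- 8; n22 <- 2
N <- n11 + n12 + n21 + n22

# Binomial-coefficient form of the formula
p_choose <- choose(n11 + n12, n11) * choose(n21 + n22, n21) / choose(N, n11 + n21)

# Factorial form of the same formula
p_fact <- factorial(n11 + n12) * factorial(n21 + n22) *
          factorial(n11 + n21) * factorial(n12 + n22) /
          (factorial(N) * factorial(n11) * factorial(n12) *
           factorial(n21) * factorial(n22))

# The same probability from R's hypergeometric density
p_hyper <- dhyper(n11, n11 + n12, n21 + n22, n11 + n21)

# fisher.test sums such probabilities over all tables at least as unlikely
# as the observed one, so its two-sided p-value cannot be smaller
tab    <- matrix(c(n11, n12, n21, n22), ncol = 2, byrow = TRUE)
p_test <- fisher.test(tab)$p.value
```

The formula gives the probability of a single table; the p-value of the test is a sum of such terms, which is why the two numbers differ.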

Prevalence, sensitivity, specificity, and more
- Prevalence: $\Pr(B) = \frac{a + c}{a + b + c + d}$
- Sensitivity: $\Pr(A \mid B) = \frac{a}{a + c}$
- Specificity: $\Pr(\bar{A} \mid \bar{B}) = \frac{d}{b + d}$
- PPV: $\Pr(B \mid A) = \frac{a}{a + b}$
- NPV: $\Pr(\bar{B} \mid \bar{A}) = \frac{d}{c + d}$
- Accuracy: $\frac{a + d}{a + b + c + d}$

Testing hypotheses
- Find a null and an alternative hypothesis. Example:
  $H_0$: Expected response is equal in both groups.
  $H_1$: Expected response is different between groups.
- p-value: the probability of observing values at least as extreme as those observed, given that $H_0$ is true.
- Reject $H_0$ if the p-value is less than a given significance level (e.g. 0.05 or 0.01).

Statistical tests
- Some tests assume a certain distribution, e.g. the t-test assumes that the data are (approximately) Gaussian distributed.
- Non-parametric tests are more flexible, e.g. comparing two medians in two independent groups with a non-parametric test (Mann-Whitney).

Statistical test methods
- Two categorical variables: Fisher test, chi-square test
- Two scalar variables: Mann-Whitney, t-test (t.test)
- A scalar and a categorical variable: anova
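The six measures follow mechanically from the four cells of the 2 × 2 table. The helper below is a sketch; the function name `diagnostic_measures` is hypothetical, and the counts in the call are those of the lecture's CT rating example when ratings 2 to 5 count as a positive test.

```r
# Hypothetical helper: diagnostic measures from the cells a, b, c, d
# (named tp, fp, fn, tn here to avoid masking R's c() function)
diagnostic_measures <- function(tp, fp, fn, tn) {
  total <- tp + fp + fn + tn
  c(prevalence  = (tp + fn) / total,
    sensitivity = tp / (tp + fn),
    specificity = tn / (fp + tn),
    ppv         = tp / (tp + fp),
    npv         = tn / (fn + tn),
    accuracy    = (tp + tn) / total)
}

# 51 ill, of which 48 test positive; 58 healthy, of which 33 test negative
m <- diagnostic_measures(tp = 48, fp = 25, fn = 3, tn = 33)
round(m, 2)
```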

The t-test
The t statistic is based on the sample mean and variance.

Mann-Whitney
In order to apply the Mann-Whitney test, the raw data from samples A and B must first be combined into a set of $N = n_a + n_b$ elements, which are then ranked from lowest to highest. These rankings are then re-sorted into the two separate samples. The value of U reported in this analysis is the one based on sample A, calculated as
$U_A = n_a n_b + \frac{n_a (n_a + 1)}{2} - T_A$
where $T_A$ is the observed sum of ranks for sample A, and $n_a n_b + \frac{n_a (n_a + 1)}{2}$ is the maximum possible value of $T_A$.
Convert the U statistic into a p-value.

ANOVA
The t-test and its variants only work when there are two sample pools. Analysis of variance (ANOVA) is a general technique for handling multiple variables, with replicates.

A simple experiment
- Measure the response to a drug treatment in two different mouse strains.
- Repeat each measurement five times.
- Total experiment = 2 strains × 2 treatments × 5 repetitions = 20 arrays.
- If you look for treatment effects using a t-test, then you ignore the strain effects.
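The ranking steps of the Mann-Whitney test can be followed by hand in R. The sketch below uses two invented samples without ties. It also calls `wilcox.test`, whose statistic W is defined as $T_A - n_a(n_a+1)/2$, so that $U_A + W = n_a n_b$; this relation is used only as a cross-check.

```r
# Made-up samples without ties (assumption for illustration)
a <- c(1.1, 2.3, 3.8, 4.0, 5.6)       # sample A
b <- c(2.9, 4.4, 6.1, 7.2, 8.5, 9.0)  # sample B
n_a <- length(a)
n_b <- length(b)

# Rank the combined data, then take back the ranks belonging to sample A
r   <- rank(c(a, b))
t_a <- sum(r[seq_len(n_a)])           # observed rank sum T_A

# Maximum possible T_A, and U_A as defined in the text
t_a_max <- n_a * n_b + n_a * (n_a + 1) / 2
u_a     <- t_a_max - t_a

# R's built-in test converts the statistic into a p-value
res <- wilcox.test(a, b)
w   <- unname(res$statistic)          # W = T_A - n_a (n_a + 1) / 2
p   <- res$p.value
```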

ANOVA lingo
Two-factor design
- Factor: a variable that is under the control of the experimenter (strain, treatment).
- Level: a possible value of a factor (drug, no drug).
- Main effect: an effect that involves only one factor.
- Interaction effect: an effect that involves two or more factors simultaneously.
- Balanced design: an experiment in which each factor and level is measured an equal number of times.

Fixed and random effects
- Fixed effect: a factor for which the levels would be repeated exactly if the experiment were repeated.
- Random effect: a term for which the levels would not repeat in a replicated experiment.
- In the simple experiment, treatment and strain are fixed effects, and we include a random effect to account for biological and experimental variability.

ANOVA model
$E_{ijk} = \mu + T_i + S_j + (TS)_{ij} + \varepsilon_{ijk}$, with $i = 1, \ldots, n$, $j = 1, \ldots, m$, $k = 1, \ldots, p$.
- $\mu$ is the mean expression level of the gene.
- $T$ and $S$ are main effects (treatment, strain) with $n$ and $m$ levels, respectively.
- $TS$ is an interaction effect.
- $p$ is the number of replicates per group.
- $\varepsilon$ represents random error (to be minimized).

ANOVA steps
For each gene on the array:
- Fit the parameters T and S, minimizing ε.
- Test T, S and TS for difference from zero, yielding three F statistics.
- Convert the F statistics into p-values.

F-statistics
- Compare two linear models.
- $F = \frac{\text{Mean Squares Group}}{\text{Mean Squares Error}} = \frac{MSG}{MSE}$
- This compares the variation between groups (group mean to group mean) to the variation within groups (individual values to group means).

F-distribution
p-value = $\Pr(F_{df_1, df_2} > F_{\text{calculated}})$

ANOVA assumptions
- For a given gene, the random error terms are independent, normally distributed and have uniform variance.
- The main effects and their interactions are linear.

[Figure: ANOVA output p-values per gene for strain effects, treatment effects and interaction effects; strains A and B, vehicle versus drug]
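For a single gene, the steps above might look as follows in R. This is a sketch on simulated expression values for the 2 strains × 2 treatments × 5 replicates design; the effect sizes, the noise level and all variable names are assumptions, and `aov` fits a fixed-effects model in which the residual plays the role of the random error ε.

```r
set.seed(42)
# Balanced design: 2 strains x 2 treatments x 5 replicates = 20 arrays
design <- expand.grid(rep       = 1:5,
                      treatment = c("vehicle", "drug"),
                      strain    = c("A", "B"))

# Simulated expression of one gene (assumed effects: drug +1.0, strain B +0.5)
design$expr <- 8 +
  1.0 * (design$treatment == "drug") +
  0.5 * (design$strain == "B") +
  rnorm(nrow(design), sd = 0.3)

# Fit E = mu + T + S + TS + error and extract the ANOVA table
fit <- aov(expr ~ treatment * strain, data = design)
tab <- summary(fit)[[1]]

f_stats <- tab[["F value"]][1:3]   # three F statistics: T, S and TS
p_vals  <- tab[["Pr(>F)"]][1:3]    # and their p-values

# The same numbers by hand: F = MSG / MSE, p = Pr(F_{df1, df2} > F calculated)
f_manual <- tab[["Mean Sq"]][1:3] / tab[["Mean Sq"]][4]
p_manual <- pf(f_manual, tab$Df[1:3], tab$Df[4], lower.tail = FALSE)
```

On a real array the same fit would be repeated for every gene, which is what makes the multiple testing correction necessary.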

Multiple testing correction
This and some following slides are from http://compdiag.molgen.mpg.de/ngfn/docs/2004/mar/differentialgenes.pdf.
- On an array of 10,000 spots, a p-value of 0.0001 may not be significant.
- Bonferroni correction: divide your p-value cutoff by the number of measurements.
- For a significance of 0.05 with 10,000 spots, you need a p-value of $5 \times 10^{-6}$.
- Bonferroni is conservative because it assumes that all genes are independent.
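The Bonferroni numbers above can be reproduced in a few lines. The sketch shows both equivalent views: lowering the cutoff, or inflating the p-value with `p.adjust`. Variable names are assumptions.

```r
alpha <- 0.05    # desired significance level
m     <- 10000   # number of spots on the array

# View 1: divide the cutoff by the number of measurements
cutoff <- alpha / m

# View 2: multiply the p-value by the number of measurements (capped at 1)
p     <- 0.0001
p_adj <- p.adjust(p, method = "bonferroni", n = m)

significant <- p < cutoff   # FALSE: 0.0001 is not significant after correction
```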

Types of errors
- False positive (Type I error): the experiment indicates that the gene has changed, but it actually has not.
- False negative (Type II error): the gene has changed, but the experiment failed to indicate the change.
- Typically, researchers are more concerned about false positives.
- Without doing many (expensive) replicates, there will always be many false negatives.

False discovery rate
Example counts: 5 FP, 13 TP, 33 TN, 5 FN.
- The false discovery rate (FDR) is the percentage of genes above a given position in the ranked list that are expected to be false positives.
- False positive rate: percentage of non-differentially expressed genes that are flagged.
- False discovery rate: percentage of flagged genes that are not differentially expressed.
FDR = FP / (FP + TP) = 5/18 = 27.8%
FPR = FP / (FP + TN) = 5/38 = 13.2%

Controlling the FDR
- Order the unadjusted p-values $p_1 \le p_2 \le \cdots \le p_m$.
- To control the FDR at level α, let
  $j^* = \max \{ j : p_j \le \frac{j}{m} \alpha \}$
  where $j$ is the rank of the gene, $p_j$ is the p-value of the gene, α is the desired significance threshold and $m$ is the total number of genes.
- Reject the null hypothesis for $j = 1, \ldots, j^*$.
- This approach is conservative if many genes are differentially expressed.

FDR example

Rank   (jα)/m    p-value
1      0.00005   0.0000008
2      0.00010   0.0000012
3      0.00015   0.0000013
4      0.00020   0.0000056
5      0.00025   0.0000078
6      0.00030   0.0000235
7      0.00035   0.0000945
8      0.00040   0.0002450
9      0.00045   0.0004700
10     0.00050   0.0008900
...
1000   0.05000   1.0000000

- Choose the threshold at the largest rank $j$ whose p-value is at most (jα)/m. Among the rows shown, ranks 1 to 8 satisfy this condition and ranks 9 and 10 do not.
- Approximately 5% of the examples above the line are expected to be false positives.
(Benjamini & Hochberg, 1995)
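The step-up rule can be applied to the ten p-values shown in the example table. The sketch below assumes α = 0.05 and m = 1000 as in the table, and additionally assumes that none of the p-values not shown (ranks 11 to 1000) falls below its threshold. It also cross-checks against R's `p.adjust` with `method = "BH"`.

```r
alpha <- 0.05
m     <- 1000   # total number of genes in the example

# The ten smallest p-values from the example table, already sorted
p <- c(0.0000008, 0.0000012, 0.0000013, 0.0000056, 0.0000078,
       0.0000235, 0.0000945, 0.0002450, 0.0004700, 0.0008900)

j         <- seq_along(p)    # rank of each gene
threshold <- j * alpha / m   # (j * alpha) / m

# Largest rank whose p-value does not exceed its threshold
j_star <- max(which(p <= threshold))

# Same decision via BH-adjusted p-values (n = m accounts for the unseen genes)
p_bh       <- p.adjust(p, method = "BH", n = m)
n_rejected <- sum(p_bh <= alpha)

# FDR and FPR from the counts 5 FP, 13 TP, 33 TN, 5 FN
fdr <- 5 / (5 + 13)
fpr <- 5 / (5 + 33)
```

Under these assumptions both routes reject the null hypothesis for the first eight genes.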

Bonferroni vs. false discovery rate
- Bonferroni controls the family-wise error rate, i.e. the probability of at least one false positive.
- FDR is the proportion of false positives among the genes that are flagged as differentially expressed.

Diagnostic/ROC curve
Rating of 109 CT images by one radiologist:

Status       Definitely   Probably   Unsure   Probably     Definitely   Total
             normal       normal              not normal   not normal
             (1)          (2)        (3)      (4)          (5)
Normal       33           6          6        11           2            58
Not normal   3            2          2        11           33           51
Total        36           8          8        22           35           109

Criterion 1: all with a rating from 1 to 5 get the diagnosis ill. Finds all the ill ones, but identifies no healthy ones. Sensitivity 1, specificity 0, false positive rate 1.
Criterion 2: all with a rating from 2 to 5 get the diagnosis ill. Finds 48/51 of the ill ones, and identifies 33/58 of the healthy ones. Sensitivity 0.94, specificity 0.57, false positive rate 0.43.

Diagnostic/ROC curve
Rating of 109 CT images by one radiologist (continued):
Criterion 3: all with a rating from 3 to 5 get the diagnosis ill. Finds 46/51 of the ill ones, and identifies 39/58 of the healthy ones. Sensitivity 0.90, specificity 0.67, false positive rate 0.33.
Criterion 4: all with a rating from 4 to 5 get the diagnosis ill. Finds 44/51 of the ill ones, and identifies 45/58 of the healthy ones. Sensitivity 0.86, specificity 0.78, false positive rate 0.22.
Criterion 5: all with a rating of 5 get the diagnosis ill. Finds 33/51 of the ill ones, and identifies 56/58 of the healthy ones. Sensitivity 0.65, specificity 0.97, false positive rate 0.03.
Criterion 6: all with a rating > 5 get the diagnosis ill. Finds none of the ill ones, but identifies all the healthy ones. Sensitivity 0, specificity 1, false positive rate 0.
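The six criteria can be computed in one pass from the rating counts, which also gives the points of the ROC curve. The sketch below uses the counts from the radiologist's table; the variable names are assumptions.

```r
# Counts per rating 1..5 from the CT example
normal     <- c(33, 6, 6, 11, 2)    # healthy patients, total 58
not_normal <- c(3, 2, 2, 11, 33)    # ill patients, total 51

# Criterion k (k = 1..6): diagnose "ill" when the rating is at least k
# Sensitivity: share of ill patients with rating >= k
sens <- rev(cumsum(rev(c(not_normal, 0)))) / sum(not_normal)
# Specificity: share of healthy patients with rating < k
spec <- c(0, cumsum(normal)) / sum(normal)
fpr  <- 1 - spec                    # false positive rate

roc <- data.frame(criterion   = 1:6,
                  sensitivity = round(sens, 2),
                  specificity = round(spec, 2),
                  fpr         = round(fpr, 2))

# ROC curve: sensitivity against false positive rate
plot(fpr, sens, type = "b", xlab = "False positive rate", ylab = "Sensitivity")
```

Each criterion is one point on the curve; moving the cut-off trades sensitivity against specificity.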

Diagnostic/ROC curve

Positive test criterion   Sensitivity   Specificity   False positive rate
1                         1             0             1
2                         0.94          0.57          0.43
3                         0.90          0.67          0.33
4                         0.86          0.78          0.22
5                         0.65          0.97          0.03
6                         0             1.0           0

References
- http://www.medisin.ntnu.no/ikm/medstat/medstat1.07.dag1.pdf
- http://www.medisin.ntnu.no/ikm/medstat/medstat1.07dag2.sanns.pdf
- http://noble.gs.washington.edu/~noble/genome373/ (Microarray analysis: ANOVA and multiple testing correction)