Limma = linear models for microarray data


Linear models and Limma
Københavns Universitet, 19 August 2009
Mark D. Robinson
Bioinformatics, Walter+Eliza Hall Institute; Epigenetics Laboratory, Garvan Institute
(with many slides taken from Gordon Smyth)

Limma = linear models for microarray data
o Morning Theory
  - Introduction
  - Background correction
  - Moderated t-tests
  - Simple linear models
o Morning Practical
  - Demonstration of smoothing
  - Limma objects (beta7)
  - Background correction and normalization (beta7)
  - Simple experimental designs
  - 2-colour example (beta7)
  - Affymetrix example (cancer)
o Afternoon Theory
  - More advanced designs / linear modeling
  - Moderated F-tests
  - Gene set tests
  - Other analyses limma can do
o Afternoon Practical
  - Factorial design (estrogen)
  - Gene set testing (cancer)
  - Time course experiment (SAHA/depsipeptide)

Expression measures
For array a and probe or gene g:
o Two-colour: $y_{ga} = \log_2(R/G)$
o Affymetrix: $y_{ga}$ = log-intensity (summarized over probes)
o Illumina: $y_{ga}$ = log-intensity (summarized over beads)

Questions of Interest
o What genes have changed in expression? (e.g. between disease/normal, affected by treatment) Gene discovery, differential expression
o Is a specified group of genes all up-regulated in a particular condition? Gene set differential expression
o Can the expression profile predict outcome? Class prediction, classification
o Are there tumour sub-types not previously identified? Do my genes group into previously undiscovered pathways? Class discovery, clustering
Today will cover the first two questions: differential expression.

Microarray Differential Expression Studies
o genes/probes/exons on a chip or glass slide
o Inputs to limma: log-intensities (1-colour data) or log(R/G) log-ratios (2-colour data)
o Several steps to go from raw data to a table of expression values: background correction, normalization
o Idea: fit a linear model to the expression data for each gene

Two colour microarrays

Two-colour data: Log-Intensities
For each probe, with foreground (f) and background (b) intensities:
$R = \log_2(R_f - R_b)$, $G = \log_2(G_f - G_b)$

Two-colour data: Means and Differences
For each probe:
$M = R - G$ (minus), $A = (R + G)/2$ (add)
There are various ways to calculate background. Will often modify to ensure $R_f - R_b > 0$ and $G_f - G_b > 0$.
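The log-intensity and M/A definitions above can be sketched in a few lines of Python. This is an illustration only, not limma code: the function name `ma_values` and the example intensities are assumptions made for the sketch, and plain subtraction is used as the background method.

```python
import math

def ma_values(r_fg, r_bg, g_fg, g_bg):
    """Return (M, A) for one spot from foreground/background intensities.

    Uses plain background subtraction, so both corrected intensities
    must be positive before the log2 transform can be applied.
    """
    r_corrected = r_fg - r_bg
    g_corrected = g_fg - g_bg
    if r_corrected <= 0 or g_corrected <= 0:
        raise ValueError("background-corrected intensities must be positive")
    log_r = math.log2(r_corrected)   # R = log2(Rf - Rb)
    log_g = math.log2(g_corrected)   # G = log2(Gf - Gb)
    m = log_r - log_g                # difference ("minus")
    a = (log_r + log_g) / 2          # mean ("add")
    return m, a

# Hypothetical spot: corrected red = 800, corrected green = 200 (4-fold change).
m, a = ma_values(r_fg=900, r_bg=100, g_fg=300, g_bg=100)
```

The explicit check for non-positive corrected intensities is the reason the slide notes that the background estimate is often modified before taking logs.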

Data Summaries
For each gene, the channel pairs $(G_1, R_1), (G_2, R_2), (G_3, R_3), (G_4, R_4)$ are summarized as $(M_1, A_1), (M_2, A_2), (M_3, A_3), (M_4, A_4)$.

MA Plot

Log-Ratios or Single Channel Intensities?
o Traditional analysis treats log-ratios M = log(R/G) as the primary data, i.e., gene expression measurements are relative
o Alternative approach treats individual channel intensities R and G as primary data, i.e., gene expression measures are absolute (Wolfinger, Churchill, Kerr)
o Single channel approach makes new analyses possible but
  - makes stronger assumptions
  - requires more complex models (mixed models in place of ordinary linear models) to accommodate correlation between R and G on same spot

BG correction affects DE results
o Importance of careful pre-processing and quality control cannot be over-emphasized for microarray data
o Can have dramatic effect on differential expression results
o Consider here the normexp method of adaptive background correction
  - background correction step of the RMA algorithm
  - can also be applied to two colour data

Additive + multiplicative error model
Observe intensity for one probe on one array:
Intensity = background + signal, $I = B + S$
The background carries additive errors and the signal carries multiplicative errors.
This idea underlies variance stabilizing transformations vsn (two colour data) and vst (for Illumina data).

normexp convolution model
Intensity = Background + Signal, with Background $\sim N(\mu, \sigma^2)$ and Signal $\sim \mathrm{Exponential}(\alpha)$

Conditional expectation under normexp model

normexp background correction
o Estimate the three parameters
o Replace $I$ with $E(S \mid I)$
o For Affymetrix data, $I$ is the Perfect Match data intensity
o For two-colour data, $I = R_f - R_b$ or $I = G_f - G_b$
o In the RMA algorithm, parameter estimation uses an ad hoc density kernel method
o In limma (two colour), parameter estimation maximises the saddlepoint approximation to the likelihood
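The replacement of $I$ by $E(S \mid I)$ can be sketched as follows. This is not limma's implementation. It assumes that Exponential($\alpha$) is parametrized by its mean $\alpha$, in which case the signal given the intensity is a normal distribution with mean $I - \mu - \sigma^2/\alpha$ and standard deviation $\sigma$ truncated to positive values, so the conditional expectation is the mean of that truncated normal. The parameter values are hypothetical, and parameter estimation (the hard part) is not shown.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def normexp_signal(intensity, mu, sigma, alpha):
    """E(S | I) for background ~ N(mu, sigma^2) and signal ~ exponential with mean alpha.

    Assumption of this sketch: alpha is the exponential mean. Given I, the signal
    is N(mu_si, sigma^2) truncated to S > 0, whose mean is returned.
    Note: for intensities very far below mu the cdf underflows to zero.
    """
    mu_si = intensity - mu - sigma * sigma / alpha
    z = mu_si / sigma
    return mu_si + sigma * normal_pdf(z) / normal_cdf(z)

# Hypothetical parameters: background mean 100, background sd 10, mean signal 200.
low = normexp_signal(80.0, mu=100.0, sigma=10.0, alpha=200.0)     # below the background mean
high = normexp_signal(1000.0, mu=100.0, sigma=10.0, alpha=200.0)  # far above the background
```

Even for an observed intensity below the background mean the corrected value stays positive, while for bright spots the correction reduces to subtracting a constant.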

PM data on log2 scale: raw and fitted model
Background corrected intensity = E(Signal | Observed Intensity)

Adaptive background correction produces positive signal
[figure: E(Signal | Intensity) plotted against Observed Intensity]

Offsets to stabilise the variance
Background correction, then log-ratios

Why do offsets stabilize the variance?
Offset reduces variability at low intensities

Why do offsets stabilize the variance?
o $\log_2(800/100) = \log_2(8/1) = 3$ (8-fold)
o Additive noise affects small numbers more
o Offsets introduce bias: $\log_2[(80+10)/(10+10)] = \log_2(4.5) \approx 2.17$, compared with $\log_2(80/10) = 3$ without the offset
o But the tradeoff (drop in variance for increase in bias) is usually worth it

A self-self experiment: two background methods
[figure; some spots not plotted]

First statistical point: choose background correction to stabilise the variance as a function of intensity

Comparison of 2-colour BG correction methods
[figure: false discoveries (limma) against genes selected; Ritchie et al]

References
o Silver et al. (2009). Microarray background correction: maximum likelihood estimation for the normal-exponential convolution. Biostatistics. [complete mathematical development of the saddlepoint approximation]
o Ritchie et al. (2007). A comparison of background correction methods for two-colour microarrays. Bioinformatics. [shows normexp performs best for 2-colour data]
o Irizarry et al. (2003). Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics. [describes RMA BG correction, but doesn't give much detail of the normexp convolution model]
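The offset arithmetic above, together with the claim that additive noise affects small numbers more, can be checked with a short sketch. The helper names and the noise size of 5 intensity units are assumptions chosen for illustration.

```python
import math

def log_ratio(r, g, offset=0.0):
    """log2 ratio of two background-corrected intensities after adding an offset."""
    return math.log2((r + offset) / (g + offset))

# The slide's numbers: the same 8-fold change at high and low intensity.
high = log_ratio(800, 100)            # log2(8) = 3
low = log_ratio(80, 10)               # log2(8) = 3
low_offset = log_ratio(80, 10, 10)    # log2(90/20) = log2(4.5), biased towards 0

def spread(r, g, noise, offset=0.0):
    """Range of the log-ratio when additive noise pushes the channels apart or together."""
    widest = log_ratio(r + noise, g - noise, offset)
    narrowest = log_ratio(r - noise, g + noise, offset)
    return widest - narrowest

spread_high = spread(800, 100, 5)             # additive noise barely matters at high intensity
spread_raw = spread(80, 10, 5)                # same noise, low intensity, no offset
spread_offset = spread(80, 10, 5, offset=10)  # same noise, low intensity, offset of 10
```

With these numbers the offset roughly halves the spread of the low-intensity log-ratio at the cost of shrinking the fold change from 3 to about 2.17 on the log2 scale, which is the bias/variance tradeoff the slide describes.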

Normalization
BG correction and normalization are closely connected

Two-colour Normalization
Even after BG correction, some effects remain

One-colour Normalization
Similarly for single channel data, adjustments need to be made for all samples to be comparable

Moderated t-tests

Borrowing information across genes
o Small data sets: few arrays, inference for individual genes is uncertain
o Curse of dimensionality: many tests, need to adjust for multiple testing, loss of power
o Benefit of parallelism: same model is fitted for every gene. Can borrow information from one gene to another

Hard and soft shrinkage
o Hard: simplest way to borrow information is to assume that one or more parameters are constant across genes
o Soft: smooth genewise parameters towards a common value in a graduated way, e.g., Bayes, empirical Bayes, Stein shrinkage

A very common experiment
o Wild-type mouse x 2, Mutant mouse x 2
o n1 = n2 = 2 Affymetrix arrays, 25,000 probe-sets
o Which genes are differentially expressed? (e.g. Gene X)
o Ordinary t-tests give very high false discovery rates
o Residual df = 2

Another very common experiment
o Array 1: Wild-type mouse 1 and Mutant mouse 1; Array 2: Wild-type mouse 2 and Mutant mouse 2
o n = 2 two-colour arrays, 30,000 probes
o Which genes are differentially expressed?
o Ordinary t-tests give very high false discovery rates
o Residual df =

Small sample size, many tests
The problem:
o These experiments would be under-powered even with just one gene
o Yet we want to test differential expression for each of 50k genes, hence lots of multiple testing and further loss of power
The solution:
The same statistical model is being fitted for every gene in parallel. Can borrow strength from other genes.

t-tests with common variance
Use the residual standard deviation pooled across genes. More stable, but ignores gene-specific variability.

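A better compromise
Shrink standard deviations towards a common value: moderated t-statistics.

Shrinkage of standard deviations (d = degrees of freedom)
o Genewise standard deviations $s_1, s_2, \ldots, s_G$ on $d$ degrees of freedom give the ordinary $t_g$
o The common value $s_0$ on $d_0$ degrees of freedom gives $t_{g,\mathrm{pooled}}$
o The shrunken standard deviations $\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_G$ give the moderated $\tilde{t}_g$
The data decide whether $\tilde{t}_g$ should be closer to $t_{g,\mathrm{pooled}}$ or to $t_g$.

Why does it work?
o We learn what is the typical variability level by looking at all genes, but allow some flexibility from this for individual genes
o Adaptive

Hierarchical model for variances
Data, Prior, Posterior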

Posterior Statistics
Posterior variance estimators
Moderated t-statistics
Baldi & Long 2001, Wright & Simon 2003, Smyth

Exact distribution for moderated t
An unexpected piece of mathematics shows that, under the null hypothesis,
$\tilde{t}_g \sim t_{d_0 + d_g}$
The degrees of freedom add! The Bayes prior in effect adds $d_0$ extra arrays for estimating the variance.
Wright and Simon 2003, Smyth

More on empirical Bayes statistics

Hierarchical model for means
Data, Prior
Lönnstedt and Speed 2002, Smyth
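The shrinkage of variances can be made concrete with a small sketch. It assumes the usual empirical Bayes form in which the posterior variance is the degrees-of-freedom weighted average $\tilde{s}_g^2 = (d_0 s_0^2 + d_g s_g^2)/(d_0 + d_g)$ and the moderated statistic is $\tilde{t}_g = \hat\beta_g / (\tilde{s}_g \sqrt{v_g})$. The hyper-parameters $s_0^2$ and $d_0$ are hard-coded hypothetical values here; in practice they are estimated from all genes.

```python
import math

def posterior_variance(s2, d, s0_2, d0):
    """Degrees-of-freedom weighted average of the genewise and prior variances."""
    return (d0 * s0_2 + d * s2) / (d0 + d)

def moderated_t(beta_hat, s2, d, s0_2, d0, v):
    """Moderated t: the ordinary t with the genewise variance replaced by the posterior variance.

    v is the unscaled variance of the coefficient, e.g. 1/n1 + 1/n2 for two groups.
    """
    s2_tilde = posterior_variance(s2, d, s0_2, d0)
    return beta_hat / math.sqrt(s2_tilde * v)

# Two hypothetical genes with the same log fold change from a 2 vs 2 comparison
# (d = 2 residual df, v = 1/2 + 1/2 = 1). Gene X has a tiny variance by chance.
prior = dict(s0_2=0.2, d0=4)   # hypothetical hyper-parameters
ordinary_x = 1.0 / math.sqrt(0.01)                                   # ordinary t = 10
ordinary_y = 1.0 / math.sqrt(0.25)                                   # ordinary t = 2
moderated_x = moderated_t(1.0, s2=0.01, d=2, v=1.0, **prior)
moderated_y = moderated_t(1.0, s2=0.25, d=2, v=1.0, **prior)
```

The ordinary statistics differ five-fold purely because of the variance estimates, while the moderated statistics are much closer together and would be referred to a t-distribution on $d_0 + d = 6$ degrees of freedom.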

Posterior Odds
Posterior odds of differential expression: a monotonic function of $\tilde{t}$.
Hence $\tilde{t}$ gives the best possible ranking of genes.
Lönnstedt and Speed 2002, Smyth

Estimating Hyper-Parameters
Closed form estimators with good properties are available:
o for $s_0$ and $d_0$ in terms of the first two moments of $\log s^2$
o for $c_0$ in terms of quantiles of the $\tilde{t}_g$

Marginal Distributions
Under the usual likelihood model, $s_g$ is independent of the estimated coefficients. Under the hierarchical model, $s_g$ is independent of the moderated t-statistics instead, where
$$\tilde{t}_g \sim \begin{cases} t_{d_0 + d} & \text{with prob } 1 - p \\ \sqrt{1 + c_0/c}\; t_{d_0 + d} & \text{with prob } p \end{cases}$$

Moment estimators for $s_0$ and $d_0$
Marginal moments of $\log s^2$ lead to estimators of $s_0$ and $d_0$: first estimate $d_0$ by solving the moment equation, and finally estimate $s_0$.

Quantile Estimation of $c_0$
Let $r$ be the rank of $\tilde{t}_g$ in descending order, and let $F(\cdot)$ be the distribution function of the t-distribution. Can estimate $c_0$ by equating empirical to theoretical quantiles.
Get an overall estimator of $c_0$ by averaging the individual estimators from the top $p/2$ proportion of the $\tilde{t}_g$.

Short note on multiple testing

Multiple testing and adjusted p-values
o Traditional method in statistics is to control the family-wise error rate, e.g., by Bonferroni.
o Holm's method is an improved (step-down) modification of Bonferroni.
o Controlling the false discovery rate (FDR) is more appropriate in microarray studies.
o The Benjamini and Hochberg method controls the expected FDR for independent or weakly dependent test statistics. Simulation studies support its use for microarray data.
o All methods can be implemented in terms of adjusted p-values.

End of morning theory: Summary
o Background correction and normalization are important considerations: normexp, offsets
o Moderation will generally help: moderation of variances is very effective
o Convenient model gives known null distribution
o Multiple testing
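The three adjustment methods named above can each be written as a function that maps raw p-values to adjusted p-values. The following is a plain Python sketch of the standard definitions (not limma's or R's code); the example p-values are made up.

```python
def bonferroni(pvalues):
    """Bonferroni: multiply every p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def holm(pvalues):
    """Holm step-down: the i-th smallest p-value is multiplied by (m - i + 1),
    then a running maximum keeps the adjusted values monotone."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up: the i-th smallest p-value is multiplied by m / i,
    then a running minimum from the largest p-value downwards keeps monotonicity."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, pvalues[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]   # hypothetical raw p-values for four genes
adj_bonferroni = bonferroni(raw)
adj_holm = holm(raw)
adj_bh = benjamini_hochberg(raw)
```

A gene is then called significant at level 0.05 under a given method whenever its adjusted p-value is below 0.05, which is what makes the adjusted p-value formulation convenient.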

More complex experiments

Linear models
o More complex microarray experiments can be represented by linear models
o For one-channel platforms, the linear model can be set up using the usual univariate linear model formulae
o For two-colour platforms, the linear models have some special properties

Linear Models
o In general, need to specify:
  - Dependent variable
  - Explanatory variables (experimental design, covariates, etc.)
o More generally, the model relates a vector of observed data to a design matrix and a vector of parameters to estimate

Linear Models for microarrays
o Analyse all arrays together, combining information in an optimal way
o Combined estimation of precision
o Extensible to arbitrarily complicated experiments
o Design matrix: specifies RNA targets used on arrays
o Contrast matrix: specifies which comparisons are of interest

Designs → Linear Models

Two-colour arrays, $y = \log_2(R/G)$.

o Direct comparison of A and B on two arrays in opposite orientation, $\beta = B - A$:
$$E\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix} \beta = \begin{pmatrix} \beta \\ -\beta \end{pmatrix}$$

o Three arrays involving Ref, A and B, with $\beta_1 = A - \mathrm{Ref}$ and $\beta_2 = B - A$:
$$E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \beta_1 \\ -\beta_1 \\ \beta_1 + \beta_2 \end{pmatrix}$$

o Three arrays connecting A, B and C, with $\beta_1 = B - A$ and $\beta_2 = C - A$:
$$E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \beta_1 \\ -\beta_1 + \beta_2 \\ -\beta_2 \end{pmatrix}$$

One-colour arrays: Wild-type mouse x 2, Mutant mouse x 2
o $\beta_1$ = wt log-expression, $\beta_2$ = mutant − wt
o $E[y_1] = E[y_2] = \beta_1$
o $E[y_3] = E[y_4] = \beta_1 + \beta_2$
Design matrices for 1-colour arrays are easier to specify!

Matrix Multiplication
The right-hand sides above are obtained by multiplying each design matrix by its coefficient vector.

Linear Model Estimates
Obtain a linear model for each gene g. Estimate the models to get
o coefficients
o standard deviations
o standard errors
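How a design matrix turns log-ratios into coefficient estimates can be sketched with ordinary least squares for one gene. The sketch below solves the normal equations by hand for a two-column design and uses the three-array A, B, C design with rows (1, 0), (−1, 1), (0, −1). The function name and the log-ratio values are hypothetical, and no standard errors or moderation are computed.

```python
def fit_two_coefficients(design, y):
    """Least squares estimates for a design matrix with exactly two columns.

    Solves the 2x2 normal equations (X'X) b = X'y in closed form.
    """
    a11 = sum(row[0] * row[0] for row in design)
    a12 = sum(row[0] * row[1] for row in design)
    a22 = sum(row[1] * row[1] for row in design)
    b1 = sum(row[0] * yi for row, yi in zip(design, y))
    b2 = sum(row[1] * yi for row, yi in zip(design, y))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("design matrix is not of full column rank")
    beta1 = (a22 * b1 - a12 * b2) / det
    beta2 = (a11 * b2 - a12 * b1) / det
    return beta1, beta2

# Coefficients: beta1 = B - A, beta2 = C - A.
# Rows: array 1 measures B - A, array 2 measures C - B, array 3 measures A - C.
loop_design = [(1, 0), (-1, 1), (0, -1)]

# Hypothetical log-ratios for one gene, generated exactly from beta1 = 1, beta2 = 3.
exact_y = [1.0, 2.0, -3.0]
beta1, beta2 = fit_two_coefficients(loop_design, exact_y)

# The same gene with a little measurement noise on each array.
noisy_y = [1.2, 1.9, -3.1]
noisy_beta1, noisy_beta2 = fit_two_coefficients(loop_design, noisy_y)
```

Every array contributes to both estimates, which is the sense in which the linear model combines information from all arrays rather than analysing them pair by pair.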

Contrasts
A contrast is any linear combination of the coefficients $\beta_j$ which we want to test equal to zero. The contrasts are defined through the contrast matrix $C$.

Pax5: example of saturated design
o Four RNA sources: Wt, Pax5-/-, IL-7 removed, Rag1-/-
o Robust design can tolerate failure of some of the arrays

Regression Analysis
Choose 3 comparisons between the 4 RNA sources to be the coefficients of the linear model, e.g.,
- PW: Pax5-/- vs Wt
- RW: Rag1-/- vs Wt
- IW: IL-7 withdrawn vs Wt
For each gene, fit a linear model with a coefficient for each contrast. Any other comparisons of interest can be extracted from the linear model as contrasts.

Exercise: Fill in the design matrix.
$E(m_1, \ldots, m_{12})^T = X\,(PW, RW, IW)^T$, where the legible rows of $X$ are:

Array   PW   RW   IW
m2       1    0    0
m4       0    0    1
m6      -1    0    1
m7       1   -1    0
m9       0   -1    0
m10      0    1    0
m11      0   -1    1
m12      0    1   -1

Can be fitted using robust regression, but there are problems with the standard errors, as Patty Solomon has observed.
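Extracting "any other comparison" from the fitted coefficients amounts to taking linear combinations of them. A minimal sketch, assuming the three coefficients PW, RW and IW have already been estimated for one gene (the numeric values and contrast names below are hypothetical):

```python
def apply_contrasts(coefficients, contrast_matrix):
    """Return each contrast as a weighted sum of the fitted coefficients.

    coefficients: dict mapping coefficient name to its estimate.
    contrast_matrix: dict mapping contrast name to a dict of weights;
    coefficients not mentioned in a contrast get weight zero.
    """
    return {
        name: sum(weights.get(coef, 0.0) * value for coef, value in coefficients.items())
        for name, weights in contrast_matrix.items()
    }

# Hypothetical fitted log fold changes for one gene, each relative to wild type.
coefficients = {"PW": 1.5, "RW": 0.25, "IW": -0.5}

# Pax5-/- vs Rag1-/- is (Pax5 - Wt) - (Rag1 - Wt) = PW - RW, and so on.
contrast_matrix = {
    "Pax5_vs_Rag1": {"PW": 1.0, "RW": -1.0},
    "IL7_vs_Pax5": {"IW": 1.0, "PW": -1.0},
}

estimates = apply_contrasts(coefficients, contrast_matrix)
```

Testing a contrast also needs its standard error, which depends on the covariance of the coefficient estimates and is not covered by this sketch.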

Example of factorial design
Expected log-expression in each group:
o WT.P1: $\mu$
o WT.P11: $\mu + a_1$
o WT.P21: $\mu + a_1 + a_2$
o MT.P1: $\mu + b$
o MT.P11: $\mu + a_1 + b + a_1.b$
o MT.P21: $\mu + (a_1 + a_2) + b + (a_1 + a_2).b$
[design matrix relating the arrays $y_1, \ldots, y_7$ to the coefficients $a_1, a_2, b, a_1b, a_2b$]

Moderated F-statistics

Moderated F-Statistic
o The idea of shrinking the variance extends immediately to multiple contrasts
o MST = mean sum of squares between treatments
Wright & Simon 2003, Smyth

Doubly shrunk F statistic
o The moderated F is not a monotonic function of the posterior odds
o A doubly shrunk F statistic can be shown to have the desired relationship to the posterior odds
o Improves further the gene ranking
Tai and Speed 2006

Single or double shrinkage
o Shrinking the variances only is enough when comparing two groups
o When comparing 3 or more groups, further gain can be had by shrinking the $\beta$ also (recall Stein estimator needs at least 3 means)

Functional category analysis
o Used on a set of genes deemed to be differentially expressed
o Asks the question: is my set of genes enriched for a particular molecular function?
o Useful for establishing what pathways / types of genes are affected
o Nowadays largely superseded by gene set tests

Overlap statistics
o Question: Say you have a set of 85 genes (of a total genes) known to be associated with function X. Calculate the probability of randomly selecting 40 or more of those genes in a list of 100 DE genes.

Overlap statistics
o Answer: Hypergeometric (i.e. the urn problem).
[urn diagram with white and black balls, labelled N = 19915, m = 85, n = 100, k]

Gene set tests

Gene sets
o Test significance of an (a priori specified) group of genes
o The genes might belong to a known pathway or might be the top genes from a related experiment
o The set might be significant even if individual genes are not
o Gene set enrichment analysis (GSEA) originated by Mootha et al PNAS 2003 and Subramanian et al PNAS

Available gene set methods
o GSEA: gene set enrichment analysis. Complex method using Kolmogorov-Smirnov type tests and sample permutation. Needs two groups, many arrays, many genes and many sets.
o GSA: gene set analysis. Uses a combination of permutation of samples and standardization across genes. More powerful. Still needs two groups, many genes and many sets.
o GST: gene set tests using the Wilcoxon test. Uses randomization over genes. Applicable to linear models and small samples, but can be over-optimistic if the genes in the set are highly correlated.
o Now, rotation-based gene set tests
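The urn calculation can be sketched with exact integer binomial coefficients from the Python standard library. The total number of genes is not legible in the question as transcribed, so the value 20000 used below is purely an illustrative assumption (85 genes in the set plus 19915 others, one possible reading of the urn labels). The function name is also hypothetical.

```python
from math import comb

def overlap_tail_probability(total_genes, set_size, list_size, min_overlap):
    """P(overlap >= min_overlap) when list_size genes are drawn at random
    without replacement from total_genes, of which set_size are in the set.

    This is the upper tail of the hypergeometric distribution, computed with
    exact integer arithmetic and a single final division.
    """
    upper = min(set_size, list_size)
    favourable = sum(
        comb(set_size, x) * comb(total_genes - set_size, list_size - x)
        for x in range(min_overlap, upper + 1)
    )
    return favourable / comb(total_genes, list_size)

# Small case that can be checked by hand: (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40/120.
small = overlap_tail_probability(total_genes=10, set_size=4, list_size=3, min_overlap=2)

# The slide's question, with an ASSUMED total of 20000 genes.
p_overlap = overlap_tail_probability(total_genes=20000, set_size=85, list_size=100, min_overlap=40)
```

Under that assumed total the expected overlap is well below one gene, so an overlap of 40 or more has a vanishingly small probability under random selection.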

Gene set tests
o A priori subset of genes: X1, X2, X3, ..., Xn
o All microarray probes, ranked by a test statistic of interest: t1, t2, t3, t4, ...
o Look for ranks of the set genes amongst the test statistics

Viewing gene sets
[figure: panels for cell adhesion genes and genes regulated by MYB]

What's the hypothesis?
o Two major types of gene set tests: competitive or self-contained
o Competitive: genes in the set tend to be more strongly DE than randomly chosen genes
o Self-contained: at least some genes in the set are truly DE

Permutation
o Competitive gene set tests are usually tested by permuting genes, but this ignores inter-gene correlations
o Self-contained gene set tests are usually tested by permuting arrays, but this is limited to two-group comparisons with large numbers
Goeman & Bühlmann, Bioinformatics

Rotation gene set tests (ROAST)
o Self-contained hypothesis
o The first test suitable for small samples which correctly accounts for inter-gene correlation
o Can handle complex linear model designs, including array weights, random effects etc

Two steps: projection and rotation
o Project data onto space orthogonal to nuisance parameters in the linear model
o Random rotation of the orthogonal residuals provides fractional permutation, avoids granularity of p-values
o Assumes multivariate normality, but proves to be highly robust against deviations from normality

Set summary statistics
o Compute empirical Bayes t-statistics for each gene in set
o Convert to z-statistics
o Mean of $z^2$: good power, even when only a subset of genes respond; not robust against non-normality
o Mean50: mean of top half of z: good power, robust

References
o Subramanian et al (2005). A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102.
o Efron and Tibshirani (2007). On testing the significance of sets of genes. Ann Appl Stat 1.

Hard shrinking: examples

Multi-level models I: duplicate spots
o Common correlation model for within-array duplicate spots
  Smyth et al (2005). The use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics 21.
o Common correlation models for single channel analysis of two-colour microarray data
  Smyth, G. K. (2005). Individual channel analysis of two-colour microarray data. 55th Session of the International Statistics Institute, 5-12 April 2005, Sydney, Australia.

Common correlation model
o Given a blocking factor with variance component $\sigma_b^2$, focus on within-block correlations
o The common correlation model assumes the within-block correlation is the same for every gene
o Has proved effective for technical blocking factors for which correlations are high

Duplicate Spots
o If the clone library is not too large, it is often possible to print each gene more than once on an array
o Duplicates are always side-by-side or a fixed distance apart

Duplicate spots are correlated
o Duplicate spots are a form of technical replication and share lots of common causes
o Cannot be treated as replicates on separate arrays: log-ratios from duplicate spots are correlated
o How best to use duplicate spots? Usual approach is simply to average them
[figure: genes printed in duplicate pairs; one pair highlighted]

Common Correlation Model
o Assume the between-duplicate correlation is the same for every gene
o Justified by the belief that the correlation springs mainly from spatial proximity
o Improves estimation of variances

Consequences for individual genes
o If the number of genes is large, then the estimator of $\rho$ is very accurate, so $\rho$ may be treated as known as far as inference for each individual gene is concerned
o This doesn't change estimation of $\beta_g$ but greatly changes estimation of $\sigma_g$

Validation with Spike-In Data
o Does the idea of using common correlations work in practice?
o Check the ability of the common-correlation t-statistic to distinguish calibration from ratio spike-in spots
o Scorecard system includes calibration controls, 3-fold up and down ratio controls, and 10-fold up and down ratio controls

Multilevel models II: Separate channel analysis of two-colour data

Separate channel analysis of two-colour microarray data
o Each spot is a block of size 2: $(G_1, R_1), (G_2, R_2), (G_3, R_3), (G_4, R_4)$
o The two channels give a correlated pair of values
o Common correlation model sets the intra-spot correlation equal across genes

Why Use Means and Differences?
If $\mathrm{var}(R) = \mathrm{var}(G)$ then $\mathrm{cor}(M, A) = 0$ regardless of the correlation between R and G.
An old idea dating back to Tukey, Altman & Bland.

Common Reference Experiment
o Arrays comparing B with Ref: $\mu_M = B - \mathrm{Ref}$, $\mu_A = (B + \mathrm{Ref})/2$
o Arrays comparing C with Ref: $\mu_M = C - \mathrm{Ref}$, $\mu_A = (C + \mathrm{Ref})/2$
Why not use the A-values as well as M-values?

A simple normal model
For gene g and array i, $\rho_g$ is the intra-spot correlation.
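The statement that M and A are uncorrelated whenever the two channels have equal variance follows in one line from bilinearity of the covariance. A short derivation, using only $M = R - G$ and $A = (R + G)/2$:

```latex
\begin{align*}
\operatorname{cov}(M, A)
  &= \operatorname{cov}\!\left(R - G,\ \tfrac{1}{2}(R + G)\right) \\
  &= \tfrac{1}{2}\left[\operatorname{var}(R) + \operatorname{cov}(R, G)
       - \operatorname{cov}(G, R) - \operatorname{var}(G)\right] \\
  &= \tfrac{1}{2}\left[\operatorname{var}(R) - \operatorname{var}(G)\right].
\end{align*}
```

The cross terms cancel whatever the correlation between R and G is, so the covariance vanishes exactly when $\operatorname{var}(R) = \operatorname{var}(G)$.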

Models in terms of M and A

M and A parameters
$$\mu_{Mgi} = \mu_{Rgi} - \mu_{Ggi}, \qquad \mu_{Agi} = (\mu_{Rgi} + \mu_{Ggi})/2$$
$$\sigma^2_{Mg} = 2\sigma_g^2(1 - \rho_g), \qquad \sigma^2_{Ag} = \sigma_g^2(1 + \rho_g)/2$$

Correlation
M and A are independent but have different variances. Have converted a correlated problem into a heteroscedastic problem. Can estimate the correlation by estimating variances:
$$\tanh^{-1}\rho_g = \frac{1}{2}\log\frac{4\sigma^2_{Ag}}{\sigma^2_{Mg}}$$

Common correlation model
o Assume the intra-spot correlation is constant across genes
o Justified by (i) variance components are observed to be positively correlated and (ii) standard errors for coefficients are not sensitive to correlation value
o Common correlation can then be assumed known at the individual gene level
o Converts a mixed model into a weighted regression
lmscFit() in limma
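The link between the two variances and the intra-spot correlation can be verified numerically. The sketch below just encodes the two variance formulas and inverts them with the tanh relation; the function names and the values $\sigma_g^2 = 0.3$, $\rho_g = 0.85$ are illustrative assumptions, and real data would supply estimated variances instead of exact ones.

```python
import math

def variance_m(sigma2, rho):
    """var(M) = 2 * sigma^2 * (1 - rho) for a spot with channel variance sigma^2."""
    return 2.0 * sigma2 * (1.0 - rho)

def variance_a(sigma2, rho):
    """var(A) = sigma^2 * (1 + rho) / 2 for a spot with channel variance sigma^2."""
    return sigma2 * (1.0 + rho) / 2.0

def intra_spot_correlation(var_a, var_m):
    """Recover rho from atanh(rho) = 0.5 * log(4 * var(A) / var(M))."""
    return math.tanh(0.5 * math.log(4.0 * var_a / var_m))

# Round trip with hypothetical values: start from rho = 0.85 and recover it.
sigma2, rho = 0.3, 0.85
recovered = intra_spot_correlation(variance_a(sigma2, rho), variance_m(sigma2, rho))
```

The round trip works because $4\,\mathrm{var}(A)/\mathrm{var}(M) = (1+\rho)/(1-\rho)$, whose half-logarithm is the inverse hyperbolic tangent of $\rho$.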

ApoAI Experiment
o 8 wild type arrays and 8 ApoAI-/- arrays, all relative to a common reference
o Median intra-spot correlation is estimated as $\hat\rho = 0.85$
o There is an efficiency gain from using A-values, which depends on $\hat\rho$

Individual Channel Normalization
Using A-values in the analysis requires that they be normalized to have comparable values between arrays: single channel normalization.
For the ApoAI data:
o Within-array loess normalization only: $\hat\rho = 0.89$
o A-quantile normalization: $\hat\rho = 0.85$
o Quantile normalization: $\hat\rho =$

Ignoring the Reference
Why not ignore the common reference channel? If we use only the red channel in the common reference experiment,
$$\mathrm{var}(R_C - R_B) = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$$
so
$$\frac{\mathrm{var}(R_C - R_B)}{\mathrm{var}(M_C - M_B)} = \frac{1}{2(1 - \rho)}$$
so adjusting for a common reference is worthwhile whenever $\rho > 0.5$.

Disconnected Design
[design diagram with RNA sources B, C, D and E]
o A-values make no contribution to estimating the B vs C comparison (direct)
o M-values make no contribution to estimating the B vs D comparison (indirect)
o The relative efficiency of the indirect comparison compared to the direct one depends on $1 - \rho$
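The threshold $\rho > 0.5$ for the reference channel to be worth using can be read off the variance ratio directly. A tiny sketch (the function name is an assumption; 0.85 is the median intra-spot correlation quoted for the ApoAI data):

```python
def reference_adjustment_ratio(rho):
    """var(R_C - R_B) / var(M_C - M_B) = 1 / (2 * (1 - rho)).

    Values above 1 mean that log-ratios against the common reference give a
    more precise comparison than the red channel alone.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 1.0 / (2.0 * (1.0 - rho))

break_even = reference_adjustment_ratio(0.5)    # exactly 1: no gain, no loss
apoai = reference_adjustment_ratio(0.85)        # 1 / 0.3, about 3.3
weak = reference_adjustment_ratio(0.2)          # below 1: the reference adds noise
```

With an intra-spot correlation near 0.85, ignoring the reference channel would inflate the variance of the comparison by a factor of about 3.3.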

Observed false discovery rates
[figure: observed false discovery rate against number of tests rejected for three analyses: Mixed model, Ignore Cor, SpotCor. Data from Holloway et al (2006) BMC Bioinformatics 7, Article]

Why common correlation is effective
o Bias introduced into the variance estimation seems to be offset by increase in precision
o Great simplification of mathematical model
o Penalizes genes with large within-block variances

A Shrinkage Hierarchy
o Fold changes: shrinkage may not be required (unless more than 2 groups)
o Genewise variances: soft smoothing gives spectacular improvement
o Technical replicate correlations: hard smoothing has proved successful
As we move from parameters of interest to higher order nuisance parameters, bias decreases in importance relative to noise.

Other bits and pieces

Some other common variations
o Technical replicates
o Paired samples
o Array weights

Summary
o Borrowing strength is essential in small-scale microarray experiments
o Information can be shared across genes or across arrays
o Parameters may be set common between genes (correlations) or shrunk in a graduated way (standard errors)
o Power can be increased by testing hypotheses for sets of genes

Acknowledgements
WEHI Bioinformatics
o Gordon Smyth
o Matt Ritchie
o Alicia Oshlack
o Terry Speed
Garvan Epigenetics
o Susan Clark
o Aaron Statham
o Marcel Coolen

Computer Laboratories

Proposed schedule
a.m.
o ModerationDemo.R
o LimmaObjects.R
o BGNormalization.R
o SimpleExperiments.R
p.m.
o FactorialDesign.R
o GeneSetAnalysis.R
o TimeCourse.R

Getting started
o Grab the files from the KU-August2009-LIMMA directory (or archive) and copy/move to a convenient location on your computer
o Set the variable rootdir to that directory. For example: rootdir <- "~/Desktop/KU-August2009-LIMMA/data"
o Make rootdir the working directory of your R session

limma package documentation
o Function help pages: ?lmFit, ?eBayes
o Class help pages: ?"RGList-class", ?"MArrayLM-class"
o Group help pages: help("06.LinearModels")
o User's Guide: limmaUsersGuide()
o The R html help system is a good top view

Moderation Demo
o Illustration of sampling from the model
o Reduction in false discoveries
o Empirical Bayes differential expression

Limma Objects
o RGList
o MAList
o MArrayLM

BG correction / normalization demo
o Various procedures for BG correction, normalization
o Jurkat data: same vs. same

Simple Experiment 1: Integrin beta7+ vs beta7-
o Reading two-colour data
o Control spots
o Background correction
o Dye-swaps
o Empirical Bayes differential expression

Simple Experiment 2: Cancer cells versus normal cells
o Normal cell line versus cancer cell line
o Reading Affymetrix data
o Simple design versus contrasts

Factorial Experiment: Estrogen
o Design: estrogen (absent, present) by time (10 hours, 48 hours)
o Reading Affymetrix data
o Factorial designs, contrasts

Gene set analysis
o Convenient functional category analysis
o Newer, flexible set-based testing

Targets of SAHA and depsipeptide: a time course experiment
o Study effects of SAHA and depsipeptide on the acute T-cell leukemia cell line CEM
o SAHA and depsipeptide are structurally different but have similar biological effects (induce death through intrinsic apoptotic pathway)
o Prising out subtle differences is of great interest

SAHA/depsipeptide: Experimental design

Time   SAHA   Vehicle only   depsipeptide
0hr    x      x              x
1hr    x      x              x
2hr    x      x              x
4hr    x      x              x
8hr    x      x              x
16hr   x      x              x

Time courses of 6 arrays were done at each time

Aims of experiment
o Identify common responders: genes which respond similarly to SAHA and depsipeptide
o Identify specific responders: genes which respond to one of SAHA or depsipeptide, but not to the other
o Different responders, genes which respond to both SAHA and depsipeptide but differently, are of lesser interest

Linear model analysis
o Fit genewise linear models to all the arrays simultaneously
o Include effects for drug x time
o Allow for probe-specific dye-effects
o Treat each time series of 6 arrays as a randomized block, i.e., allow arrays hybridized together to be correlated

Second statistical point: analyse all arrays together


More information

High-throughput Testing

High-throughput Testing High-throughput Testing Noah Simon and Richard Simon July 2016 1 / 29 Testing vs Prediction On each of n patients measure y i - single binary outcome (eg. progression after a year, PCR) x i - p-vector

More information

Looking at the Other Side of Bonferroni

Looking at the Other Side of Bonferroni Department of Biostatistics University of Washington 24 May 2012 Multiple Testing: Control the Type I Error Rate When analyzing genetic data, one will commonly perform over 1 million (and growing) hypothesis

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Gordon Smyth, WEHI Analysis of Replicated Experiments. January 2004 IMS NUS Microarray Tutorial. Some Web Sites. Analysis of Replicated Experiments

Gordon Smyth, WEHI Analysis of Replicated Experiments. January 2004 IMS NUS Microarray Tutorial. Some Web Sites. Analysis of Replicated Experiments Some Web Sites Statistical Methods in Microarray Analysis Tutorial Institute for Mathematical Sciences National University of Sinapore January, 004 Gordon Smyth Walter and Eliza Hall Institute Technical

More information

Statistical tests for differential expression in count data (1)

Statistical tests for differential expression in count data (1) Statistical tests for differential expression in count data (1) NBIC Advanced RNA-seq course 25-26 August 2011 Academic Medical Center, Amsterdam The analysis of a microarray experiment Pre-process image

More information

Probe-Level Analysis of Affymetrix GeneChip Microarray Data

Probe-Level Analysis of Affymetrix GeneChip Microarray Data Probe-Level Analysis of Affymetrix GeneChip Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley University of Minnesota Mar 30, 2004 Outline

More information

Exam: high-dimensional data analysis February 28, 2014

Exam: high-dimensional data analysis February 28, 2014 Exam: high-dimensional data analysis February 28, 2014 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question (not the subquestions) on a separate piece of paper.

More information

Design of microarray experiments

Design of microarray experiments Design of microarray experiments Ulrich Mansmann mansmann@imbi.uni-heidelberg.de Practical microarray analysis March 23 Heidelberg Heidelberg, March 23 Experiments Scientists deal mostly with experiments

More information

Expression arrays, normalization, and error models

Expression arrays, normalization, and error models 1 Epression arrays, normalization, and error models There are a number of different array technologies available for measuring mrna transcript levels in cell populations, from spotted cdna arrays to in

More information

Design and Analysis of Gene Expression Experiments

Design and Analysis of Gene Expression Experiments Design and Analysis of Gene Expression Experiments Guilherme J. M. Rosa Department of Animal Sciences Department of Biostatistics & Medical Informatics University of Wisconsin - Madison OUTLINE Æ Linear

More information

SPH 247 Statistical Analysis of Laboratory Data. April 28, 2015 SPH 247 Statistics for Laboratory Data 1

SPH 247 Statistical Analysis of Laboratory Data. April 28, 2015 SPH 247 Statistics for Laboratory Data 1 SPH 247 Statistical Analysis of Laboratory Data April 28, 2015 SPH 247 Statistics for Laboratory Data 1 Outline RNA-Seq for differential expression analysis Statistical methods for RNA-Seq: Structure and

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2004 Paper 147 Multiple Testing Methods For ChIP-Chip High Density Oligonucleotide Array Data Sunduz

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

Bayesian methods in economics and finance

Bayesian methods in economics and finance 1/26 Bayesian methods in economics and finance Linear regression: Bayesian model selection and sparsity priors Linear Regression 2/26 Linear regression Model for relationship between (several) independent

More information

Single gene analysis of differential expression

Single gene analysis of differential expression Single gene analysis of differential expression Giorgio Valentini DSI Dipartimento di Scienze dell Informazione Università degli Studi di Milano valentini@dsi.unimi.it Comparing two conditions Each condition

More information

Statistical analysis of microarray data: a Bayesian approach

Statistical analysis of microarray data: a Bayesian approach Biostatistics (003), 4, 4,pp. 597 60 Printed in Great Britain Statistical analysis of microarray data: a Bayesian approach RAPHAEL GTTARD University of Washington, Department of Statistics, Box 3543, Seattle,

More information

Simultaneous Inference for Multiple Testing and Clustering via Dirichlet Process Mixture Models

Simultaneous Inference for Multiple Testing and Clustering via Dirichlet Process Mixture Models Simultaneous Inference for Multiple Testing and Clustering via Dirichlet Process Mixture Models David B. Dahl Department of Statistics Texas A&M University Marina Vannucci, Michael Newton, & Qianxing Mo

More information

Parametric Empirical Bayes Methods for Microarrays

Parametric Empirical Bayes Methods for Microarrays Parametric Empirical Bayes Methods for Microarrays Ming Yuan, Deepayan Sarkar, Michael Newton and Christina Kendziorski April 30, 2018 Contents 1 Introduction 1 2 General Model Structure: Two Conditions

More information

DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective

DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective Second Edition Scott E. Maxwell Uniuersity of Notre Dame Harold D. Delaney Uniuersity of New Mexico J,t{,.?; LAWRENCE ERLBAUM ASSOCIATES,

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 6, Issue 1 2007 Article 28 A Comparison of Methods to Control Type I Errors in Microarray Studies Jinsong Chen Mark J. van der Laan Martyn

More information

Probe-Level Analysis of Affymetrix GeneChip Microarray Data

Probe-Level Analysis of Affymetrix GeneChip Microarray Data Probe-Level Analysis of Affymetrix GeneChip Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley Memorial Sloan-Kettering Cancer Center July

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org

More information

Improved Statistical Tests for Differential Gene Expression by Shrinking Variance Components Estimates

Improved Statistical Tests for Differential Gene Expression by Shrinking Variance Components Estimates Improved Statistical Tests for Differential Gene Expression by Shrinking Variance Components Estimates September 4, 2003 Xiangqin Cui, J. T. Gene Hwang, Jing Qiu, Natalie J. Blades, and Gary A. Churchill

More information

Statistical testing. Samantha Kleinberg. October 20, 2009

Statistical testing. Samantha Kleinberg. October 20, 2009 October 20, 2009 Intro to significance testing Significance testing and bioinformatics Gene expression: Frequently have microarray data for some group of subjects with/without the disease. Want to find

More information

False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data

False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data Ståle Nygård Trial Lecture Dec 19, 2008 1 / 35 Lecture outline Motivation for not using

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org

More information

Experimental Design. Experimental design. Outline. Choice of platform Array design. Target samples

Experimental Design. Experimental design. Outline. Choice of platform Array design. Target samples Experimental Design Credit for some of today s materials: Jean Yang, Terry Speed, and Christina Kendziorski Experimental design Choice of platform rray design Creation of probes Location on the array Controls

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

Sample Size Estimation for Studies of High-Dimensional Data

Sample Size Estimation for Studies of High-Dimensional Data Sample Size Estimation for Studies of High-Dimensional Data James J. Chen, Ph.D. National Center for Toxicological Research Food and Drug Administration June 3, 2009 China Medical University Taichung,

More information

Inferential Statistical Analysis of Microarray Experiments 2007 Arizona Microarray Workshop

Inferential Statistical Analysis of Microarray Experiments 2007 Arizona Microarray Workshop Inferential Statistical Analysis of Microarray Experiments 007 Arizona Microarray Workshop μ!! Robert J Tempelman Department of Animal Science tempelma@msuedu HYPOTHESIS TESTING (as if there was only one

More information

Stat 206: Estimation and testing for a mean vector,

Stat 206: Estimation and testing for a mean vector, Stat 206: Estimation and testing for a mean vector, Part II James Johndrow 2016-12-03 Comparing components of the mean vector In the last part, we talked about testing the hypothesis H 0 : µ 1 = µ 2 where

More information

Package plw. R topics documented: May 7, Type Package

Package plw. R topics documented: May 7, Type Package Type Package Package plw May 7, 2018 Title Probe level Locally moderated Weighted t-tests. Version 1.40.0 Date 2009-07-22 Author Magnus Astrand Maintainer Magnus Astrand

More information

Microarray Preprocessing

Microarray Preprocessing Microarray Preprocessing Normaliza$on Normaliza$on is needed to ensure that differences in intensi$es are indeed due to differen$al expression, and not some prin$ng, hybridiza$on, or scanning ar$fact.

More information

Design of microarray experiments

Design of microarray experiments Design of microarray experiments Ulrich ansmann mansmann@imbi.uni-heidelberg.de Practical microarray analysis September Heidelberg Heidelberg, September otivation The lab biologist and theoretician need

More information

changes in gene expression, we developed and tested several models. Each model was

changes in gene expression, we developed and tested several models. Each model was Additional Files Additional File 1 File format: PDF Title: Experimental design and linear models Description: This additional file describes in detail the experimental design and linear models used to

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Bradley Broom Department of Bioinformatics and Computational Biology UT M. D. Anderson Cancer Center kabagg@mdanderson.org bmbroom@mdanderson.org

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

More information

On Procedures Controlling the FDR for Testing Hierarchically Ordered Hypotheses

On Procedures Controlling the FDR for Testing Hierarchically Ordered Hypotheses On Procedures Controlling the FDR for Testing Hierarchically Ordered Hypotheses Gavin Lynch Catchpoint Systems, Inc., 228 Park Ave S 28080 New York, NY 10003, U.S.A. Wenge Guo Department of Mathematical

More information

Network Biology-part II

Network Biology-part II Network Biology-part II Jun Zhu, Ph. D. Professor of Genomics and Genetic Sciences Icahn Institute of Genomics and Multi-scale Biology The Tisch Cancer Institute Icahn Medical School at Mount Sinai New

More information

Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee

Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Lecture - 04 Basic Statistics Part-1 (Refer Slide Time: 00:33)

More information

y ˆ i = ˆ " T u i ( i th fitted value or i th fit)

y ˆ i = ˆ  T u i ( i th fitted value or i th fit) 1 2 INFERENCE FOR MULTIPLE LINEAR REGRESSION Recall Terminology: p predictors x 1, x 2,, x p Some might be indicator variables for categorical variables) k-1 non-constant terms u 1, u 2,, u k-1 Each u

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Hypothesis testing Machine Learning CSE546 Kevin Jamieson University of Washington October 30, 2018 2018 Kevin Jamieson 2 Anomaly detection You are

More information

A Decision-theory Approach to Interpretable Set Analysis for High-dimensional Data

A Decision-theory Approach to Interpretable Set Analysis for High-dimensional Data A Decision-theory Approach to Interpretable Set Analysis for High-dimensional Data Simina M. Boca, Hector Corrada Bravo, Jeffrey T. Leek Johns Hopkins University Bloomberg School of Public Health Giovanni

More information

Data Preprocessing. Data Preprocessing

Data Preprocessing. Data Preprocessing Data Preprocessing 1 Data Preprocessing Normalization: the process of removing sampleto-sample variations in the measurements not due to differential gene expression. Bringing measurements from the different

More information

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 Contents Preface to Second Edition Preface to First Edition Abbreviations xv xvii xix PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 1 The Role of Statistical Methods in Modern Industry and Services

More information

Multi-level Models: Idea

Multi-level Models: Idea Review of 140.656 Review Introduction to multi-level models The two-stage normal-normal model Two-stage linear models with random effects Three-stage linear models Two-stage logistic regression with random

More information

Laurent Gautier 1, 2. August 18 th, 2009

Laurent Gautier 1, 2. August 18 th, 2009 s Processing of Laurent Gautier 1, 2 1 DTU Multi-Assay Core(DMAC) 2 Center for Biological Sequence (CBS) August 18 th, 2009 s 1 s 2 3 4 s 1 s 2 3 4 Photolithography s Courtesy of The Science Creative Quarterly,

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

18.05 Practice Final Exam

18.05 Practice Final Exam No calculators. 18.05 Practice Final Exam Number of problems 16 concept questions, 16 problems. Simplifying expressions Unless asked to explicitly, you don t need to simplify complicated expressions. For

More information

Statistics Applied to Bioinformatics. Tests of homogeneity

Statistics Applied to Bioinformatics. Tests of homogeneity Statistics Applied to Bioinformatics Tests of homogeneity Two-tailed test of homogeneity Two-tailed test H 0 :m = m Principle of the test Estimate the difference between m and m Compare this estimation

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

18.05 Final Exam. Good luck! Name. No calculators. Number of problems 16 concept questions, 16 problems, 21 pages

18.05 Final Exam. Good luck! Name. No calculators. Number of problems 16 concept questions, 16 problems, 21 pages Name No calculators. 18.05 Final Exam Number of problems 16 concept questions, 16 problems, 21 pages Extra paper If you need more space we will provide some blank paper. Indicate clearly that your solution

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Statistics for Differential Expression in Sequencing Studies. Naomi Altman

Statistics for Differential Expression in Sequencing Studies. Naomi Altman Statistics for Differential Expression in Sequencing Studies Naomi Altman naomi@stat.psu.edu Outline Preliminaries what you need to do before the DE analysis Stat Background what you need to know to understand

More information

On testing the significance of sets of genes

On testing the significance of sets of genes On testing the significance of sets of genes Bradley Efron and Robert Tibshirani August 17, 2006 Abstract This paper discusses the problem of identifying differentially expressed groups of genes from a

More information

A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data

A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data Yoonkyung Lee Department of Statistics The Ohio State University http://www.stat.ohio-state.edu/ yklee May 13, 2005

More information

STAT 536: Genetic Statistics

STAT 536: Genetic Statistics STAT 536: Genetic Statistics Tests for Hardy Weinberg Equilibrium Karin S. Dorman Department of Statistics Iowa State University September 7, 2006 Statistical Hypothesis Testing Identify a hypothesis,

More information

David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis

David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis March 18, 2016 UVA Seminar RNA Seq 1 RNA Seq Gene expression is the transcription of the

More information

TESTING SIGNIFICANCE OF FEATURES BY LASSOED PRINCIPAL COMPONENTS. BY DANIELA M. WITTEN 1 AND ROBERT TIBSHIRANI 2 Stanford University

TESTING SIGNIFICANCE OF FEATURES BY LASSOED PRINCIPAL COMPONENTS. BY DANIELA M. WITTEN 1 AND ROBERT TIBSHIRANI 2 Stanford University The Annals of Applied Statistics 2008, Vol. 2, No. 3, 986 1012 DOI: 10.1214/08-AOAS182 Institute of Mathematical Statistics, 2008 TESTING SIGNIFICANCE OF FEATURES BY LASSOED PRINCIPAL COMPONENTS BY DANIELA

More information

Specific Differences. Lukas Meier, Seminar für Statistik

Specific Differences. Lukas Meier, Seminar für Statistik Specific Differences Lukas Meier, Seminar für Statistik Problem with Global F-test Problem: Global F-test (aka omnibus F-test) is very unspecific. Typically: Want a more precise answer (or have a more

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

More information

REPRODUCIBLE ANALYSIS OF HIGH-THROUGHPUT EXPERIMENTS

REPRODUCIBLE ANALYSIS OF HIGH-THROUGHPUT EXPERIMENTS REPRODUCIBLE ANALYSIS OF HIGH-THROUGHPUT EXPERIMENTS Ying Liu Department of Biostatistics, Columbia University Summer Intern at Research and CMC Biostats, Sanofi, Boston August 26, 2015 OUTLINE 1 Introduction

More information

EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST

EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST TIAN ZHENG, SHAW-HWA LO DEPARTMENT OF STATISTICS, COLUMBIA UNIVERSITY Abstract. In

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Modified Simes Critical Values Under Positive Dependence

Modified Simes Critical Values Under Positive Dependence Modified Simes Critical Values Under Positive Dependence Gengqian Cai, Sanat K. Sarkar Clinical Pharmacology Statistics & Programming, BDS, GlaxoSmithKline Statistics Department, Temple University, Philadelphia

More information

Multiple testing: Intro & FWER 1

Multiple testing: Intro & FWER 1 Multiple testing: Intro & FWER 1 Mark van de Wiel mark.vdwiel@vumc.nl Dep of Epidemiology & Biostatistics,VUmc, Amsterdam Dep of Mathematics, VU 1 Some slides courtesy of Jelle Goeman 1 Practical notes

More information

DETECTING DIFFERENTIALLY EXPRESSED GENES WHILE CONTROLLING THE FALSE DISCOVERY RATE FOR MICROARRAY DATA

DETECTING DIFFERENTIALLY EXPRESSED GENES WHILE CONTROLLING THE FALSE DISCOVERY RATE FOR MICROARRAY DATA University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Dissertations and Theses in Statistics Statistics, Department of 2009 DETECTING DIFFERENTIALLY EXPRESSED GENES WHILE CONTROLLING

More information

Tools and topics for microarray analysis

Tools and topics for microarray analysis Tools and topics for microarray analysis USSES Conference, Blowing Rock, North Carolina, June, 2005 Jason A. Osborne, osborne@stat.ncsu.edu Department of Statistics, North Carolina State University 1 Outline

More information

Post-Selection Inference

Post-Selection Inference Classical Inference start end start Post-Selection Inference selected end model data inference data selection model data inference Post-Selection Inference Todd Kuffner Washington University in St. Louis

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Mixtures and Hidden Markov Models for analyzing genomic data

Mixtures and Hidden Markov Models for analyzing genomic data Mixtures and Hidden Markov Models for analyzing genomic data Marie-Laure Martin-Magniette UMR AgroParisTech/INRA Mathématique et Informatique Appliquées, Paris UMR INRA/UEVE ERL CNRS Unité de Recherche

More information

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017 Lecture 7: Interaction Analysis Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 39 Lecture Outline Beyond main SNP effects Introduction to Concept of Statistical Interaction

More information

Comparative analysis of RNA- Seq data with DESeq2

Comparative analysis of RNA- Seq data with DESeq2 Comparative analysis of RNA- Seq data with DESeq2 Simon Anders EMBL Heidelberg Two applications of RNA- Seq Discovery Eind new transcripts Eind transcript boundaries Eind splice junctions Comparison Given

More information

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University Multiple Testing Hoang Tran Department of Statistics, Florida State University Large-Scale Testing Examples: Microarray data: testing differences in gene expression between two traits/conditions Microbiome

More information

Single gene analysis of differential expression. Giorgio Valentini

Single gene analysis of differential expression. Giorgio Valentini Single gene analysis of differential expression Giorgio Valentini valenti@disi.unige.it Comparing two conditions Each condition may be represented by one or more RNA samples. Using cdna microarrays, samples

More information