Exam: high-dimensional data analysis January 20, 2014


Instructions:
- Write clearly. Scribbles will not be deciphered.
- Answer each main question (not the subquestions) on a separate piece of paper.
- Finish in time! Good luck!

Question 1

A researcher is interested in the post-transcriptional regulation of an mRNA by two microRNAs. The researcher has conducted a small experiment measuring the expression levels of these three entities. The data are given in the following table:

Observation   mRNA   microRNA 1   microRNA 2
2 2 2 3 0 4 0 2

Question 1a
Write down the linear regression model that explains the expression levels of the mRNA by those of the two microRNAs. In this, ignore the intercept and assume that the error has mean zero and unit variance.

Question 1b
Give the loss function associated with ridge-penalized maximum likelihood estimation of the regression coefficients for the model of part (a) of this question.

Question 1c
Optimize this loss function with respect to the regression coefficients. In this, set the ridge penalty parameter λ_2 equal to 6.

Question 1d
Instead of the traditional ridge penalty, now augment the maximum likelihood loss function with the following modified ridge penalty: ½ λ_2 (β − 1_2)^T (β − 1_2), where β is the regression coefficient vector and 1_2 = (1, 1)^T. What is the effect of this penalty? In particular, explain how it differs from the traditional ridge penalty considered above.

Question 1e
Replace in the loss function of part (b) of this question the traditional ridge penalty by the modified one of part (d) of this question. Now find the modified ridge-penalized maximum likelihood estimate of β when λ_2 = 6.

Question 1f
Write down the lasso analogue of the loss function employed in part (e) of this question, and find its optimum for λ_1 = 6.

Question 2

Consider a p-gene pathway. Gene expression data on each gene in the pathway are available from an observational study involving n samples. Assume these data from the pathway can be modeled by a multivariate normal distribution.

Question 2a
When uncovering the conditional independence graph underlying the pathway, one may exploit the link between regression coefficients and partial correlations. Hence, irrespective of whether one directly estimates the inverse of the covariance matrix or uses a linear regression approach, the same conclusion is reached. Discuss whether you think the two approaches still give identical results in a high-dimensional setting (p > n).

Question 2b
The topology of the conditional independence graph is known: it is known which genes in the pathway interact. For convenience, you may now assume n > p. Explain in words how you would estimate the covariance matrix of the expression levels of the p genes, taking into account the known structure of the conditional independence graph.

Question 3

A biologist detected 1000 significant genomic features in a 10 vs 10 comparison using a Benjamini-Hochberg FDR threshold t = 0.1. Now she aims at validating those results. She uses a new type of high-resolution microarray which measures 3 times as many genomic features as the one used for the original experiment, including all the genomic features of the original one. She selects 20 (10 + 10) new samples from exactly the same population as the 20 original ones, follows exactly the same laboratory protocols and performs the same analysis as before. She is surprised to find only 300 significant features.

Question 3a
Argue what could have caused this apparent loss of power.

Question 3b
Assume now that the new experiment is only a partial confirmation experiment: she has the means to run a third, final validation experiment with larger sample sizes on all features that are significant according to this second experiment. Advise her on how to perform p-value based multiple testing correction on this second experiment.

Question 3c
Now assume that the second experiment is really the final validation experiment. Advise her on how to perform the p-value based multiple testing correction in this case.

Question 4

A researcher has performed an RNAseq experiment and now wishes to analyze the data. The following factors have to be accounted for:
I. the main comparison of interest is between 2 groups of 15 individuals;
II. the experiment contains four repeated measurements for 10 individuals; these repeats are all within one group;
III. the experiment is done in two batches, equally spread over groups and individuals.

Question 4a
Write down the full model that you would like to use to analyse these data.

Question 4b
For the parameter that codes for the group difference, say β_{gi} (i denotes the genomic feature), the researcher wishes to test H_{0i}: β_{gi} ≤ 0.5. He decides to use a Gaussian mixture prior with three components for this parameter instead of a simple Gaussian. Why is this a good choice?

Question 4c
With INLA/ShrinkBayes we obtain fits under each of the three Gaussian components separately, and hence also posterior probabilities like π(β_{gi} > 0.5 | Y_i, C_k), which denotes the posterior tail probability under the k-th component of the mixture prior. Show how we can compute the posterior probability of interest, π(β_{gi} > 0.5 | Y_i), from these probabilities, the prior and other results from INLA.

Question 4d
It turns out that the RNAseq experiment contains a lot of noisy genomic features that are biologically not interesting and which can a priori be excluded. Is this a valid and wise approach, and what consequences would it have for the Gaussian mixture prior?

Question 4e
The INLA computations take a lot of time. The researcher has all the results obtained under the original prior. Show how to recalculate the posteriors under a new prior without applying INLA again.

Answers to Question 1

Answer to Question 1a
Let Y_i, X_{i,1}, X_{i,2} and ε_i be random variables representing the expression levels of the mRNA, microRNA 1, microRNA 2, and the error in sample i. The linear regression model is
Y_i = β_1 X_{i,1} + β_2 X_{i,2} + ε_i,   i = 1, ..., n,
with ε_i ~ N(0, σ²) and Cov(ε_{i1}, ε_{i2}) = 0 if i1 ≠ i2.

Answer to Question 1b
The ridge loss function is:
½ ||Y − Xβ||_2² + ½ λ_2 ||β||_2² = ½ Σ_{i=1}^{n} (Y_i − β_1 X_{i,1} − β_2 X_{i,2})² + ½ λ_2 (β_1² + β_2²).

Answer to Question 1c
Equate the derivative with respect to β to zero and obtain the estimating equation:
−X^T (Y − Xβ) + λ_2 β = 0,
which has the solution:
β̂(λ_2) = (X^T X + λ_2 I)^{-1} X^T Y.
In this case
X^T X = [3, 0; 0, 10]   and   X^T Y = (3, 1)^T.
Thus:
β̂(6) = [9, 0; 0, 16]^{-1} (3, 1)^T = (1/3, 1/16)^T.
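
A quick numerical check of this estimate (a minimal numpy sketch; it simply re-does the matrix algebra with the X^T X and X^T Y stated above):

import numpy as np

# Cross-products as given in the answer to Question 1c.
XtX = np.array([[3.0, 0.0],
                [0.0, 10.0]])
XtY = np.array([3.0, 1.0])

lam2 = 6.0  # ridge penalty parameter lambda_2

# Ridge estimate: (X^T X + lambda_2 I)^{-1} X^T Y
beta_ridge = np.linalg.solve(XtX + lam2 * np.eye(2), XtY)
print(beta_ridge)  # [0.33333333 0.0625], i.e. (1/3, 1/16)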

Answer to Question 1d
It shrinks the regression coefficients towards one (the target vector 1_2) as λ_2 → ∞. The penalty now includes a target other than zero.

Answer to Question 1e
The estimating equation now changes to:
−X^T (Y − Xβ) + λ_2 (β − 1_2) = 0_2.
Solving for β yields:
β̂(λ_2) = (X^T X + λ_2 I)^{-1} (X^T Y + λ_2 1_2).

Answer to Question 1f
First note that λ_1 ||β − 1_2||_1 = λ_1 |β_1 − 1| + λ_1 |β_2 − 1|. Apply the transformation of variables γ_1 = β_1 − 1 and γ_2 = β_2 − 1. The loss function then becomes:
½ ||Ỹ − Xγ||_2² + λ_1 ||γ||_1,
where Ỹ = Y − X 1_2. As the columns of the design matrix X are orthogonal, the loss function may be optimized with respect to γ_1 and γ_2 separately. The rest is analogous to the lecture notes.
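
The answer above leaves the final numbers implicit. A small numpy sketch of both estimates, assuming the X^T X and X^T Y from the answer to Question 1c and the ½-scaled residual term as written above (under that scaling, each lasso coordinate of an orthogonal design is a soft-thresholded X_j^T Ỹ divided by X_j^T X_j):

import numpy as np

# Quantities from the answer to Question 1c; the columns of X are orthogonal.
XtX = np.array([[3.0, 0.0],
                [0.0, 10.0]])
XtY = np.array([3.0, 1.0])
ones = np.ones(2)

# Modified ridge estimate of part (e) with lambda_2 = 6:
# beta_hat = (X^T X + lambda_2 I)^{-1} (X^T Y + lambda_2 1_2)
lam2 = 6.0
beta_mod_ridge = np.linalg.solve(XtX + lam2 * np.eye(2), XtY + lam2 * ones)
print(beta_mod_ridge)  # [1.     0.4375], i.e. (1, 7/16)

# Lasso analogue of part (f): minimize (1/2)||Ytilde - X gamma||_2^2 + lambda_1 ||gamma||_1,
# with gamma = beta - 1_2 and Ytilde = Y - X 1_2.
lam1 = 6.0
z = XtY - XtX @ ones           # X^T Ytilde = X^T Y - X^T X 1_2
c = np.diag(XtX)               # X_j^T X_j for each column j
gamma_hat = np.sign(z) * np.maximum(np.abs(z) - lam1, 0.0) / c
beta_lasso = gamma_hat + ones
print(beta_lasso)              # [1.  0.7]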

Answers to Question 2

Answer to Question 2a
Both penalized estimates (of the regression coefficients and of the covariance matrix) are biased. As the loss functions and penalties are different, their biases may also be rather different. It is thus unclear whether the equivalence still holds in the high-dimensional setting.

Answer to Question 2b
Estimate the inverse covariance matrix by maximizing the log-likelihood augmented with a lasso-type penalty. In this penalty each parameter of the precision matrix has its own penalty parameter. This parameter equals zero if the corresponding edge is present in the known topology, and infinity if it is not.

Answers to Question 3

Answer to Question 3a
The newly included features may contain much less differential signal than the existing ones. When using FDR, the significance of each single gene also depends on the signal of the others. Even if the p-values of the 1000 originally detected features were exactly reproduced, the FDR in the new experiment at the threshold used for the first experiment might be higher, because V, the number of false discoveries, increases proportionally with the number of tests, whereas R, the total number of discoveries, increases much less than proportionally. Hence, a smaller p-value threshold is required, which potentially leads to fewer findings.

Answer to Question 3b
It would suffice to test only the 1000 genes that were detected in the first experiment, and apply BH-FDR, because further validation will be available.

Answer to Question 3c
Apply an FWER criterion: either Bonferroni or Holm for simplicity, or Westfall-Young permutation when high correlations are expected.
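
To illustrate the difference between the corrections recommended in 3b and 3c, a minimal numpy sketch with made-up p-values (textbook implementations of Benjamini-Hochberg and Holm, not the course code):

import numpy as np

def benjamini_hochberg(p, q=0.10):
    # Boolean mask of rejections at FDR level q (BH step-up).
    p = np.asarray(p)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])   # largest rank satisfying the BH condition
        reject[order[:k + 1]] = True
    return reject

def holm(p, alpha=0.05):
    # Boolean mask of rejections controlling the FWER (Holm step-down).
    p = np.asarray(p)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

pvals = np.array([0.001, 0.008, 0.02, 0.04, 0.30, 0.70])  # hypothetical p-values
print(benjamini_hochberg(pvals, q=0.10))  # FDR control (more liberal)
print(holm(pvals, alpha=0.05))            # FWER control (more conservative)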

Answers to Question 4

Answer to Question 4a
Y_{ijk} ~ ZI-NB(µ_{ijk}, φ_i, w_i),
log(µ_{ijk}) = α_{i0} + α_{i1} X_{1j} + α_{i2} X_{2jk} + β_{ij},   β_{ij} ~ N(0, τ_i²).
Here i: gene, j: individual, k: repeat. Moreover, φ_i: overdispersion, w_i: zero-inflation (optional). Finally, X_{1j}: group indicator, equal to 0 if individual j belongs to group 1, and 1 otherwise; X_{2jk}: indicator for batch, 0 for batch 1, 1 for batch 2. The between-individual random effect is modeled by β_{ij} with a Gaussian prior.

Answer to Question 4b
Because such a mixture may better discern the negatively and positively differentially expressed genes from the non-differentially expressed ones. Hence, power may increase. In addition, a mixture allows for asymmetry between negatively and positively expressed genes while maintaining the mean differential expression at 0; a simple Gaussian does not.

Answer to Question 4c
Note that β_{gi} corresponds to α_{i1} in the model above. First:
π(β_{gi} > 0.5 | Y_i) = Σ_{k=1}^{3} P(C_k | Y_i) π(β_{gi} > 0.5 | Y_i, C_k).
Then:
P(C_k | Y_i) = P(Y_i | C_k) P(C_k) / Σ_{k'} P(Y_i | C_{k'}) P(C_{k'}),
which are all available: P(Y_i | C_k) as the marginal likelihood from the fits under the separate model components, and P(C_k) from the prior.

Answer to Question 4d
Yes, as long as the removal is done a priori. It is wise because it may lead to a prior with better separated mixture components (smaller standard deviations), since the noisy features were removed, which may improve power for the relevant features.

Answer to Question 4e
Compute it from the old posterior, the new prior and the old prior:
π_new(β_{gi} | Y_i) ∝ π_old(β_{gi} | Y_i) π_new(β_{gi}) / π_old(β_{gi}).
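
A minimal numerical illustration of this reweighting on a grid (plain numpy; the Gaussian densities are illustrative stand-ins for the INLA output, not actual ShrinkBayes results):

import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Grid of values for the group-effect parameter beta_gi.
grid = np.linspace(-3.0, 3.0, 601)
step = grid[1] - grid[0]

# Stand-ins: posterior obtained under the old prior, plus old and new priors.
post_old  = normal_pdf(grid, 0.8, 0.4)
prior_old = normal_pdf(grid, 0.0, 1.0)
prior_new = normal_pdf(grid, 0.0, 0.5)

# pi_new(beta | Y) is proportional to pi_old(beta | Y) * pi_new(beta) / pi_old(beta);
# renormalize numerically on the grid.
post_new = post_old * prior_new / prior_old
post_new /= post_new.sum() * step

# Posterior tail probability of interest under the new prior, e.g. P(beta_gi > 0.5 | Y_i).
print(post_new[grid > 0.5].sum() * step)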