LASR 2015 Geometry-Driven Statistics and its Cutting Edge Applications: Celebrating Four Decades of Leeds Statistics Workshops

Similar documents
Approaches to Modeling Menstrual Cycle Function

Joint work with Nottingham colleagues Simon Preston and Michail Tsagris.

Discussion Paper No. 28

Distance-Based Probability Distribution for Set Partitions with Applications to Bayesian Nonparametrics

Math 120: Precalculus Autumn 2017 A List of Topics for the Final

Fractal functional regression for classification of gene expression data by wavelets

Directional Statistics

Computational Biology: Basics & Interesting Problems

The consequences of misspecifying the random effects distribution when fitting generalized linear mixed models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Simultaneous Estimation in a Restricted Linear Model*

L'horloge circadienne, le cycle cellulaire et leurs interactions. Madalena CHAVES

Estimation of cumulative distribution function with spline functions

Week 5 Quantitative Analysis of Financial Markets Modeling and Forecasting Trend

Bayesian Inference for Clustered Extremes

Proteomics. 2 nd semester, Department of Biotechnology and Bioinformatics Laboratory of Nano-Biotechnology and Artificial Bioengineering

Modeling Gene Expression from Microarray Expression Data with State-Space Equations. F.X. Wu, W.J. Zhang, and A.J. Kusalik

Post-selection Inference for Changepoint Detection

Integrated Mathematics I, II, III 2016 Scope and Sequence

Rule of Thumb Think beyond simple ANOVA when a factor is time or dose think ANCOVA.

Discovering molecular pathways from protein interaction and ge

Predicting Protein Functions and Domain Interactions from Protein Interactions

Molecular and cellular biology is about studying cell structure and function

Biostat 2065 Analysis of Incomplete Data

Multivariate and Time Series Models for Circular Data with Applications to Protein Conformational Angles

Introduction to Bioinformatics

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

BIOLOGY IB DIPLOMA PROGRAM PARENTS ORIENTATION DAY

ANALYSIS OF NONLINEAR REGRESSION MODELS: A CAUTIONARY NOTE

Specifying Latent Curve and Other Growth Models Using Mplus. (Revised )

Example 9 Algebraic Evaluation for Example 1

Probabilistic Reasoning in Deep Learning

Exam: high-dimensional data analysis January 20, 2014

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

Phylogenetic Trees. How do the changes in gene sequences allow us to reconstruct the evolutionary relationships between related species?

Matteo Figliuzzi

Lecture 10: Cyclins, cyclin kinases and cell division

Basic math for biology

An Information Criteria for Order-restricted Inference

PARAMETER UNCERTAINTY QUANTIFICATION USING SURROGATE MODELS APPLIED TO A SPATIAL MODEL OF YEAST MATING POLARIZATION

Mathematics Grade 11 PA Alternate Eligible Content

AN EXACT SOLVER FOR THE DCJ MEDIAN PROBLEM

COMPARISON OF MODE SHAPE VECTORS IN OPERATIONAL MODAL ANALYSIS DEALING WITH CLOSELY SPACED MODES.

Chapter 15 Active Reading Guide Regulation of Gene Expression

Model Selection. Frank Wood. December 10, 2009

Unit 1: Statistical Analysis. IB Biology SL

Obnoxious lateness humor

Probability theory and mathematical statistics:

By Bhattacharjee, Das. Published: 26 April 2018

Curriculum Map for Mathematics SL (DP1)

Approximation of Survival Function by Taylor Series for General Partly Interval Censored Data

Course plan Academic Year Qualification MSc on Bioinformatics for Health Sciences. Subject name: Computational Systems Biology Code: 30180

Complex Networks, Course 303A, Spring, Prof. Peter Dodds

A. H. Abuzaid a, A. G. Hussin b & I. B. Mohamed a a Institute of Mathematical Sciences, University of Malaya, 50603,

Bayesian estimation of the discrepancy with misspecified parametric models

STA 414/2104, Spring 2014, Practice Problem Set #1

the long tau-path for detecting monotone association in an unspecified subpopulation

1. (10pts) If θ is an acute angle, find the values of all the trigonometric functions of θ given that tan θ = 1. Draw a picture.

Academic Outcomes Mathematics

Model Accuracy Measures

Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood

Power and Sample Size Calculations with the Additive Hazards Model

Graph Alignment and Biological Networks

MATH 32 FALL 2012 FINAL EXAM - PRACTICE EXAM SOLUTIONS

On Development of Spoke Plot for Circular Variables

Comparing transcription factor regulatory networks of human cell types. The Protein Network Workshop June 8 12, 2015

AP CALCULUS BC Syllabus / Summer Assignment 2015

Fast approximations for the Expected Value of Partial Perfect Information using R-INLA

Quantifying Fingerprint Evidence using Bayesian Alignment

Bickel Rosenblatt test

Variability within multi-component systems. Bayesian inference in probabilistic risk assessment The current state of the art

The Role of Network Science in Biology and Medicine. Tiffany J. Callahan Computational Bioscience Program Hunter/Kahn Labs

CHAPTER ONE FUNCTIONS AND GRAPHS. In everyday life, many quantities depend on one or more changing variables eg:

Learning in Bayesian Networks

Mathematics AIMM Scope and Sequence 176 Instructional Days 16 Units

Region 16 Board of Education. Precalculus Curriculum

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

Weighted differential entropy based approaches to dose-escalation in clinical trials

Curriculum Map for Mathematics HL (DP1)

Example 2.1. Draw the points with polar coordinates: (i) (3, π) (ii) (2, π/4) (iii) (6, 2π/4) We illustrate all on the following graph:

ST MARY S DSG, KLOOF GRADE: SEPTEMBER 2017 MATHEMATICS PAPER 2

Linear Methods for Prediction

STRING: Protein association networks. Lars Juhl Jensen

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data

Published: 26 April 2017

Algorithmic approaches to fitting ERG models

Bootstrap Goodness-of-fit Testing for Wehrly Johnson Bivariate Circular Models

Optimal rules for timing intercourse to achieve pregnancy

ANALYSIS OF ORDINAL SURVEY RESPONSES WITH DON T KNOW

Linear Regression (1/1/17)

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

Complex Graphs and Networks Lecture 3: Duplication models for biological networks

Higher Order Singularities in Piecewise Linear Vector Fields

Model Selection and Geometry

Case Study: Protein Folding using Homotopy Methods

SPA for quantitative analysis: Lecture 6 Modelling Biological Processes

I. Specialization. II. Autonomous signals

Covariance function estimation in Gaussian process regression

Transcription:

LASR 2015 Geometry-Driven Statistics and its Cutting Edge Applications: Celebrating Four Decades of Leeds Statistics Workshops Programme and Abstracts Edited by K.V. Mardia, A. Gusnanto, C. Nooney & J. Voss 30th June to 2nd July 2015

Big Data in Human Genomics/Proteomics: The amount of data generated in our world has increased exponentially so that we now live in the world of Big Data. A new field has arisen that is called Data Analytics/Data Science. In genomic science, we have 23 chromosomes leading to 20,000-25,000 genes, 3-billion-letter (ATCG) in the human genome project, millions of protein sequences, more than 100,000 protein structures, each with many large number of variables. There are many challenges from drug discovery to stem cell research. Image Courtesy of U.S. Human Genome Project and Professor Gabriela C. Freue. Copyright 2015 K.V. Mardia, A. Gusnanto, C. Nooney & J. Voss. Department of Statistics, University of Leeds, U.K.

Geometry-Driven Statistics and its Cutting Edge Applications: Celebrating Four Decades of Leeds Statistics Workshops International Conference, held in Leeds, UK, 30 June-2 July 2015, The 33rd Leeds Annual Statistical Research (LASR) Workshop Sponsored and organised by the Department of Statistics, University of Leeds. Proceedings edited by K.V. Mardia, A. Gusnanto, C. Nooney & J. Voss Department of Statistics, University of Leeds, Leeds, LS2 9JT. Conference Organisers K.V. Mardia (Chairman), A. Gusnanto, J. Voss, C. Nooney, J.M. Brennan, R. Aykroyd, W.R. Gilks & J.T. Kent Copyright 2015 K.V. Mardia, A. Gusnanto, C. Nooney & J. Voss, Department of Statistics, University of Leeds, U.K. ISBN 978 0 85316 338 1

Circular Isotone Regression with an Application to Cell-cycle Biology. Cristina Rueda 1, M.A. Fernández 1, S. Barragán 1, K.V. Mardia 2 and S.D. Peddada 3 1 Department of Statistics and OR, Universidad de Valladolid, Spain 2 Department of Statistics, University of Oxford, Oxford, UK, and Department of Statistics, University of Leeds, Leeds, UK 3 Biostatistics Branch, NIEHS (NIH), Research Triangle Park, NC, USA 1 Introduction This work is motivated by a problem encountered in Molecular Biology where researchers are interested in correlating angular data from two oscillatory systems. The observations are the time to peak expression (also known as phase angle) of periodic genes under two different conditions (dose levels, organs or even species). In particular, we deal here with expression data from genes participating in the cell-cycle. Cell-biologists are often interested in drawing inferences regarding the phase angle of cell-cycle genes since they are considered to be associated with the gene s biological function (Jensen et al 2006). Several distinctive features should be taken into account to derive a correct model for this application. First, since the cell division cycle is a carefully orchestrated and periodic process, the peak expressions of cell-cycle genes follow an order according to their functions and the same order should apply for the tow different conditions (which will play the role of the response and the explicative variables). Then, the model should assure that the response must run exactly one cycle as the explicative runs one cycle without moving back. Second, since cells goes through 4 phases with different biological functions and even with different lengths across species, the model approach should be flexible to deal with possible different correlations from phase to phase. While the regression model proposed in Downs and Mardia (2002) is likely to perform well when the duration of time spent by a cell in different phases of a cell-cycle is same across all species, it may be too rigid when the duration of time is not same across different species as the lengths of the four phases in which the cell-cycle is divided change from one species to another, so that the functional relationship between species may be different in each of the phases. On the other side, the non-parametric alternatives, in particular the proposal developed in Di Marzio et al (2013) do not always fulfill the two important conditions, commented above, that the regression should verify: monotonicity and synchronicity. In this paper we develop regression models able to deal with these features. In particular, we introduce a general isotonic regression model and a flexible piecewise regression model that can be useful for drawing inferences when the duration of time spent in different phases by a cell varies across species. 2 Circular Isotonic Regression models Let (ψ i, θ i ), i = 1,..., n denote a random sample from the circular response Ψ and from the circular independent variable Θ. Assume that the ψ i values come from independent von Mises 101

distributions M(φ i, κ). 2.1 The general Isotonic Model The CIRE (Circular Isotonic Regression Estimator) of Ψ is defined as Φ (O) = arg min Φ C Θ SCE(Ψ, Φ), where C Θ is the circular order induced by the independent Θ variable, C Θ = {Φ [0, 2π) n : φ a φ b φ c φ a θ a θ b θ c θ a }. The CIRE exists, is almost sure unique, may be obtained from circular means of adjacent angles and is the restricted MLE under the Von-Mises model (see Rueda et al 2009). The Circular Isotonic regression Model is simply defined as Φ = f(θ) where f preserves the order induced by Θ. 2.2 A Piecewise Isotonic Model Consider k different sectors (pieces) in the independent variable. Then, we have (ψ ij, θ ij ) with i = 1,..., k and j = 1,..., n where n i is the number of observations in sector i, and θi, i = 1, 2,..., k are the sector borders or change points with θi < θ ij θi+1 and θk+1 = θ 1. The Piecewise Circular-Circular Model is defined as ( φ ij = µ + 2 arctan ω i tan 1 ) 2 (θ ij ν i ), subject to, ω i tan 1 2 (θ i ν i ) = ω i 1 tan 1 2 (θ i ν i 1 ) for i = 1,..., k, where ν 0 = ν k, ω 0 = ω k, µ is a global location parameter quantifying the rotation of the response that allows a better congruence with the independent variable; ν i is the location parameter in sector (θi, θi+1) and ω i is the slope parameter in the sector (θi, θi+1). In the particular application of cell-cycle data, the sectors are determined by four phases of the cell-cycle and the estimation problem is solved via maximum likelihood subject to the following restrictions: Continuity Restrictions: ( ) θ tan i+1 ν i+1 2 ω i = ( ) θ ω tan i+1 ν i+1 for i = 1,..., k 1. i Monotonicity Restrictions: 2 ω i 0 for i = 1,..., k. Synchronicity Restrictions: ( 1 Let z i = ν i + 2 arctan ω i tan ( ) ) µ for i = 1,..., k, be a possible zero of ith piece of the 2 function. Then { z i : z i ( θ i, θ i+1]} = 1. This model is presented in the forthcoming paper Rueda et al (2015). 102

3 Application We have analyzed data from 32 periodic genes in two species of yeast, S. cerevisiae and S. pombe, considering 2 experiments from S. cerevisiae that we denote as Ca (which is the explicative in all models considered) and Cb (which is used as response in one of the models) and 2 experiments from S. pombe that we denoted as Pa and Pb (used as responses in the other two models presented here). We have fitted the Circular-Circular regression model from Downs and Mardia (2002), the nonparametric circular model from Di Marzio et al (2012), and the two models defined in section 2. The statistics used to select between models are, the Circular Distance Criterion CDC(M), which is a sort of lack of fit criterion and is defined as CDC(M) = 1 k ni N i=1 j=1 e ij, where e ij = 1 (cos(ψ ij ψ ij )) and a Generalized Akaike Information measure (GAIC) that is defined as GAIC(M) = 2 ln(l(m)) 2GDF, where l(m) is the model likelihood and GDF is a penalization factor obtained as the sum of the sensitivity of each fitted value to perturbation in the corresponding observed value. The main advantage of this measure is that is applicable to complex modeling procedures including nonparametric and restricted models (see Ye 1998 and Zhang et al 2012). Full details of this criterion appear in Rueda et al (2015). The goodness of fit statistics for the different regression models are given in the following table. Full interpretation of these results together with the appropriate graphs will be given during the talk. Experiment Statistic Parametric Piecewise Isotone Non-Parametric Pa/Ca CDC 0,3938 0,3323 0,2688 0,1671 Pb/Ca CDC 0,3682 0,1657 0,2289 0,1970 Cb/Ca CDC 0,1570 0,1461 0,0806 0,1292 Pa/Ca GAIC 19,4116 19,8093 22,4058 31,1877 Pb/Ca GAIC 23,4580 37,6419 46,4377 36,4998 Cb/Ca GAIC 55,9770 55,4740 61,9504 57,4613 4 Conclusions Restricted regression models have proved very useful to describe relationships between cellcycle data expressions from different species. In particular, the piecewise model has several interesting advantages for this application. It is simple and interpretable, flexible enough to describe different correlations depending on the sector and versatile, as it can handle restrictions of monotonicity and synchronicity. Related problems arise in other fields such as in circadian biology, metabolic cycle, evolutionary psychology or motor behavior. Other application where the piecewise model will be useful is for characterizing patterns of hormones during the menstrual cycle (with three distinct phases: follicular, ovulation and luteal). References Di Marzio, M., Panzera, A., Taylor, C.C. (2013). Non-parametric Regression for Circular Responses. Scandinavian Journal of Statistics, 40, 238-255. Downs, T.D. and Mardia, K.V. (2002). Circular Regression, Biometrika, 89, 683-697. 103

Jensen, J.L., Jensen T.S., Lichtenberg, U., Brunak, S. and Bork, P. (2006). Co-evolution of transcriptional and post-translational cell-cycle regulation. Nature, 443, 594-597. Rueda, C., Fernández, M.A. and Peddada, S. (2009). Estimation of Parameters Subject to Order Restrictions on a Circle with Application to Estimation of Phase Angles of Cell- Cycle Genes Journal of the American Statistical Association, 104, 338-347. Rueda, C., Fernández, M.A., Barragán, S., Mardia, K.V. and Peddada, S. (2015). Circular Piecewise Regression with an Application to Cell-cycle Biology. Preprint. Ye, J. (1998). On Measuring and Correcting the Effects of Data Mining and Model Selection. Journal of the American Statistical Association, 93, 120-131. Zhang, B., Shen, X. and Mumford, S.L. (2012). Generalized Degrees of Freedom and Adaptive Model Selection in Linear Mixed-Effects Models. Computational Statistics and Data Analysis, 56, 574-586. 104