Latent Variable Methods for the Analysis of Genomic Data

Size: px
Start display at page:

Download "Latent Variable Methods for the Analysis of Genomic Data"

Transcription

1 John D. Storey Center for Statistics and Machine Learning & Lewis-Sigler Institute for Integrative Genomics Latent Variable Methods for the Analysis of Genomic Data

2 Data m variables and n observations high5dimensional"variables" observa9ons" primary"vars" observa9ons" gene expression SNP genotypes intercept + treatment intercept + clinical outcome intercept + trait

3 High-dimensional regression model = + Y = B X + E Goal is to do inference on each row of B, which connects variation in X to each row of Y

4 Example steps for simultaneous inference 1 Formulate model Y = BX + E, which has null version Y = B 0 X + E 2 Fit models to obtain B and B 0 3 Compare goodness of fits B and B 0 for each genomic variable to obtain m test statistics 4 Calculate a null distribution to obtain m p-values 5 Perform some type of multiple testing correction (such as FDR or local FDR)

5 Incorporating latent variables 1 Formulate model Y = BX + E, which has null version Y = B 0 X + E 2 Fit models to obtain B and B 0 3 Compare goodness of fits B and B 0 for each genomic variable to obtain m test statistics 4 Calculate a null distribution to obtain m p-values 5 Perform some type of multiple testing correction (such as FDR or local FDR) We have been developing models, theory, and methods that replaces Y = BX + E with Y = BX + ΦM + E where M are low-dimensional latent variables.

6 Known variable + latent variable model observed" unobserved" unobserved,"inference"object" =" +" g(e[y X,M]) = B X + Φ M M accounts for dependence, unknown sources of systematic variation, structure, etc.

7 Gene expression studies observed" unobserved" unobserved,"inference"object" =" +" g(e[y X,M]) = B X + Φ M Y m n gene expression data, X d n study design matrix M batch effects, biological heterogeneity, etc. Y X, M Normal, Poisson, or NegBin

8 Genome-wide association studies observed" unobserved" unobserved,"inference"object" =" +" g(e[x Y,M]) = B Y + Φ M X m n SNP genotypes, Y 1 n trait variable M population structure X Y,M Binomial

9 Surrogate variable analysis (SVA) Introduced in Leek and Storey (2007) PLoS Genetics observed" unobserved" unobserved,"inference"object" =" +" E[Y X,M] = B X + Φ M Y m n = (y ij ) response variables where y ij R X d n study design matrix Y X, M Normal (not necessary, but helpful)

10 Generalized surrogate variable analysis (gsva) observed" unobserved" unobserved,"inference"object" =" +" g(e[y X,M]) = B X + Φ M Y m n response variables X d n study design matrix Y X, M exponential family distribution g( ) link function

11 Part 1. Surrogate variable analysis (SVA) and cross-dimensional inference (CDI) Leek JT and Storey JD. (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3: e161. Leek JT and Storey JD. (2008) A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences, 105: Desai KH and Storey JD. (2012) Cross-dimensional inference of dependent high-dimensional data. Journal of the American Statistical Association, 107(497): Storey JD, Desai KH, and Chen X. (2016) Empirical Bayes inference of dependent high-dimensional data. Submitted.

12 Data high5dimensional"variables" observa9ons" primary"vars" observa9ons" X m n = (x ij ) x ij R gene expression Y d n study design

13 Motivating study: Inflammation and the Host Response to Injury Patient 1 Patient 2. Patient 168 MOF multiple organ failure score Leukocyte) mrna) Expression)) Clinical Data ~400 clinical variables

14 Poor replication Phase&1& Frequency) Frequency) Phase&2& P6value) P6value) Phase&3& Frequency) Frequency) Phase&4& P6value) P6value)

15 Accounting for latent variables Phase 1 Phase 2 Phase 3 Phase 4 Frequency) Frequency) Frequency) Frequency) Frequency) P6value) P6value) P6value) P6value) Surrogate)Variable)Analysis) Frequency) Frequency) Frequency) P6value) P6value) P6value) P6value)

16 SVA model observed" unobserved" unobserved,"inference"object" =" +" E[Y X,M] = B X + Φ M Y m n gene expression data X d n study design matrix Y X,M Normal

17 Normal model =" +" Y = BX + E e j iid N(0,Σ)

18 SVA and Cross Dimensional Inference (CDI) Leek and Storey (2007) introduce the SVA model, motivated as a way to capture unmodeled systemic variation Leek and Storey (2008) show that the SVA model also captures general forms of dependence among variables Desai and Storey (2012) study the SVA model under the Normal distribution and derive a penalized maximum likelihood approach for estimating it called cross-dimensional inference (CDI) Storey, Desai, and Chen (2016) show that CDI applied to the SVA model yields the Bayes estimator under the multivariate Normal scenario with Normal prior on B Storey, Desai, and Chen (2016) provide an empirical Bayes CDI approach for estimating the SVA model

19 Part 2. Genotype-conditional association test Song M, Hao W, and Storey JD. (2015) Testing for genetic associations in arbitrarily structured populations. Nature Genetics, 47: Hao W, Song M, and Storey JD. (2016) Probabilistic models of genetic variation in structured populations applied to global human studies. Bioinformatics, 32: Gopalan P, Hao W, Blei DM, and Storey JD. (2016) Scaling probabilistic models of genetic variation to millions of humans. Nature Genetics, in press.

20 Data high5dimensional"variables" observa9ons" primary"vars" observa9ons" X m n = (x ij ) x ij {0,1,2} SNP genotypes y = (y 1,y 2,...,y n ) trait of interest

21 Trait models Quantitative trait: Binary trait: y j = α + m i=1 β ix ij + λ j + ɛ j log ( Pr(yj =1) Pr(y j =0) ) = α + m i=1 β ix ij + λ j β i is the genetic effect of SNP i on the trait λ j is the random non-genetic effect ɛ j N(0,σ 2 ) is the random noise variation j

22 a Confounding in GWAS π 1 (z) SNP x 1 π 2 (z) SNP x 2 Structure, Lifestyle, Environment z. non-genetic λ(z) Trait y SNP x i π i (z). π m (z) SNP x m

23 Confounding in GWAS Quantitative trait: Binary trait: y j = α + m i=1 β ix ij + λ j + ɛ j log ( Pr(yj =1) Pr(y j =0) ) = α + m i=1 β ix ij + λ j x j = (x 1j,...,x m,j ) T, λ j, and σ 2 j [from ɛ j N(0,σ 2 j )] may all be functions of z j

24 Genotype conditional association test (GCAT) Perform a hypothesis test of b i = 0 vs. b i 0 in the following model: ( ) E[xij y j,z j ] logit = a i + b i y j + logit(π ij ) 2 1 formulate and estimate a linear basis for logit(π ij ), called logistic factors 2 fit a logistic regression of SNP i trait + logistic factors 3 perform a generalized likelihood ratio test of the trait s coefficient

25 Theorem Quantitative trait: Binary trait: GC model: y j = α + m i=1 β ix ij + λ j + ɛ j log ( Pr(yj =1) Pr(y j =0) 2 ) = α + m i=1 β ix ij + λ j ( ) E[xij y j,z j ] logit = a i + b i y j + logit(π ij ) Suppose that trait values y j are distributed according to either of the above trait models and SNPs x ij π ij Binomial(2,π ij ) as described earlier. Then β i = 0 in either trait model implies that b i = 0 in the GC model.

26 Latent variable model observed" unobserved" unobserved,"inference"object" =" +" logit(e[x Y, M]/2) = B Y + Φ M X Y,M Binomial

27 Oracle - - GCAT LMM EMMAX LMM GEMMA PCA Balding Nichols Observed False Positives Expected False Positives 0 HGDP PSD α = Spatial a = Expected False Positives TGP

28 Part 3. Jackstraw method Chung NC and Storey JD. (2015) Statistical significance of variables driving systematic variation. Bioinformatics, 31:

29 Focus on latent variable part of the model observed" unobserved" =" +" Y = Φ M + E

30 How to do inference on Φ? observed" unobserved" unobserved,"inference"object" =" +" Y = Φ M + E

31 Estimating M Let Y be row-wise mean-centered and Y = ΦM + E. SVD of Y = UDV T, where V T r is first r rows of V T. Leek (2010) Biometrics shows that under reasonable assumptions: [V ] T r V r σ 2 avgi MM T rowspace 0 as number of variables (rows of Y ) goes to for fixed sample size (columns of Y ). In other words, V T r is a reasonable estimate of M.

32 Models Utilized Y = ΦM + E = ΓV T r + E We perform inference on the rows of Γ: γ 1,...,γ m. Suppose we want to test γ i = 0 vs. γ i 0 for each i = 1,...,m.

33 Preserve structure and create synthetic nulls n observations m variables Y Permute s rows Y* V T r V *T r F-statistics F-statistics

34 Preserve structure and create synthetic nulls n observations m variables Y Permute s rows Y* Y* Y* Y* Y* Y* Y* V T r F-statistics V *T LF* r LF* LF* LF* LF* LF* F-statistics Compare to get p-values

35 Yeast cell cycle analysis (a) (a) (c) Minutes Principal Magnitude Component Values st PC 2nd PC Minutes Post Elucidation (b) Percent Variance Explained Principal Component

36 Part 4. Consistently estimating M Chen X and Storey JD. (2015) Consistent estimation of low-dimensional latent structure in high-dimensional data. arxiv:

37 Consistently estimating M =" E[Y M] = Φ M Y m n = (y ij ) distributed according to a single parameter exponential family distribution We show how to estimate with probability 1 the row-space of M and consistently estimate its dimension

38 Acknowledgements Funding: National Institutes of Health Office of Naval Research Co-authors: Xiongzhi Chee Chen Neo Chung Keyur Desai Wei Hao Jeff Leek Minsun Song Lab members: Andrew Bass Irineo Cabreros Chee Chen Sean Hackett Wei Hao Emily Nelson Alejandro Ochoa

Implicit Causal Models for Genome-wide Association Studies

Implicit Causal Models for Genome-wide Association Studies Implicit Causal Models for Genome-wide Association Studies Dustin Tran Columbia University David M. Blei Columbia University Abstract Progress in probabilistic generative models has accelerated, developing

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Sta$s$cs for Genomics ( )

Sta$s$cs for Genomics ( ) Sta$s$cs for Genomics (140.688) Instructor: Jeff Leek Slide Credits: Rafael Irizarry, John Storey No announcements today. Hypothesis testing Once you have a given score for each gene, how do you decide

More information

Confounder Adjustment in Multiple Hypothesis Testing

Confounder Adjustment in Multiple Hypothesis Testing in Multiple Hypothesis Testing Department of Statistics, Stanford University January 28, 2016 Slides are available at http://web.stanford.edu/~qyzhao/. Collaborators Jingshu Wang Trevor Hastie Art Owen

More information

PCA and admixture models

PCA and admixture models PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

More information

Asymptotic distribution of the largest eigenvalue with application to genetic data

Asymptotic distribution of the largest eigenvalue with application to genetic data Asymptotic distribution of the largest eigenvalue with application to genetic data Chong Wu University of Minnesota September 30, 2016 T32 Journal Club Chong Wu 1 / 25 Table of Contents 1 Background Gene-gene

More information

An Adaptive Association Test for Microbiome Data

An Adaptive Association Test for Microbiome Data An Adaptive Association Test for Microbiome Data Chong Wu 1, Jun Chen 2, Junghi 1 Kim and Wei Pan 1 1 Division of Biostatistics, School of Public Health, University of Minnesota; 2 Division of Biomedical

More information

High-Throughput Sequencing Course

High-Throughput Sequencing Course High-Throughput Sequencing Course DESeq Model for RNA-Seq Biostatistics and Bioinformatics Summer 2017 Outline Review: Standard linear regression model (e.g., to model gene expression as function of an

More information

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017 Lecture 7: Interaction Analysis Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 39 Lecture Outline Beyond main SNP effects Introduction to Concept of Statistical Interaction

More information

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q)

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q) Supplementary information S7 Testing for association at imputed SPs puted SPs Score tests A Score Test needs calculations of the observed data score and information matrix only under the null hypothesis,

More information

Methods for Cryptic Structure. Methods for Cryptic Structure

Methods for Cryptic Structure. Methods for Cryptic Structure Case-Control Association Testing Review Consider testing for association between a disease and a genetic marker Idea is to look for an association by comparing allele/genotype frequencies between the cases

More information

Marginal Screening and Post-Selection Inference

Marginal Screening and Post-Selection Inference Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2

More information

Latent Variable models for GWAs

Latent Variable models for GWAs Latent Variable models for GWAs Oliver Stegle Machine Learning and Computational Biology Research Group Max-Planck-Institutes Tübingen, Germany September 2011 O. Stegle Latent variable models for GWAs

More information

GWAS IV: Bayesian linear (variance component) models

GWAS IV: Bayesian linear (variance component) models GWAS IV: Bayesian linear (variance component) models Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS IV: Bayesian

More information

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018 High-Throughput Sequencing Course Multiple Testing Biostatistics and Bioinformatics Summer 2018 Introduction You have previously considered the significance of a single gene Introduction You have previously

More information

TESTING SIGNIFICANCE OF FEATURES BY LASSOED PRINCIPAL COMPONENTS. BY DANIELA M. WITTEN 1 AND ROBERT TIBSHIRANI 2 Stanford University

TESTING SIGNIFICANCE OF FEATURES BY LASSOED PRINCIPAL COMPONENTS. BY DANIELA M. WITTEN 1 AND ROBERT TIBSHIRANI 2 Stanford University The Annals of Applied Statistics 2008, Vol. 2, No. 3, 986 1012 DOI: 10.1214/08-AOAS182 Institute of Mathematical Statistics, 2008 TESTING SIGNIFICANCE OF FEATURES BY LASSOED PRINCIPAL COMPONENTS BY DANIELA

More information

Variable selection and machine learning methods in causal inference

Variable selection and machine learning methods in causal inference Variable selection and machine learning methods in causal inference Debashis Ghosh Department of Biostatistics and Informatics Colorado School of Public Health Joint work with Yeying Zhu, University of

More information

CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING

CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING Submitted to the Annals of Statistics CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING By Jingshu Wang, Qingyuan Zhao, Trevor Hastie, Art B. Owen Stanford University We consider large-scale studies

More information

In many areas of science, there has been a rapid increase in the

In many areas of science, there has been a rapid increase in the A general framework for multiple testing dependence Jeffrey T. Leek a and John D. Storey b,1 a Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD 21287; and b Lewis-Sigler

More information

Non-specific filtering and control of false positives

Non-specific filtering and control of false positives Non-specific filtering and control of false positives Richard Bourgon 16 June 2009 bourgon@ebi.ac.uk EBI is an outstation of the European Molecular Biology Laboratory Outline Multiple testing I: overview

More information

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2016

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2016 CPSC 340: Machine Learning and Data Mining More PCA Fall 2016 A2/Midterm: Admin Grades/solutions posted. Midterms can be viewed during office hours. Assignment 4: Due Monday. Extra office hours: Thursdays

More information

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES Saurabh Ghosh Human Genetics Unit Indian Statistical Institute, Kolkata Most common diseases are caused by

More information

Chapter 1. Modeling Basics

Chapter 1. Modeling Basics Chapter 1. Modeling Basics What is a model? Model equation and probability distribution Types of model effects Writing models in matrix form Summary 1 What is a statistical model? A model is a mathematical

More information

scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017

scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017 scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017 Olga (NBIS) scrna-seq de October 2017 1 / 34 Outline Introduction: what

More information

Introduction to Bioinformatics

Introduction to Bioinformatics CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics

More information

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Lucas Janson, Stanford Department of Statistics WADAPT Workshop, NIPS, December 2016 Collaborators: Emmanuel

More information

Inference with Transposable Data: Modeling the Effects of Row and Column Correlations

Inference with Transposable Data: Modeling the Effects of Row and Column Correlations Inference with Transposable Data: Modeling the Effects of Row and Column Correlations Genevera I. Allen Department of Pediatrics-Neurology, Baylor College of Medicine, Jan and Dan Duncan Neurological Research

More information

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Xiaoquan Wen Department of Biostatistics, University of Michigan A Model

More information

Probabilistic Models for Learning Data Representations. Andreas Damianou

Probabilistic Models for Learning Data Representations. Andreas Damianou Probabilistic Models for Learning Data Representations Andreas Damianou Department of Computer Science, University of Sheffield, UK IBM Research, Nairobi, Kenya, 23/06/2015 Sheffield SITraN Outline Part

More information

Correction for Hidden Confounders in the Genetic Analysis of Gene Expression

Correction for Hidden Confounders in the Genetic Analysis of Gene Expression 1 Supplemental Information for: Correction for Hidden Confounders in the Genetic Analysis of Gene Expression Jennifer Listgarten, Carl Kadie, Eric E. Schadt, and David Heckerman Proceedings of the National

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2013 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Meghana Kshirsagar (mkshirsa), Yiwen Chen (yiwenche) 1 Graph

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

STAT 526 Spring Midterm 1. Wednesday February 2, 2011

STAT 526 Spring Midterm 1. Wednesday February 2, 2011 STAT 526 Spring 2011 Midterm 1 Wednesday February 2, 2011 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points

More information

H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL

H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL Intesar N. El-Saeiti Department of Statistics, Faculty of Science, University of Bengahzi-Libya. entesar.el-saeiti@uob.edu.ly

More information

3 Comparison with Other Dummy Variable Methods

3 Comparison with Other Dummy Variable Methods Stats 300C: Theory of Statistics Spring 2018 Lecture 11 April 25, 2018 Prof. Emmanuel Candès Scribe: Emmanuel Candès, Michael Celentano, Zijun Gao, Shuangning Li 1 Outline Agenda: Knockoffs 1. Introduction

More information

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity Binomial Mixture Model-based Association Tests under Genetic Heterogeneity Hui Zhou, Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 April 30,

More information

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 Association Testing with Quantitative Traits: Common and Rare Variants Timothy Thornton and Katie Kerr Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 1 / 41 Introduction to Quantitative

More information

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017 CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).

More information

Sample Size Estimation for Studies of High-Dimensional Data

Sample Size Estimation for Studies of High-Dimensional Data Sample Size Estimation for Studies of High-Dimensional Data James J. Chen, Ph.D. National Center for Toxicological Research Food and Drug Administration June 3, 2009 China Medical University Taichung,

More information

Dimensionality Reduction for Exponential Family Data

Dimensionality Reduction for Exponential Family Data Dimensionality Reduction for Exponential Family Data Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Andrew Landgraf July 2-6, 2018 Computational Strategies for Large-Scale

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

GWAS with mixed models

GWAS with mixed models GWAS with mixed models (the trip from 10 0 to 10 8 ) Yurii Aulchenko yurii [dot] Aulchenko [at] gmail [dot] com Twitter: @YuriiAulchenko YuriiA consulting 1 Outline Methodology development Speeding things

More information

Multi-level Models: Idea

Multi-level Models: Idea Review of 140.656 Review Introduction to multi-level models The two-stage normal-normal model Two-stage linear models with random effects Three-stage linear models Two-stage logistic regression with random

More information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture16: Population structure and logistic regression I Jason Mezey jgm45@cornell.edu April 11, 2017 (T) 8:40-9:55 Announcements I April

More information

Generalized Linear Models (1/29/13)

Generalized Linear Models (1/29/13) STA613/CBB540: Statistical methods in computational biology Generalized Linear Models (1/29/13) Lecturer: Barbara Engelhardt Scribe: Yangxiaolu Cao When processing discrete data, two commonly used probability

More information

A General Framework for Variable Selection in Linear Mixed Models with Applications to Genetic Studies with Structured Populations

A General Framework for Variable Selection in Linear Mixed Models with Applications to Genetic Studies with Structured Populations A General Framework for Variable Selection in Linear Mixed Models with Applications to Genetic Studies with Structured Populations Joint work with Karim Oualkacha (UQÀM), Yi Yang (McGill), Celia Greenwood

More information

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data DEGseq: an R package for identifying differentially expressed genes from RNA-seq data Likun Wang Zhixing Feng i Wang iaowo Wang * and uegong Zhang * MOE Key Laboratory of Bioinformatics and Bioinformatics

More information

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Previous lecture P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Interaction Outline: Definition of interaction Additive versus multiplicative

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Zhenqiu Liu, Dechang Chen 2 Department of Computer Science Wayne State University, Market Street, Frederick, MD 273,

More information

Introduction to Logistic Regression

Introduction to Logistic Regression Introduction to Logistic Regression Problem & Data Overview Primary Research Questions: 1. What are the risk factors associated with CHD? Regression Questions: 1. What is Y? 2. What is X? Did player develop

More information

Quantile based Permutation Thresholds for QTL Hotspots. Brian S Yandell and Elias Chaibub Neto 17 March 2012

Quantile based Permutation Thresholds for QTL Hotspots. Brian S Yandell and Elias Chaibub Neto 17 March 2012 Quantile based Permutation Thresholds for QTL Hotspots Brian S Yandell and Elias Chaibub Neto 17 March 2012 2012 Yandell 1 Fisher on inference We may at once admit that any inference from the particular

More information

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Yale School of Public Health Joint work with Ning Hao, Yue S. Niu presented @Tsinghua University Outline 1 The Problem

More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

Causal Model Selection Hypothesis Tests in Systems Genetics

Causal Model Selection Hypothesis Tests in Systems Genetics 1 Causal Model Selection Hypothesis Tests in Systems Genetics Elias Chaibub Neto and Brian S Yandell SISG 2012 July 13, 2012 2 Correlation and Causation The old view of cause and effect... could only fail;

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Raied Aljadaany, Shi Zong, Chenchen Zhu Disclaimer: A large

More information

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington Mingon Kang, Ph.D. Computer Science, Kennesaw State University Problems

More information

Niche Modeling. STAMPS - MBL Course Woods Hole, MA - August 9, 2016

Niche Modeling. STAMPS - MBL Course Woods Hole, MA - August 9, 2016 Niche Modeling Katie Pollard & Josh Ladau Gladstone Institutes UCSF Division of Biostatistics, Institute for Human Genetics and Institute for Computational Health Science STAMPS - MBL Course Woods Hole,

More information

Bayesian Regression (1/31/13)

Bayesian Regression (1/31/13) STA613/CBB540: Statistical methods in computational biology Bayesian Regression (1/31/13) Lecturer: Barbara Engelhardt Scribe: Amanda Lea 1 Bayesian Paradigm Bayesian methods ask: given that I have observed

More information

Vector Space Models. wine_spectral.r

Vector Space Models. wine_spectral.r Vector Space Models 137 wine_spectral.r Latent Semantic Analysis Problem with words Even a small vocabulary as in wine example is challenging LSA Reduce number of columns of DTM by principal components

More information

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen Harvard University Harvard University Biostatistics Working Paper Series Year 2014 Paper 175 A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome Eric Tchetgen Tchetgen

More information

The gpca Package for Identifying Batch Effects in High-Throughput Genomic Data

The gpca Package for Identifying Batch Effects in High-Throughput Genomic Data The gpca Package for Identifying Batch Effects in High-Throughput Genomic Data Sarah Reese July 31, 2013 Batch effects are commonly observed systematic non-biological variation between groups of samples

More information

Measures of Association and Variance Estimation

Measures of Association and Variance Estimation Measures of Association and Variance Estimation Dipankar Bandyopadhyay, Ph.D. Department of Biostatistics, Virginia Commonwealth University D. Bandyopadhyay (VCU) BIOS 625: Categorical Data & GLM 1 / 35

More information

MACAU 2.0 User Manual

MACAU 2.0 User Manual MACAU 2.0 User Manual Shiquan Sun, Jiaqiang Zhu, and Xiang Zhou Department of Biostatistics, University of Michigan shiquans@umich.edu and xzhousph@umich.edu April 9, 2017 Copyright 2016 by Xiang Zhou

More information

Supplementary Information

Supplementary Information Supplementary Information 1 Supplementary Figures (a) Statistical power (p = 2.6 10 8 ) (b) Statistical power (p = 4.0 10 6 ) Supplementary Figure 1: Statistical power comparison between GEMMA (red) and

More information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture 18: Introduction to covariates, the QQ plot, and population structure II + minimal GWAS steps Jason Mezey jgm45@cornell.edu April

More information

Statistics 572 Semester Review

Statistics 572 Semester Review Statistics 572 Semester Review Final Exam Information: The final exam is Friday, May 16, 10:05-12:05, in Social Science 6104. The format will be 8 True/False and explains questions (3 pts. each/ 24 pts.

More information

Applied Statistics Qualifier Examination (Part II of the STAT AREA EXAM) January 25, 2017; 11:00AM-1:00PM

Applied Statistics Qualifier Examination (Part II of the STAT AREA EXAM) January 25, 2017; 11:00AM-1:00PM Instructions: Applied Statistics Qualifier Examination (Part II of the STAT AREA EXAM) January 5, 017; 11:00AM-1:00PM (1) The examination contains 4 Questions. You are to answer 3 out of 4 of them. ***

More information

Lecture 1 Introduction to Multi-level Models

Lecture 1 Introduction to Multi-level Models Lecture 1 Introduction to Multi-level Models Course Website: http://www.biostat.jhsph.edu/~ejohnson/multilevel.htm All lecture materials extracted and further developed from the Multilevel Model course

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Glenn Heller and Jing Qin Department of Epidemiology and Biostatistics Memorial

More information

Econometrics Lecture 5: Limited Dependent Variable Models: Logit and Probit

Econometrics Lecture 5: Limited Dependent Variable Models: Logit and Probit Econometrics Lecture 5: Limited Dependent Variable Models: Logit and Probit R. G. Pierse 1 Introduction In lecture 5 of last semester s course, we looked at the reasons for including dichotomous variables

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04

More information

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms Introduction to QTL mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University kbroman@jhsph.edu www.biostat.jhsph.edu/ kbroman Outline Experiments and data Models ANOVA

More information

Package KMgene. November 22, 2017

Package KMgene. November 22, 2017 Type Package Package KMgene November 22, 2017 Title Gene-Based Association Analysis for Complex Traits Version 1.2 Author Qi Yan Maintainer Qi Yan Gene based association test between a

More information

PQL Estimation Biases in Generalized Linear Mixed Models

PQL Estimation Biases in Generalized Linear Mixed Models PQL Estimation Biases in Generalized Linear Mixed Models Woncheol Jang Johan Lim March 18, 2006 Abstract The penalized quasi-likelihood (PQL) approach is the most common estimation procedure for the generalized

More information

Figure 36: Respiratory infection versus time for the first 49 children.

Figure 36: Respiratory infection versus time for the first 49 children. y BINARY DATA MODELS We devote an entire chapter to binary data since such data are challenging, both in terms of modeling the dependence, and parameter interpretation. We again consider mixed effects

More information

Fractal functional regression for classification of gene expression data by wavelets

Fractal functional regression for classification of gene expression data by wavelets Fractal functional regression for classification of gene expression data by wavelets Margarita María Rincón 1 and María Dolores Ruiz-Medina 2 1 University of Granada Campus Fuente Nueva 18071 Granada,

More information

Research Statement on Statistics Jun Zhang

Research Statement on Statistics Jun Zhang Research Statement on Statistics Jun Zhang (junzhang@galton.uchicago.edu) My interest on statistics generally includes machine learning and statistical genetics. My recent work focus on detection and interpretation

More information

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models 02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput

More information

Estimating subgroup specific treatment effects via concave fusion

Estimating subgroup specific treatment effects via concave fusion Estimating subgroup specific treatment effects via concave fusion Jian Huang University of Iowa April 6, 2016 Outline 1 Motivation and the problem 2 The proposed model and approach Concave pairwise fusion

More information

LARGE NUMBERS OF EXPLANATORY VARIABLES. H.S. Battey. WHAO-PSI, St Louis, 9 September 2018

LARGE NUMBERS OF EXPLANATORY VARIABLES. H.S. Battey. WHAO-PSI, St Louis, 9 September 2018 LARGE NUMBERS OF EXPLANATORY VARIABLES HS Battey Department of Mathematics, Imperial College London WHAO-PSI, St Louis, 9 September 2018 Regression, broadly defined Response variable Y i, eg, blood pressure,

More information

Logistic Regression. Machine Learning Fall 2018

Logistic Regression. Machine Learning Fall 2018 Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes

More information

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power Proportional Variance Explained by QTL and Statistical Power Partitioning the Genetic Variance We previously focused on obtaining variance components of a quantitative trait to determine the proportion

More information

STA 216, GLM, Lecture 16. October 29, 2007

STA 216, GLM, Lecture 16. October 29, 2007 STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural

More information

High-throughput Testing

High-throughput Testing High-throughput Testing Noah Simon and Richard Simon July 2016 1 / 29 Testing vs Prediction On each of n patients measure y i - single binary outcome (eg. progression after a year, PCR) x i - p-vector

More information

Introduction to Statistical Genetics (BST227) Lecture 6: Population Substructure in Association Studies

Introduction to Statistical Genetics (BST227) Lecture 6: Population Substructure in Association Studies Introduction to Statistical Genetics (BST227) Lecture 6: Population Substructure in Association Studies Confounding in gene+c associa+on studies q What is it? q What is the effect? q How to detect it?

More information

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#7:(Mar-23-2010) Genome Wide Association Studies 1 The law of causality... is a relic of a bygone age, surviving, like the monarchy,

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature25973 Power Simulations We performed extensive power simulations to demonstrate that the analyses carried out in our study are well powered. Our simulations indicate very high power for

More information

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto.

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto. Introduction to Dalla Lana School of Public Health University of Toronto olli.saarela@utoronto.ca September 18, 2014 38-1 : a review 38-2 Evidence Ideal: to advance the knowledge-base of clinical medicine,

More information

Bayes methods for categorical data. April 25, 2017

Bayes methods for categorical data. April 25, 2017 Bayes methods for categorical data April 25, 2017 Motivation for joint probability models Increasing interest in high-dimensional data in broad applications Focus may be on prediction, variable selection,

More information

Stat 642, Lecture notes for 04/12/05 96

Stat 642, Lecture notes for 04/12/05 96 Stat 642, Lecture notes for 04/12/05 96 Hosmer-Lemeshow Statistic The Hosmer-Lemeshow Statistic is another measure of lack of fit. Hosmer and Lemeshow recommend partitioning the observations into 10 equal

More information

Unlocking RNA-seq tools for zero inflation and single cell applications using observation weights

Unlocking RNA-seq tools for zero inflation and single cell applications using observation weights Unlocking RNA-seq tools for zero inflation and single cell applications using observation weights Koen Van den Berge, Ghent University Statistical Genomics, 2018-2019 1 The team Koen Van den Berge Fanny

More information

Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20

Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20 Logistic regression 11 Nov 2010 Logistic regression (EPFL) Applied Statistics 11 Nov 2010 1 / 20 Modeling overview Want to capture important features of the relationship between a (set of) variable(s)

More information

Previous lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure)

Previous lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure) Previous lecture Single variant association Use genome-wide SNPs to account for confounding (population substructure) Estimation of effect size and winner s curse Meta-Analysis Today s outline P-value

More information

Introduction to Signal Detection and Classification. Phani Chavali

Introduction to Signal Detection and Classification. Phani Chavali Introduction to Signal Detection and Classification Phani Chavali Outline Detection Problem Performance Measures Receiver Operating Characteristics (ROC) F-Test - Test Linear Discriminant Analysis (LDA)

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information