Principal Variance Components Analysis for Quantifying Variability in Genomics Data

Principal Variance Components Analysis for Quantifying Variability in Genomics Data
Tzu-Ming Chu, SAS Institute Inc.

Outline
- Motivation
- PVCA (PCA + VCA)
- Example: Mouse Lung Tumorigenicity Data
- Grouped Batch Profile (GBP) Normalization
- Discussion

Motivation - ANOVA
Estimating variance components among samples with ANOVA. Some ANOVA approaches applied to microarrays:
- Kerr et al. 2000 (full-gene model)
- Wolfinger et al. 2001 (gene-by-gene mixed model)
- Chu et al. 2002 (probe-level mixed model)
- Feng et al. 2006 (empirical Bayesian mixed model)

Motivation - ANOVA Issues
- Gene-by-gene models ignore correlations between genes
- Quantifying variation across genes is difficult
Model: Y = Crspding_Ctrl_Grp + Treatment + Status + Residual
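As a concrete illustration of the gene-by-gene formulation, the sketch below fits this model to one gene with ordinary least squares via statsmodels. The cited approaches use mixed models with random effects, so treating all terms as fixed here is a simplification, and the column names (y, ctrl_grp, treatment, status) are hypothetical.

```python
# Hedged sketch: per-gene ANOVA with all terms fixed (the cited papers use
# mixed models; fixed effects are used here purely for illustration).
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def fit_one_gene(df):
    """df: one row per array with hypothetical columns
    y (log expression), ctrl_grp, treatment, status."""
    fit = smf.ols("y ~ C(ctrl_grp) + C(treatment) + C(status)", data=df).fit()
    return anova_lm(fit, typ=2)   # per-factor sums of squares for this gene
```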

Principal Variance Components Analysis
PCA, the foundation of PVCA:
- Works on the variance-covariance matrix
- Principal components are mutually orthogonal
VCA:
- Fits a mixed model to each individual principal component
- Weights the variance proportions by the eigenvalues
See Li et al. 2009, www.wiley.com/wileycda/wileytitle/productcd-0470741384.html

PVCA - PCA
Projection onto the eigenvectors of the variance-covariance matrix Σ:

    ΣP = Σ[P_1, P_2, ..., P_n] = [ΣP_1, ΣP_2, ..., ΣP_n] = [λ_1 P_1, λ_2 P_2, ..., λ_n P_n]

The eigenvalue λ_i represents the variance associated with the i-th principal component.
[Figure: data projected onto P_1 and P_2, with variances λ_1 and λ_2]
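The eigenvector identity above is easy to check numerically. A minimal NumPy sketch on toy data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))               # 20 arrays x 5 "genes" (toy data)
Sigma = np.cov(X, rowvar=False)            # 5 x 5 variance-covariance matrix

# eigh handles symmetric matrices; eigenvalues come back in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

# Verify Sigma P_i = lambda_i P_i for the leading component
P1, lam1 = eigvecs[:, 0], eigvals[0]
assert np.allclose(Sigma @ P1, lam1 * P1)

# lambda_i / sum(lambda) = proportion of variance on the i-th component
print(eigvals / eigvals.sum())
```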

PVCA Analysis Flow Chart
1. Run PCA on the covariance/correlation matrix and retain the eigenvalues.
2. Select the first few principal components.
3. Estimate variance components by fitting a mixed model to each selected principal component.
4. Compute the eigenvalue-weighted average of the variance component estimates obtained from each model.
5. Standardize all the averaged estimates, including the model residuals, and present them as proportions of total variance.
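A minimal end-to-end sketch of this flow, assuming an arrays-by-genes matrix X and a pandas DataFrame design with one factor per column. The published method estimates variance components with a REML mixed model treating all factors as random; the sketch substitutes a method-of-moments one-way estimator per factor, so it approximates rather than reproduces the PVCA software.

```python
import numpy as np
import pandas as pd

def one_way_vc(y, labels):
    """Method-of-moments variance component for one random factor
    (balanced-design approximation): sigma2 = (MSB - MSW) / n0."""
    g = pd.Series(np.asarray(y)).groupby(np.asarray(labels))
    msw = g.var(ddof=1).mean()            # within-level mean square
    n0 = g.size().mean()                  # average number of arrays per level
    msb = n0 * g.mean().var(ddof=1)       # between-level mean square
    return max((msb - msw) / n0, 0.0), msw

def pvca(X, design, n_pcs=3):
    """X: arrays x genes; design: DataFrame with one factor per column."""
    Xc = X - X.mean(axis=0)                          # center each gene
    U, s, _ = np.linalg.svd(Xc, full_matrices=False) # PCA via SVD
    eigvals = s**2 / (X.shape[0] - 1)                # step 1: eigenvalues
    scores = U[:, :n_pcs] * s[:n_pcs]                # step 2: first few PCs
    w = eigvals[:n_pcs] / eigvals[:n_pcs].sum()      # eigenvalue weights
    vcs = {f: 0.0 for f in design.columns}
    vcs["Residual"] = 0.0
    for j in range(n_pcs):                           # steps 3-4
        resids = []
        for f in design.columns:
            vc, msw = one_way_vc(scores[:, j], design[f])
            vcs[f] += w[j] * vc
            resids.append(msw)
        vcs["Residual"] += w[j] * float(np.mean(resids))
    total = sum(vcs.values())                        # step 5: standardize
    return {k: v / total for k, v in vcs.items()}
```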

Mouse Lung Tumorigenicity Data
- Goal: develop microarray-based biomarkers that predict the outcome of a rodent cancer bioassay from a short-term chemical exposure (NTP 1996)
- Data from www.ncbi.nlm.nih.gov (GSE17933)
- 26 chemicals from 4 studies over 3 years (2005-2007)
- 158 Affymetrix Mouse Genome 430 2.0 GeneChips
- One of the six MAQC-II datasets

Mouse Lung Tumorigenicity Data: Batches

Batch | Animal Study ID | Animal Batch | Vehicle Group (a) | Controls | CA | NCA | Compounds
1  | 5003 | 1 | FCON | 1 | 1 | 2 | 4
2  | 5003 | 1 | CCON | 1 | 1 | 0 | 2
3  | 6004 | 1 | FCO2 | 1 | 1 | 3 | 5
4  | 6004 | 2 | CCO2 | 1 | 2 | 0 | 3
5  | 6004 | 3 | CCO3 | 1 | 1 | 0 | 2
6  | 6004 | 4 | WCO1 | 1 | 1 | 1 | 3
7  | 6014 | 1 | CCO4 | 1 | 0 | 2 | 3
8  | 6014 | 2 | FCO1 | 1 | 2 | 0 | 3
9  | 6014 | 3 | ACO1 | 1 | 0 | 2 | 3
10 | 7008 | 1 | CCO5 | 1 | 0 | 1 | 2
11 | 7008 | 2 | ACO3 | 1 | 1 | 1 | 3
12 | 7008 | 3 | CCO6 | 1 | 1 | 0 | 2
13 | 7008 | 4 | ACO4 | 1 | 2 | 0 | 3
14 | 7008 | 5 | ACO5 | 1 | 1 | 0 | 2

(a) The first three letters of the vehicle group define the vehicle and route of administration (FCO = food; CCO = corn oil by gavage; WCO = water by gavage; ACO = air by inhalation); the last number or letter identifies the specific batch in which the vehicle was used. CA = carcinogenicity.

PVCA Results
Model: Y = Crspding_Ctrl_Grp + Study_ID + Treatment + Status + Residual
Averaged across 12 principal components, which capture 80.95% of total variation.

Grouped Batch Profile (GBP) Normalization
- Batch profile estimation: based on control arrays
- Batch profile grouping: clustering of similar profiles
- Batch profile correction
- Batch profile scoring

GBP Batch Profile Estimation
Estimation is based on control arrays, assuming an additive model:

    Y = α(s) + β(b) + e

For control arrays: Y_c = α(c) + β(b) + e. For simplicity, Y_ci = µ_c + β_i + e, so the estimated batch profile is the per-gene estimate of β_i from batch i's control arrays.
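Under the simplified model Y_ci = µ_c + β_i + e, a natural estimate of batch i's profile is, gene by gene, the mean of that batch's control arrays minus the grand mean over all control arrays. A minimal sketch (variable names are illustrative):

```python
import numpy as np

def estimate_batch_profiles(ctrl, batch):
    """ctrl: control arrays x genes; batch: batch label per control array.
    Returns a dict mapping batch -> per-gene profile estimate of beta_i."""
    batch = np.asarray(batch)
    grand = ctrl.mean(axis=0)              # per-gene estimate of mu_c
    return {b: ctrl[batch == b].mean(axis=0) - grand
            for b in np.unique(batch)}
```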

GBP Profile Clusters
Clustering on standardized profiles.
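The slides do not name a clustering algorithm; one plausible choice, sketched below, standardizes each profile and applies average-linkage hierarchical clustering on correlation distance via scipy.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_profiles(profiles, n_clusters):
    """profiles: dict of batch -> per-gene profile; returns batch -> cluster id."""
    names = sorted(profiles)
    P = np.array([profiles[b] for b in names])
    # Standardize each profile to zero mean and unit variance
    Z = (P - P.mean(axis=1, keepdims=True)) / P.std(axis=1, keepdims=True)
    tree = linkage(Z, method="average", metric="correlation")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return dict(zip(names, labels))
```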

GBP Batch Profile Correction
- Extract the mean profile of each cluster as its representative profile
- Subtract the corresponding representative profile from the data
- Score new batches against the representative profiles
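A sketch of the correction step built on the pieces above: each cluster's mean profile serves as its representative, and that representative is subtracted from every array whose batch falls in the cluster. Scoring a new batch by best correlation with a representative is an assumption; the slides do not spell out the scoring rule.

```python
import numpy as np

def gbp_correct(X, batch, profiles, clusters):
    """X: arrays x genes; batch: batch per array; clusters: batch -> cluster id."""
    reps = {c: np.mean([profiles[b] for b in clusters if clusters[b] == c], axis=0)
            for c in set(clusters.values())}        # representative profiles
    Xc = X.astype(float).copy()
    for i, b in enumerate(batch):
        Xc[i] -= reps[clusters[b]]                  # remove the batch profile
    return Xc, reps

def score_new_batch(profile, reps):
    """Assumed rule: assign the new batch to the best-correlated representative."""
    return max(reps, key=lambda c: np.corrcoef(profile, reps[c])[0, 1])
```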

PVCA GBP Results
[Figures: variance-component proportions before and after GBP normalization]
Averaged across 19 principal components, which capture 80.96% of total variation.

Prediction of Carcinogenicity
- 84 prediction models from 8 model families (discriminant analysis, distance scoring, general linear model, k-nearest neighbors, logistic regression, partial least squares, partition trees, and radial basis machine)
- Randomly hold out 3 of the 14 batches as validation data
- Repeat 50 times and assess Accuracy, AUC, and RMSE
- Applied to 3 normalized datasets (MMA, RMA, GBP)
- See Chu et al. 2009 for details
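A sketch of the hold-out scheme described above, with model fitting left abstract: fit_predict is a placeholder for any of the 84 models and is assumed to return P(carcinogen) for the validation arrays; the metrics use scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score

def batch_cv(X, y, batch, fit_predict, n_repeats=50, n_holdout=3, seed=0):
    """Hold out n_holdout batches at random, repeat, and average the metrics."""
    rng = np.random.default_rng(seed)
    batches = np.unique(batch)
    results = []
    for _ in range(n_repeats):
        held = rng.choice(batches, size=n_holdout, replace=False)
        test = np.isin(batch, held)
        prob = fit_predict(X[~test], y[~test], X[test])   # P(carcinogen)
        pred = (prob >= 0.5).astype(int)
        results.append((accuracy_score(y[test], pred),
                        roc_auc_score(y[test], prob),
                        np.sqrt(mean_squared_error(y[test], prob))))
    return np.mean(results, axis=0)   # mean Accuracy, AUC, RMSE
```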

Cross Validation Results

Discussion
- PVCA estimates variance components among arrays without discarding the covariance between genes
- PVCA can be a useful tool for exploring data
- PVCA can serve as an assessment tool for normalization, quantifying how much variation changes across variance components
- R and SAS code is available at www.niehs.nih.gov/research/resources/software/pvca

Acknowledgements
- Jianying Li (UNC Chapel Hill)
- Pierre Bushel (NIEHS)
- SAS colleagues: Russ Wolfinger, Wenjun Bao, Li Li, Shannon Conners