Data Preprocessing

Normalization: the process of removing sample-to-sample variations in the measurements that are not due to differential gene expression, bringing measurements from the different hybridizations to a common and convenient scale.

Filtering: elimination of variables/genes whose expression variability across samples is below the instrument precision.

Sources of variation in the data

Interesting variation: e.g., differentially expressed genes between disease and normal tissues.

Obscuring variation:
- Technical: sample preparation (RNA extraction), manufacture of the arrays, processing of the arrays (IVT), temperature in the lab, instrument (scanner) precision, different platforms.
- Biological: different growth conditions, heterogeneity of samples, stochastic nature of biology.

Steps that introduce variation (noise)

[Figure: processing steps, starting from tissues, that introduce noise. Source: N Engl J Med, 354: 2463]

Measuring variations

Analysis of duplicated/replicated experiments at different levels can be used to assess the different sources of variation.
- Biological replicates: samples from the same biological state.
- Technical replicates: a single sample split into several parts. The split can be made at different stages of the protocol.

Biological variation >> technical variation.

cDNA Normalization

cDNA and oligonucleotide arrays have different normalization needs because they have different sources of noise.

cDNA: since the two dyes (Cy3 and Cy5) are unbalanced (different efficiency), each channel is normalized separately and the channels are then combined.

Normalization methods for the two technologies share similar concepts, but tools are often dedicated to a single technology.

Input: raw signal matrix

CEL files (probe-level data) -> signal extraction from single experiments (e.g., using MAS5) -> matrix of genes (probesets) g_1, g_2, ..., g_m by experiments e_1, e_2, ..., e_n -> normalization

Data acquisition and preprocessing are often united

Data acquisition: microarray processing.
Data preprocessing: scaling, normalization, filtering.

Common signal extraction methods such as dChip, RMA (and GCRMA) combine signal extraction and preprocessing.

Pros: normalization can be done at the probe level; statistics on a set of samples can be used to identify outlier probes.
Cons: generates a dependency between the samples. Example: adding or removing samples requires rerunning the signal extraction step.

Scaling

Common sources of variation yield readings at different scales.
- Can be caused by hybridizing different amounts of RNA, different efficiency of the labels, or different scanning conditions.
- The distortion can be non-linear (e.g., due to saturation effects).

Note: cDNA chips often encounter stronger non-linear distortions.

Example

10 biological replicates: MCF7 cells grown in 0.1% DMSO, taken from the Connectivity Map experiment (Lamb, Science 2006), run on Affymetrix U133A chips. Signal extracted with MAS5.

[Figure: histogram of mean expression across the replicates, annotated with the fold difference between extremes]

Comparing samples using a scatter plot

[Figure: scatter plot of sample #7 against sample #5]

Normalization/Scaling methods

- Linear scaling
- Invariant set normalization
- Quantile normalization
- cDNA: non-linear scaling (loess, splines, etc.)

Normalization methods: linear scaling

Assumption: same overall chip intensity across samples.
Transformation: fit a linear relationship with zero intercept, so that the scaled sample follows y = x against the reference.
Reference sample: typically the one with the median mean value.

$$x'_i = x_i \, \frac{Y}{X}, \quad i = 1, 2, \ldots, p, \qquad X = \operatorname{mean}(x), \quad Y = \operatorname{mean}(y)$$

[Figure: scatter plots against the reference sample, before and after scaling, with the line y = x]

Normalization methods: Invariant Set Normalization (used by dChip)

Assumption: many genes are unchanged and the transformation is linear.
Method:
- Identify genes whose ranks are relatively constant (e.g., std. of rank < 10).
- Use the mean of these genes to linearly scale the samples.
- Repeat several times, until convergence.
Used by dChip at the probe level.
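The linear scaling transformation above (x'_i = x_i Y / X) can be sketched in a few lines. This is a minimal illustration, not the implementation used by any particular tool; the function name `linear_scale`, the toy data, and the tie-breaking rule for picking the reference (the sample whose mean is closest to the median of the sample means) are assumptions made for the example.

```python
from statistics import mean, median

def linear_scale(samples):
    """Scale every sample so that its mean matches the reference sample's mean.

    samples: list of equal-length lists of intensities (one list per array).
    Returns (scaled_samples, reference_index).
    """
    means = [mean(s) for s in samples]
    # Reference: the sample whose mean is closest to the median of all means
    # (assumption: this is how "median mean value" is resolved for even counts).
    target = median(means)
    ref = min(range(len(samples)), key=lambda k: abs(means[k] - target))
    Y = means[ref]
    # x'_i = x_i * Y / X, where X is the mean of the sample being scaled.
    scaled = [[x_i * Y / X for x_i in s] for s, X in zip(samples, means)]
    return scaled, ref

# Hypothetical toy data: three arrays measured at different overall scales.
samples = [[100.0, 200.0, 300.0], [50.0, 100.0, 150.0], [400.0, 800.0, 1200.0]]
scaled, ref = linear_scale(samples)
```

After scaling, all samples share the mean of the reference sample, while the ratios between genes within each sample are unchanged.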

Normalization methods: quantile normalization (used by RMA)

Assumption: measured expression grows monotonically with the true level of expression.
Method (RMA applies quantile normalization at the probe level):
- Transform the data so that the quantile-quantile plot for any two arrays is the straight identity line.
- Take the mean quantile (across samples) and substitute it as the value of the data item in the original dataset.

[Figure: probes x samples matrix, color corresponds to rank: sort by column -> replace with row averages -> rearrange in original order]

MAS5: Std(log2) vs. Mean(log2)

Log2 transformation: noise is nearly constant at all levels.

[Figure: std of log2 expression vs. mean of log2 expression, MAS5]
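The three steps of quantile normalization described above (sort by column, replace with row averages, rearrange in original order) can be sketched as follows. This is an illustrative pure-Python version, not RMA's probe-level implementation; the function name and the handling of ties (broken by position) are assumptions.

```python
def quantile_normalize(columns):
    """columns: list of samples, each a list of p values.
    Returns new lists in which every sample has the same distribution."""
    p = len(columns[0])
    n = len(columns)
    # 1. Sort each sample (column) independently.
    sorted_cols = [sorted(col) for col in columns]
    # 2. Average each quantile (row of the sorted matrix) across samples.
    mean_quantiles = [sum(sc[r] for sc in sorted_cols) / n for r in range(p)]
    # 3. Put the mean quantile back at the original position of each value.
    normalized = []
    for col in columns:
        order = sorted(range(p), key=lambda i: col[i])  # indices, smallest value first
        out = [0.0] * p
        for rank, i in enumerate(order):
            out[i] = mean_quantiles[rank]
        normalized.append(out)
    return normalized

# Hypothetical toy data: two samples, three probes.
columns = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(columns)
```

Every normalized sample contains exactly the same set of values, so the quantile-quantile plot of any two samples lies on the identity line, while the rank order within each sample is preserved.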

RMA: Std(log2) vs. Mean(log2)

RMA: less variation, more constant across values.

[Figure: std of log2 expression vs. mean of log2 expression, RMA]

Normalization assumptions summary

- Same overall intensity (or same distribution) for different arrays.
- Measured expression grows linearly with the true level of expression (or at least monotonically).
- Gene-specific noise is multiplicative (additive on the log scale). The log2 transform makes the noise independent of the mean.

We typically use RMA or MAS5.

Danger: make sure these assumptions hold in your experiment. For example, stem cells have higher overall expression than differentiated cells.
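A short worked derivation shows why the log2 transform makes multiplicative noise independent of the mean. The specific noise model used here (a factor $e^{\varepsilon}$ with $\varepsilon$ independent of the true level $\mu$) is an assumption chosen for illustration, not something stated on the slides.

```latex
x = \mu \, e^{\varepsilon}, \quad \operatorname{Var}(\varepsilon) = s^2
\;\Longrightarrow\;
\log_2 x = \log_2 \mu + \frac{\varepsilon}{\ln 2},
\qquad
\operatorname{Std}(\log_2 x) = \frac{s}{\ln 2}
```

Under this model the standard deviation on the log2 scale does not depend on $\mu$, which matches the roughly flat Std(log2) vs. Mean(log2) curves.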

Questions

Why filtering?

Small N, large P implies vulnerability to overfitting (modeling noise).
Try to reduce the number of hypotheses (and therefore the number of false negatives).

Filtering methods

- Variation filters based on a simple threshold: select only genes that vary more than a given minimum (e.g., genes with $s^2 > \tau_s$, or $\mathrm{MAD} > \tau_{\mathrm{MAD}}$, or $\mathrm{CV} > \tau_{\mathrm{CV}}$, etc.).
- Variation filters based on a noise envelope: define the noise envelope based on replicates, then select genes whose variation is larger than the envelope.
- Gene selection based on reproducibility: requires duplicates.

[Figure: std vs. mean of log2 expression, for RMA and MAS5]
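The simple-threshold variation filters listed above can be sketched as below. The function names, the toy expression values, and the thresholds are hypothetical; the sample variance (n - 1 denominator) and the unscaled median absolute deviation are assumptions, since the slides do not specify the exact estimators.

```python
from statistics import mean, median, stdev, variance

def variation_stats(values):
    """Return the three variation measures named on the slide for one gene."""
    med = median(values)
    mad = median([abs(v - med) for v in values])  # unscaled median absolute deviation
    return {
        "var": variance(values),            # s^2, sample variance
        "mad": mad,
        "cv": stdev(values) / mean(values)  # coefficient of variation (needs mean != 0)
    }

def variation_filter(expr, stat, threshold):
    """expr: dict gene -> list of expression values across samples.
    Keep the genes whose chosen statistic exceeds the threshold."""
    return [g for g, values in expr.items() if variation_stats(values)[stat] > threshold]

# Hypothetical toy data: one flat gene, one strongly varying, one mildly varying.
expr = {
    "g1": [10.0, 10.0, 10.0, 10.0],
    "g2": [5.0, 15.0, 5.0, 15.0],
    "g3": [9.0, 10.0, 11.0, 10.0],
}
kept_by_variance = variation_filter(expr, "var", 1.0)
kept_by_mad = variation_filter(expr, "mad", 0.4)
```

Which statistic and threshold are appropriate depends on the platform and on whether the data are on the raw or log scale.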

Filtering based on noise envelope

- Estimate the noise envelope based on replicate data (here: a set of 14 Stratagene samples):

$$\log(\sigma_g) = \alpha + \beta \log(\mu_g) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma)$$

- Compute the gene-specific mean and stdev on the data to be filtered.
- Super-impose the envelope on the data to be filtered and keep the genes that lie above it (95% envelope):

$$\log\left(\sigma_g^{(\mathrm{new})}\right) \overset{?}{>} \hat{\alpha} + \hat{\beta} \log\left(\mu_g^{(\mathrm{new})}\right)$$
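A minimal sketch of the noise-envelope idea follows: fit log(sd) against log(mean) on replicate data by ordinary least squares, then keep new genes whose log(sd) lies above the fitted line plus a margin. The function names and toy data are hypothetical, and the way the 95% level is applied here (a one-sided normal quantile z = 1.645 times the residual standard deviation) is an assumption, because the slide does not say how the envelope level is computed.

```python
import math
from statistics import mean, stdev

def fit_envelope(replicates):
    """replicates: list of per-gene lists of replicate measurements (positive values).
    Least-squares fit of log(sd) = alpha + beta * log(mean).
    Returns (alpha, beta, residual_sd)."""
    xs, ys = [], []
    for values in replicates:
        mu, sd = mean(values), stdev(values)
        if mu > 0 and sd > 0:               # the log is undefined otherwise
            xs.append(math.log(mu))
            ys.append(math.log(sd))
    xbar, ybar = mean(xs), mean(ys)
    beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))
    alpha = ybar - beta * xbar
    resid = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
    resid_sd = math.sqrt(sum(r * r for r in resid) / (len(xs) - 2))
    return alpha, beta, resid_sd

def passes_envelope(values, alpha, beta, resid_sd, z=1.645):
    """True if the gene varies more than the noise envelope predicts.
    z = 1.645 is an assumed one-sided 95% normal quantile."""
    mu, sd = mean(values), stdev(values)
    if mu <= 0 or sd <= 0:
        return False
    return math.log(sd) > alpha + beta * math.log(mu) + z * resid_sd

# Hypothetical replicate data in which the noise is exactly 10% of the mean.
replicates = [[m * 0.9, m * 1.0, m * 1.1] for m in (10.0, 100.0, 1000.0, 10000.0)]
alpha, beta, resid_sd = fit_envelope(replicates)
variable_gene = passes_envelope([50.0, 100.0, 150.0], alpha, beta, resid_sd)
quiet_gene = passes_envelope([99.0, 100.0, 101.0], alpha, beta, resid_sd)
```

In this toy case the fitted slope is close to 1, the gene with a standard deviation of 50 around a mean of 100 passes, and the gene with a standard deviation of 1 does not.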

Filtering based on duplicates

- Look at duplicates (sample pairs).
- Select genes whose expression across duplicates correlates best.
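One way to make "correlates best" concrete is to compute, for each gene, the Pearson correlation between its two duplicate profiles across experiments and rank genes by that score. The choice of Pearson correlation, the function names, and the toy data are assumptions for illustration only.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length lists (neither may be constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def rank_by_duplicate_correlation(genes):
    """genes: dict gene -> (duplicate 1 values, duplicate 2 values) across experiments.
    Returns (gene names sorted from most to least reproducible, scores)."""
    scores = {name: pearson(d1, d2) for name, (d1, d2) in genes.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores

# Hypothetical toy data: four experiments, two duplicates per gene.
genes = {
    "bad": ([1.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0]),
    "good": ([1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2]),
}
ranking, scores = rank_by_duplicate_correlation(genes)
```

Correlation alone ignores how much a gene varies between experiments, which is why the slides go on to combine correlation and spread in a single statistic.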

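Filtering based on reproducibility

Given gene i (duplicate pair), i = 1, ..., n:

                        experiment 1   experiment 2   ...   experiment N
gene i (duplicate 1)    g_11           g_12           ...   g_1N
gene i (duplicate 2)    g_21           g_22           ...   g_2N

[Figure: duplicate 2 plotted against duplicate 1, one point per experiment. Correlation: good vs. bad. Spread: good vs. bad.]

F statistic: maximizing correlation and spread

$$F = \frac{B}{W} \overset{?}{>} 1$$

Between-groups variation:

$$B = \frac{\sum_{i \in \mathrm{Groups}} \sum_{j \in \mathrm{Group}\ i} (\bar{g}_i - \bar{g})^2}{\#\mathrm{Groups} - 1}$$

Within-group variation:

$$W = \frac{\sum_{i \in \mathrm{Groups}} \sum_{j \in \mathrm{Group}\ i} (g_{ij} - \bar{g}_i)^2}{\#\mathrm{Samples} - \#\mathrm{Groups}}$$

where $\bar{g}$ is the overall mean and $\bar{g}_i$ is the mean of group i. In these formulas i indexes the group and j the measurements within it.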
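The F statistic above (between-groups variation B divided by within-group variation W) can be computed directly from its definition. In this sketch each group is taken to be one experiment and its members are the duplicate measurements of the gene; that grouping, the function name, and the toy data are assumptions made for illustration.

```python
from statistics import mean

def f_statistic(groups):
    """groups: list of groups, each a list of measurements of one gene
    (e.g., the duplicates within one experiment). Returns F = B / W.
    W must be non-zero, i.e., the duplicates must not all agree exactly."""
    all_values = [v for g in groups for v in g]
    overall = mean(all_values)                      # overall mean
    group_means = [mean(g) for g in groups]         # group means
    # Between-groups variation: every member contributes its group's deviation.
    between = sum(len(g) * (gm - overall) ** 2
                  for g, gm in zip(groups, group_means)) / (len(groups) - 1)
    # Within-group variation: deviation of each member from its group mean.
    within = sum((v - gm) ** 2
                 for g, gm in zip(groups, group_means)
                 for v in g) / (len(all_values) - len(groups))
    return between / within

# Hypothetical toy data: three experiments, two duplicates each.
reproducible = [[1.0, 1.2], [5.0, 5.2], [9.0, 9.2]]   # duplicates agree, experiments differ
noisy = [[1.0, 9.0], [5.0, 5.2], [9.0, 1.0]]          # duplicates disagree
f_good = f_statistic(reproducible)
f_bad = f_statistic(noisy)
```

A gene whose duplicates agree while its level changes between experiments gets F far above 1; a gene whose duplicates disagree gets F below 1 and would be filtered out.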

Best/worst markers

[Figure: best and worst markers, DLBCL dataset [Blood, 105(5): ]]

Questions

Visualization

All steps can benefit from visualization.

Visualization

- Heat map
- PCA
- SVD
- MDS
- NMF (discussed in the clustering session)

Visualization of E: Heatmap

Raw data (RMA, log2 scale): $E = (e_{ij})$.

[Figure: heatmap of the 500 most variable genes (clustered); samples grouped as ALL, MLL, AML]

Leukemia data from Armstrong et al., Nat. Genet. (2002).

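AML/MLL/ALL Heatmap

Row-centered and normalized data: $X = (x_{ij})$, an $m \times n$ matrix with

$$x_{ij} = \frac{e_{ij} - \mu_i}{\sigma_i}$$

[Figure: heatmap of the 500 most variable genes (clustered); samples grouped as ALL, MLL, AML]

Pro: differential expression is clearly visible.
Con: absolute values are lost.

AML/MLL/ALL 3D Heatmap

[Figure: 3D heatmap in which both the z-axis and the color show raw values; samples grouped as ALL, MLL, AML]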
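The row centering and normalization used for the heatmap, x_ij = (e_ij - mu_i) / sigma_i, can be sketched as below. The function name and toy matrix are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the slide does not say which estimator is used.

```python
from statistics import mean, stdev

def standardize_rows(E):
    """E: list of rows (genes), each a list of expression values across samples.
    Returns X with x_ij = (e_ij - mu_i) / sigma_i for every row i.
    Rows with zero spread must be removed first (sigma_i would be 0)."""
    X = []
    for row in E:
        mu, sigma = mean(row), stdev(row)
        X.append([(e - mu) / sigma for e in row])
    return X

# Hypothetical toy data: two genes with the same pattern at different absolute levels.
E = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
X = standardize_rows(E)
```

Both rows map to the same standardized profile, which illustrates the trade-off noted above: the pattern of differential expression is easy to compare, but the absolute level is no longer visible.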

AML/MLL/ALL 3D Heatmap

[Figure: 3D heatmap in which the z-axis shows raw values and the color shows row-centered and normalized values; samples grouped as ALL, MLL, AML]

Dimensionality Reduction

Our brain is a good pattern recognition tool.
Problem: we are used to handling only 2 (or 3) dimensions.
Solution: dimensionality reduction.

Visualization of genes or samples

Aim: given $x_{1:n} \in \mathbb{R}^d$, find $y_{1:n} \in \mathbb{R}^k$ (typically $k \ll d$) that capture some properties of $x_{1:n}$.

Methods:
- Principal Component Analysis (PCA): projection onto a low-dimensional hyper-plane.
- Singular Value Decomposition (SVD): approximate a matrix by a sum of simple matrices.
- Multi-dimensional Scaling (MDS): map points such that distances are preserved.
- Graph layout methods (e.g., springs and charges).
- Independent Component Analysis (ICA).
- Projection pursuit.

Caution: our brain also tends to find patterns in random data (over-fitting).

Principal Component Analysis (PCA)

Aim: find a low k-dimensional hyper-plane on which the variation of the projected data is maximal.

$$V = (v_1 \; v_2), \qquad y_i = V^T (x_i - \mu), \qquad \mu = \frac{1}{n} \sum_i x_i, \qquad J(V) = \sigma_1^2 + \sigma_2^2$$

where $\sigma_j^2$ is the variance of the projected data along $v_j$.

Objective: find V that maximizes J(V).
Equivalent to: find the low k-dimensional hyper-plane such that the projected data best approximate the original data.

Reminder (matrix multiplication): $A_{n \times m} B_{m \times k} = C_{n \times k}$, with $c_{ij} = \sum_{\alpha=1}^{m} a_{i\alpha} b_{\alpha j}$.
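The PCA objective above can be illustrated with NumPy by diagonalizing the covariance matrix of the centered data. This is a sketch under stated assumptions: the function name and the synthetic data are hypothetical, and the covariance is divided by n (rather than n - 1) so that it matches the definition of the mean used on the slide.

```python
import numpy as np

def pca(X, k):
    """X: (n, d) array with one data point x_i per row.
    Returns (V, Y, variances): V is d x k with the principal directions as columns,
    Y holds the projected points y_i = V^T (x_i - mu) as rows, and variances are
    the captured variances sigma_1^2 >= ... >= sigma_k^2."""
    mu = X.mean(axis=0)
    Xc = X - mu                                 # centered data
    cov = Xc.T @ Xc / X.shape[0]                # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    V = eigvecs[:, order]
    Y = Xc @ V
    return V, Y, eigvals[order]

# Synthetic data (assumption): 200 points in 3D, stretched along the first axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])
V, Y, variances = pca(X, k=2)
J = variances.sum()   # J(V) = sigma_1^2 + sigma_2^2
```

The variance of each column of Y equals the corresponding eigenvalue, so summing the top k eigenvalues gives the maximal value of J(V).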

Incremental Building of the k Principal Components

Algorithm (not the one actually used):
- Loop i from 1 to k:
  - Find the direction $v_i$ along which the variance $\sigma_i^2$ is maximal.
  - Remove from each point its projection on $v_i$.

Result: the principal components $V = (v_1, \ldots, v_k)$, the captured variances $\{\sigma_i^2\}_{1:k}$, and the projected data $y_i = V^T (x_i - \mu)$.

The fraction of variance captured by the principal components, $c_k$, measures how well the projected data approximate the original data:

$$s_k = \sum_{i=1}^{k} \sigma_i^2, \qquad c_k = \frac{s_k}{s_d}$$

[Figure: captured variance $c_k$ vs. k = number of PCs]

PCA of leukemia samples

[Figure: input: genes x samples expression matrix; output: samples (AML, MLL, ALL) plotted along $v_1$, $v_2$, $v_3$]
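The incremental algorithm above can be sketched with NumPy. Finding "the direction of maximal variance" is done here by power iteration on the covariance matrix, which is one possible choice and an assumption of this sketch; the function names, the iteration count, and the synthetic data are likewise hypothetical.

```python
import numpy as np

def incremental_pca(X, k, n_iter=500, seed=0):
    """Build k principal components one at a time.
    Returns (V, variances): directions as columns of V, captured variances sigma_i^2."""
    rng = np.random.default_rng(seed)
    R = X - X.mean(axis=0)                  # residual data, starts as the centered data
    n, d = R.shape
    directions, variances = [], []
    for _ in range(k):
        cov = R.T @ R / n
        v = rng.normal(size=d)
        v /= np.linalg.norm(v)
        for _ in range(n_iter):             # power iteration: converges to the top eigenvector
            v = cov @ v
            v /= np.linalg.norm(v)
        directions.append(v)
        variances.append(float(v @ cov @ v))
        R = R - np.outer(R @ v, v)          # remove from each point its projection on v
    return np.column_stack(directions), np.array(variances)

def captured_fraction(variances, X):
    """c_k = s_k / s_d, where s_d is the total variance of the data."""
    s_d = (X - X.mean(axis=0)).var(axis=0).sum()
    return np.cumsum(variances) / s_d

# Synthetic data (assumption): 200 points in 3D with clearly separated variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])
V_inc, var_inc = incremental_pca(X, k=2)
c = captured_fraction(var_inc, X)
```

Power iteration converges slowly when two variances are nearly equal, which is one reason production code diagonalizes the covariance matrix or uses an SVD instead.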

Pitfalls of PCA

- Largest variance is not necessarily most informative. Example: "2 pancakes" [Figure: the interesting direction differs from the direction with largest variance].
- Structure in the low-dimensional space implies that there is structure in the full space, but NOT vice versa.

Singular Value Decomposition (SVD)

Aim: find the best approximation for E by a sum of K rank-1 matrices (meta-sample x meta-gene).

[Figure: leukemia expression matrix (ALL, MLL, AML) shown as SVD #1 + SVD #2 + SVD #3]

SVD of leukemia data

[Figure: leukemia expression matrix (AML, ALL, MLL) decomposed as $s_1 v_1 u_1^T + s_2 v_2 u_2^T + s_3 v_3 u_3^T$]

Singular Value Decomposition (SVD)

Aim: find the best approximation for E by a sum of K rank-1 matrices (meta-sample x meta-gene):

$$E \approx \sum_{i=1}^{K} s_i \, v_i u_i^T$$

where $\{v_i\}$, $\{u_i\}$ are orthogonal unit vectors.

Objective function:

$$J(\{s_i\}, \{v_i\}, \{u_i\}) = \sum_{ij} \left( e_{ij} - \sum_{\alpha=1}^{K} s_\alpha \, v_{i\alpha} u_{j\alpha} \right)^2$$

Method: unique solution based on diagonalizing $E E^T$.
Note: $\{v_i\}$ are the same as in PCA of the samples if the genes are centered.
Clustering: identify elements with large absolute value as members of clusters.
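The rank-K approximation of E described above can be sketched with `numpy.linalg.svd`. The function name and the random test matrix are hypothetical. Note the naming difference: NumPy returns factors conventionally called U, s, V^T, while the slides write each term as s_i v_i u_i^T, so in the code the left singular vectors play the role of v_i and the right singular vectors the role of u_i.

```python
import numpy as np

def rank_k_approximation(E, K):
    """Return (E_K, s): the sum of the first K rank-1 terms s_i * v_i u_i^T,
    and all singular values s (sorted in decreasing order)."""
    left, s, right_t = np.linalg.svd(E, full_matrices=False)
    # Column i of `left` is v_i; row i of `right_t` is u_i^T (slide notation).
    terms = [s[i] * np.outer(left[:, i], right_t[i]) for i in range(K)]
    return sum(terms), s

# Hypothetical data: a small random matrix standing in for an expression matrix.
rng = np.random.default_rng(0)
E = rng.normal(size=(6, 4))
E2, s = rank_k_approximation(E, K=2)
J = ((E - E2) ** 2).sum()   # value of the objective function for K = 2
```

For the optimal rank-K approximation, the objective J equals the sum of the squared singular values that were left out, which gives a quick check of how much of E the first K terms capture.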

Multi-dimensional Scaling (MDS)

Aim: find a low k-dimensional representation of the data that best preserves the distance matrix of the original data.

$$\delta_{ij} = \lVert x_i - x_j \rVert, \qquad d_{ij} = \lVert y_i - y_j \rVert$$

Objective: find $y_{1:n}$ that minimize $J(\delta_{ij}, d_{ij})$, where $J(\delta_{ij}, d_{ij})$ measures how well $d_{ij}$ approximates $\delta_{ij}$.
Method: gradient descent.

Objective functions

Different ways to measure similarity between distance matrices:

Emphasize large differences:

$$J_{ee} = \frac{\sum_{i<j} (d_{ij} - \delta_{ij})^2}{\sum_{i<j} \delta_{ij}^2}$$

Emphasize fractional differences:

$$J_{ff} = \sum_{i<j} \left( \frac{d_{ij} - \delta_{ij}}{\delta_{ij}} \right)^2$$
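The two MDS objective functions can be evaluated for any candidate embedding with a few lines of NumPy. The function names and the toy points are hypothetical, and Euclidean distance is assumed for both the original and the low-dimensional space.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distance matrix between the rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mds_objectives(X, Y):
    """X: original points (n, d); Y: candidate embedding (n, k).
    Returns (J_ee, J_ff). The original points must be distinct (delta_ij > 0)."""
    iu = np.triu_indices(len(X), k=1)       # all pairs with i < j
    delta = pairwise_distances(X)[iu]       # original distances
    d = pairwise_distances(Y)[iu]           # embedded distances
    j_ee = ((d - delta) ** 2).sum() / (delta ** 2).sum()
    j_ff = (((d - delta) / delta) ** 2).sum()
    return j_ee, j_ff

# Hypothetical toy case: two points at distance 5, embedded at distance 4.
X = np.array([[0.0, 0.0], [3.0, 4.0]])
Y = np.array([[0.0], [4.0]])
j_ee, j_ff = mds_objectives(X, Y)
```

J_ee is dominated by the pairs with the largest absolute errors, while J_ff weights every pair by its relative error, so short distances matter as much as long ones.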

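Gradient Descent

Aim: find the minimum of J(a).
Method:
- Init: $a^{(0)} \leftarrow$ a random position.
- Iterate: $a^{(t+1)} \leftarrow a^{(t)} - \eta \nabla J(a^{(t)})$.
- Stop: when $\lVert a^{(t+1)} - a^{(t)} \rVert < \varepsilon$ or $t > T$.

Gradient:

$$\nabla J(a) = \left. \begin{pmatrix} \partial J(x) / \partial x_1 \\ \partial J(x) / \partial x_2 \\ \vdots \\ \partial J(x) / \partial x_d \end{pmatrix} \right|_{x = a}$$

Problem: finds a local minimum that depends on the starting point (according to the basin of attraction).
See also: Newton's algorithm, conjugate gradient.

MDS for leukemia samples

Used $J_{ee}$ with Euclidean distance.

[Figure: input: expression matrix of the leukemia samples; output: low-dimensional MDS embedding of the samples]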
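Putting the gradient descent recipe together with the J_ee objective gives a small MDS sketch. This is an illustration under stated assumptions rather than the code behind the leukemia figure: the function name, the step size `eta`, the tolerance `eps`, the iteration cap `max_iter`, and the synthetic data are all hypothetical, and the step size generally needs tuning to the scale of the data.

```python
import numpy as np

def mds_gradient_descent(X, k=2, eta=1.0, eps=1e-9, max_iter=3000, seed=0):
    """Minimize J_ee by plain gradient descent.
    Returns (Y, history): the embedding and the value of J_ee at every iteration."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    delta = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(n, k=1)
    norm = (delta[iu] ** 2).sum()               # denominator of J_ee
    Y = rng.normal(size=(n, k))                 # Init: random position
    history = []
    for _ in range(max_iter):
        diff = Y[:, None, :] - Y[None, :, :]    # diff[i, j] = y_i - y_j
        d = np.linalg.norm(diff, axis=-1)
        history.append(((d[iu] - delta[iu]) ** 2).sum() / norm)
        np.fill_diagonal(d, 1.0)                # avoid 0/0 on the diagonal
        coef = (d - delta) / d
        np.fill_diagonal(coef, 0.0)             # a point exerts no force on itself
        # dJ/dy_i = (2 / norm) * sum_j (d_ij - delta_ij) * (y_i - y_j) / d_ij
        grad = 2.0 / norm * (coef[:, :, None] * diff).sum(axis=1)
        Y_new = Y - eta * grad                  # Iterate
        step = np.linalg.norm(Y_new - Y)
        Y = Y_new
        if step < eps:                          # Stop
            break
    return Y, history

# Synthetic data (assumption): 10 points in 5 dimensions, embedded in 2.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 5))
Y, history = mds_gradient_descent(X, k=2)
```

Running the function with different seeds can give different embeddings, which reflects the local-minimum problem noted above.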

MDS vs. PCA

                                      PCA                   MDS
Linear                                Yes                   Distorts space
Unique                                Yes                   Depends on initial configuration
Optimal                               Yes                   Converges to local minima
Preserves distances                   Only projected part   Yes (attempts to)
Captures high-dimensional structure   Missing dimensions    Potentially better


More information

Single gene analysis of differential expression. Giorgio Valentini

Single gene analysis of differential expression. Giorgio Valentini Single gene analysis of differential expression Giorgio Valentini valenti@disi.unige.it Comparing two conditions Each condition may be represented by one or more RNA samples. Using cdna microarrays, samples

More information

Example 1: Two-Treatment CRD

Example 1: Two-Treatment CRD Introduction to Mixed Linear Models in Microarray Experiments //0 Copyright 0 Dan Nettleton Statistical Models A statistical model describes a formal mathematical data generation mechanism from which an

More information

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD DATA MINING LECTURE 8 Dimensionality Reduction PCA -- SVD The curse of dimensionality Real data usually have thousands, or millions of dimensions E.g., web documents, where the dimensionality is the vocabulary

More information

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations. Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

Single gene analysis of differential expression

Single gene analysis of differential expression Single gene analysis of differential expression Giorgio Valentini DSI Dipartimento di Scienze dell Informazione Università degli Studi di Milano valentini@dsi.unimi.it Comparing two conditions Each condition

More information

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas Dimensionality Reduction: PCA Nicholas Ruozzi University of Texas at Dallas Eigenvalues λ is an eigenvalue of a matrix A R n n if the linear system Ax = λx has at least one non-zero solution If Ax = λx

More information

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation. ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent

More information

ANLP Lecture 22 Lexical Semantics with Dense Vectors

ANLP Lecture 22 Lexical Semantics with Dense Vectors ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Department of Bioinformatics and Computational Biology UT M. D. Anderson Cancer Center kabagg@mdanderson.org kcoombes@mdanderson.org

More information

A Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag

A Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag A Tutorial on Data Reduction Principal Component Analysis Theoretical Discussion By Shireen Elhabian and Aly Farag University of Louisville, CVIP Lab November 2008 PCA PCA is A backbone of modern data

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA

More information

Data Fitting and Uncertainty

Data Fitting and Uncertainty TiloStrutz Data Fitting and Uncertainty A practical introduction to weighted least squares and beyond With 124 figures, 23 tables and 71 test questions and examples VIEWEG+ TEUBNER IX Contents I Framework

More information

Unconstrained Ordination

Unconstrained Ordination Unconstrained Ordination Sites Species A Species B Species C Species D Species E 1 0 (1) 5 (1) 1 (1) 10 (4) 10 (4) 2 2 (3) 8 (3) 4 (3) 12 (6) 20 (6) 3 8 (6) 20 (6) 10 (6) 1 (2) 3 (2) 4 4 (5) 11 (5) 8 (5)

More information

Nonlinear Optimization for Optimal Control

Nonlinear Optimization for Optimal Control Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 6, Issue 1 2007 Article 28 A Comparison of Methods to Control Type I Errors in Microarray Studies Jinsong Chen Mark J. van der Laan Martyn

More information

Expression arrays, normalization, and error models

Expression arrays, normalization, and error models 1 Epression arrays, normalization, and error models There are a number of different array technologies available for measuring mrna transcript levels in cell populations, from spotted cdna arrays to in

More information

The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)

The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) Chapter 5 The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) 5.1 Basics of SVD 5.1.1 Review of Key Concepts We review some key definitions and results about matrices that will

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Anders Øland David Christiansen 1 Introduction Principal Component Analysis, or PCA, is a commonly used multi-purpose technique in data analysis. It can be used for feature

More information

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis .. December 20, 2013 Todays lecture. (PCA) (PLS-R) (LDA) . (PCA) is a method often used to reduce the dimension of a large dataset to one of a more manageble size. The new dataset can then be used to make

More information

Genomic Medicine HT 512. Data representation, transformation & modeling in genomics

Genomic Medicine HT 512. Data representation, transformation & modeling in genomics Harvard-MIT Division of Health Sciences and Technology HST.512: Genomic Medicine Prof. Alvin T.Kho Genomic Medicine HT 512 Data representation, transformation & modeling in genomics Lecture 11, Mar 18,

More information

Principal Components Analysis (PCA) and Singular Value Decomposition (SVD) with applications to Microarrays

Principal Components Analysis (PCA) and Singular Value Decomposition (SVD) with applications to Microarrays Principal Components Analysis (PCA) and Singular Value Decomposition (SVD) with applications to Microarrays Prof. Tesler Math 283 Fall 2015 Prof. Tesler Principal Components Analysis Math 283 / Fall 2015

More information

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012 Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Principal Components Analysis Le Song Lecture 22, Nov 13, 2012 Based on slides from Eric Xing, CMU Reading: Chap 12.1, CB book 1 2 Factor or Component

More information

Linear Regression Linear Regression with Shrinkage

Linear Regression Linear Regression with Shrinkage Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle

More information

Statistics Toolbox 6. Apply statistical algorithms and probability models

Statistics Toolbox 6. Apply statistical algorithms and probability models Statistics Toolbox 6 Apply statistical algorithms and probability models Statistics Toolbox provides engineers, scientists, researchers, financial analysts, and statisticians with a comprehensive set of

More information

Relational Nonlinear FIR Filters. Ronald K. Pearson

Relational Nonlinear FIR Filters. Ronald K. Pearson Relational Nonlinear FIR Filters Ronald K. Pearson Daniel Baugh Institute for Functional Genomics and Computational Biology Thomas Jefferson University Philadelphia, PA Moncef Gabbouj Institute of Signal

More information

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques. Broadly, these techniques can be used in data analysis and visualization

More information

Optimal normalization of DNA-microarray data

Optimal normalization of DNA-microarray data Optimal normalization of DNA-microarray data Daniel Faller 1, HD Dr. J. Timmer 1, Dr. H. U. Voss 1, Prof. Dr. Honerkamp 1 and Dr. U. Hobohm 2 1 Freiburg Center for Data Analysis and Modeling 1 F. Hoffman-La

More information

Intra- and inter- platform renormalization and analysis of microarray data from the NCBI GEO database

Intra- and inter- platform renormalization and analysis of microarray data from the NCBI GEO database Intra- and inter- platform renormalization and analysis of microarray data from the NCBI GEO database Kay A. Robbins 1, Cory Burkhardt 1 1 Department of Computer Science, University of Texas at San Antonio,

More information

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang. Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning

More information

Linear Classifiers as Pattern Detectors

Linear Classifiers as Pattern Detectors Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2014/2015 Lesson 16 8 April 2015 Contents Linear Classifiers as Pattern Detectors Notation...2 Linear

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA

Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA Expression analysis for RNA-seq data Ewa Szczurek Instytut Informatyki Uniwersytet Warszawski 1/35 The problem

More information

Error models and normalization. Wolfgang Huber DKFZ Heidelberg

Error models and normalization. Wolfgang Huber DKFZ Heidelberg Error models and normalization Wolfgang Huber DKFZ Heidelberg Acknowledgements Anja von Heydebreck, Martin Vingron Andreas Buness, Markus Ruschhaupt, Klaus Steiner, Jörg Schneider, Katharina Finis, Anke

More information

Seminar Microarray-Datenanalyse

Seminar Microarray-Datenanalyse Seminar Microarray- Normalization Hans-Ulrich Klein Christian Ruckert Institut für Medizinische Informatik WWU Münster SS 2011 Organisation 1 09.05.11 Normalisierung 2 10.05.11 Bestimmen diff. expr. Gene,

More information

Mathematical Tools for Neuroscience (NEU 314) Princeton University, Spring 2016 Jonathan Pillow. Homework 8: Logistic Regression & Information Theory

Mathematical Tools for Neuroscience (NEU 314) Princeton University, Spring 2016 Jonathan Pillow. Homework 8: Logistic Regression & Information Theory Mathematical Tools for Neuroscience (NEU 34) Princeton University, Spring 206 Jonathan Pillow Homework 8: Logistic Regression & Information Theory Due: Tuesday, April 26, 9:59am Optimization Toolbox One

More information

Vector Space Models. wine_spectral.r

Vector Space Models. wine_spectral.r Vector Space Models 137 wine_spectral.r Latent Semantic Analysis Problem with words Even a small vocabulary as in wine example is challenging LSA Reduce number of columns of DTM by principal components

More information