Compressed Sensing in Cancer Biology? (A Work in Progress)
M. Vidyasagar FRS, Cecil & Ida Green Chair, The University of Texas at Dallas
M.Vidyasagar@utdallas.edu, www.utdallas.edu/~m.vidyasagar
University of Cambridge, 23 November 2012
A Cautionary Statement
Most talks are finished products: they cover completed research. This talk is an attempt to share our current thinking. Like James Joyce's stream of consciousness, it represents work in progress. The questions that are currently occupying my mind are discussed; six months from now, some or all of these questions might be deemed irrelevant!
Outline
1. Introduction
2. Compressed Sensing Techniques
3. The Lone Star Algorithm
4. Some Open Problems: A New Algorithm?
What is the Problem?
Biologists are now able to generate massive amounts of data. The data come in different types: real numbers, integers, and Boolean variables. They are of variable quality and repeatability, so different error models are needed for different data types. What my students and I hope to contribute to cancer research: the development of predictors (classifiers and regressors) based on the integration of multiple types of data. Along the way, we hope to understand and analyze the behavior of various algorithms.
Broad Themes of Current Research 1
Understand the nature of biological data: what is being measured, how it is being measured, and, most importantly, what are the potential sources and the nature of error? Understand the types of questions to which biologists would like to have answers, e.g.:
- Prognosis for lung cancer patients
- Effects of applying multiple drug combinations, when we know only the effects of applying one drug at a time
- Identifying possible upstream genes when the gene of interest is not directly druggable
and so on.
Broad Themes of Current Research 2
And then:
- Devise appropriate algorithms to integrate multiple types of data and make predictions
- Validate the algorithms on existing data where outcomes are known
- Make predictions on outcomes of new experiments
- Most important: persuade biologists to undertake those experiments
- Use outcomes of the new experiments to fine-tune or reject the algorithms
- Repeat
An Abstract Problem Formulation
The data consists of labelled samples of the form $(x_{ij}, c_i)$, where $i = 1, \ldots, n$, $j = 1, \ldots, p$, and $p \gg n$. Here $n$ is the number of samples on which experiments are conducted, and $p$ is the number of entities that are measured for each sample. $c_i$ is the class label; it depends only on the sample and not on the entities being measured. The measurement $x_{ij}$ can be one of three types: a real number, a nonnegative integer, or a Boolean variable. The class label $c_i$ can be a real number, or one of $k \geq 2$ integer labels. Objective: construct a function $f$ such that $f(x_i)$ is a reasonably good predictor of $c_i$, where $x_i = (x_{ij}, j = 1, \ldots, p)$.
An Abstract Problem Formulation 2
In classification problems one often wants a posterior probability of belonging to a class. Define
$S_k = \{ v \in \mathbb{R}^k_+ : \sum_{i=1}^k v_i = 1 \}$,
the set of probability distributions on a set with $k$ elements. If $c_i \in \{1, \ldots, k\}$, then instead of seeking a function $f : x \mapsto f(x) \in \{1, \ldots, k\}$, one can ask that $f(x) \in S_k$. Then $f(x)$ is a $k$-dimensional vector with $f_l(x) = \Pr\{ c(x) = l \}$, $l = 1, \ldots, k$.
Some Terminology
Suppose $x_i \in \mathbb{R}^p$. (Note that this may not be true!) If $c_i \in \mathbb{R}$, then the problem is one of regression, or fitting a function to measured values. If $c_i \in \{1, \ldots, k\}$, then the problem is one of $k$-class classification. If $k = 2$ and $c_i \in \{-1, 1\}$ (after relabeling), and if we find a function $d : \mathbb{R}^p \to \mathbb{R}$ and set $f(x) = \mathrm{sign}[d(x)]$, then $d(\cdot)$ is called the discriminant function. Classification is an easier problem than regression. Also, a discriminant function need not be unique. Often $f$ (or $d$) is taken to be a linear function of $x$ (greater tolerance to noisy measurements, simpler interpretation, etc.).
Conventional Least-Squares Regression
Define $X = [x_{ij}] \in \mathbb{R}^{n \times p}$ and $y = [c_i] \in \mathbb{R}^n$, and try to fit $y$ with a linear function $X\beta$. Different ways of measuring the error of the fit lead to different methods. Standard least-squares regression is
$\min_\beta \| y - X\beta \|_2^2$.
If $X^T X \in \mathbb{R}^{p \times p}$ has rank $p$, then the solution is $\hat{\beta} = (X^T X)^{-1} X^T y$. This is the standard procedure when $n > p$ (more measurements than parameters).
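As a quick numerical check (my own synthetic example, not from the talk), the closed-form solution and the orthogonality of the residual to the columns of $X$ can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5                      # n > p: more measurements than parameters
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Closed-form least-squares solution: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Optimality: the residual y - X beta_hat is orthogonal to the columns of X
residual = y - X @ beta_hat
print(np.abs(X.T @ residual).max())   # numerically zero
```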
Ridge Regression
If $p > n$, then $X^T X$ is singular. Ridge regression is
$\min_\beta \| y - X\beta \|_2^2$ s.t. $\| \beta \|_2 \leq a$,
or, in Lagrangian formulation,
$J_{\mathrm{ridge}} = \min_\beta \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2$,
where $\lambda$ is the Lagrange multiplier.
Ridge Regression 2
The Hessian of $J_{\mathrm{ridge}}$ is $X^T X + \lambda I_p$, which is always positive definite; so an optimal solution always exists and is unique (a good thing). In fact,
$\hat{\beta}_{\mathrm{ridge}} = (X^T X + \lambda I_p)^{-1} X^T y$.
In general, all components of $\hat{\beta}_{\mathrm{ridge}}$ are nonzero (not a good thing). Why does this happen? Because the penalty term $\lambda \| \beta \|_2^2$ penalizes the norm of the coefficient vector $\beta$, not the number of nonzero components of $\beta$.
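A small sketch on made-up data with $p > n$, illustrating the slide's point: the ridge solution is always well defined, but essentially every coefficient comes out nonzero:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 20, 100, 0.5          # p > n, so X^T X is singular
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Ridge solution: (X^T X + lam I)^{-1} X^T y -- always well defined
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Optimality: the gradient X^T (X beta - y) + lam * beta vanishes
grad = X.T @ (X @ beta_ridge - y) + lam * beta_ridge
print(np.abs(grad).max())          # numerically zero

# The l2 penalty shrinks coefficients but does not zero them out
print(np.count_nonzero(beta_ridge), "of", p, "coefficients are nonzero")
```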
An NP-Hard Problem
Given $\beta \in \mathbb{R}^p$, define its $\ell_0$-"norm" $\| \beta \|_0$ as the number of nonzero components of $\beta$. Given a fixed integer $k$, the problem is
$\min_\beta \| y - X\beta \|_2^2$ s.t. $\| \beta \|_0 \leq k$,
or, in words: find the best fit in the $\ell_2$-norm that uses $k$ or fewer nonzero components. Bad news: this problem is NP-hard! So we need to try some other approach.
LASSO Regression
LASSO (or lasso) = least absolute shrinkage and selection operator. The problem formulation is (Tibshirani (1996))
$\min_\beta \| y - X\beta \|_2^2$ s.t. $\| \beta \|_1 \leq a$,
or, in Lagrangian formulation,
$J_{\mathrm{lasso}} = \min_\beta \| y - X\beta \|_2^2 + \lambda \| \beta \|_1$,
where $\lambda$ is the Lagrange multiplier.
LASSO Regression 2
Lasso regression is a quadratic programming problem, so it is very easy to solve. Unlike in ridge regression, in lasso the optimal $\hat{\beta}_{\mathrm{lasso}}$ has at most $n$ nonzero components, where $n$ is the number of measurements. General approach: start with some choice of the Lagrange multiplier $\lambda$, compute $\hat{\beta}_\lambda$, and trace the solution as $\lambda$ is varied; this is called the trajectory.
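A hedged illustration using scikit-learn's Lasso (its objective scales the squared error by $1/(2n)$, a detail of that implementation rather than of the talk). On synthetic data with only three informative features, the fitted coefficient vector is sparse:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 20, 100                    # p >> n
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -3.0, 1.5]  # only 3 informative features
y = X @ beta_true + 0.05 * rng.standard_normal(n)

# scikit-learn minimizes (1/(2n)) ||y - Xb||_2^2 + alpha * ||b||_1
model = Lasso(alpha=0.1).fit(X, y)
nnz = np.count_nonzero(model.coef_)
print(nnz, "nonzero coefficients out of", p)   # far fewer than p
```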
Elastic Net Regression
The elastic net formulation combines ridge and lasso regression. The problem formulation is (Zou and Hastie (2005))
$\min_\beta \| y - X\beta \|_2^2$ s.t. $\alpha \| \beta \|_2^2 + (1-\alpha) \| \beta \|_1 \leq a$,
or, in Lagrangian formulation,
$J_{\mathrm{en}} = \min_\beta \| y - X\beta \|_2^2 + \lambda [ \alpha \| \beta \|_2^2 + (1-\alpha) \| \beta \|_1 ]$,
where $\lambda$ is the Lagrange multiplier and $\alpha \in (0, 1)$ is another parameter. The elastic net solution also has at most $n$ nonzero entries, and convergence to the optimum is reportedly faster.
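For comparison, a minimal elastic-net sketch on the same kind of synthetic data; note that scikit-learn parameterizes the penalty via `alpha` and `l1_ratio` rather than the $(\lambda, \alpha)$ of the slide:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
n, p = 20, 100
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -3.0, 1.5]
y = X @ beta_true + 0.05 * rng.standard_normal(n)

# Penalty: alpha * (l1_ratio * ||b||_1 + 0.5 * (1 - l1_ratio) * ||b||_2^2)
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
nnz_en = np.count_nonzero(model.coef_)
print(nnz_en, "nonzero coefficients out of", p)   # sparse, thanks to the l1 part
```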
Support Vector Machines
The data is said to be linearly separable if there exist a weight vector $w \in \mathbb{R}^p$ and a bias $b$ such that
$x_i^T w - b > 0$ if $c_i = 1$, and $x_i^T w - b < 0$ if $c_i = -1$.
The data might not be linearly separable. If the data is linearly separable, then there exist infinitely many choices of $w, b$. Support Vector Machines (SVMs), due to Cortes and Vapnik (1995), give an elegant solution.
Linear SVM: Illustration
SVM (support vector machine): maximize the minimum distance from the data points to the separating hyperplane. This is equivalent to minimizing $\| w \|$ while achieving a gap of 1, i.e. $c_i (x_i^T w - b) \geq 1$ for all $i$.
Linear SVM: Some Properties
It is a quadratic programming problem: very easy to solve, even for huge data sets. The optimal weight $w$ is supported on just a few data points (the support vectors), so adding more vectors $x_i$ often does not change the optimal choice of $w$. In general, however, all components of the optimal $w$ are nonzero. That is OK if $n \gg p$, but not if $p \gg n$. (Linear separability is not an issue if $p \geq n - 1$.)
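A toy illustration (synthetic points, not from the talk) using scikit-learn's linear SVC, which solves the underlying QP: only a few of the training points end up as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters in R^2 (made-up data)
X = np.array([[-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0],
              [ 2.0,  1.0], [ 1.0,  2.0], [ 2.0,  2.0]])
c = np.array([-1, -1, -1, 1, 1, 1])

# Near-hard-margin linear SVM (large C); the QP is solved internally
svm = SVC(kernel="linear", C=1e3).fit(X, c)
print(svm.support_.size, "support vectors out of", len(X))
print((svm.predict(X) == c).all())   # separable data is classified perfectly
```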
Generalized SVMs
Due to Bradley and Mangasarian (1998). Measure distances in feature space using some norm $\| \cdot \|$; then distances in weight space are measured using the dual norm
$\| w \|_d = \max_{\| x \| \leq 1} x^T w$.
In the traditional SVM, the $\ell_2$-norm is used, which is its own dual; in this case, in general all components of the optimal $w$ are nonzero. If we use the $\ell_1$-norm on $x$, $\| x \|_1 = \sum_i |x_i|$, and hence the $\ell_\infty$-norm on $w$, $\| w \|_\infty = \max_i |w_i|$, then the optimal $w$ has at most $m$ nonzero entries ($m$ = number of samples). What if the data is not linearly separable? Settle for misclassifying as few samples as possible.
Modified $\ell_1$-Norm SVM
To trade off penalties for false positives and false negatives, incorporate ideas from Veropoulos et al. (1999). Let $M_1, M_2$ denote the two classes, with $m_1 = |M_1|$ and $m_2 = |M_2|$; choose a constant $\lambda$ close to zero and $\alpha \in (0, 1)$, and let $e$ denote a column vector of all ones. Then solve
$\min_{w, b, y, z} \ (1-\lambda)[ \alpha e_{n_1}^T y + (1-\alpha) e_{n_2}^T z ] + \lambda \| w \|_1$
s.t. $x_i^T w \geq b + 1 - y_i$, $1 \leq i \leq n_1$,
$x_i^T w \leq b - 1 + z_i$, $n_1 + 1 \leq i \leq n$,
$y \geq 0_{m_1}$, $z \geq 0_{m_2}$.
If $\alpha > 0.5$, more emphasis is placed on correctly classifying entries in $M_1$; the opposite holds if $\alpha < 0.5$.
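A minimal sketch of this linear program with `scipy.optimize.linprog`, writing $w = w^+ - w^-$ with $w^+, w^- \geq 0$; the data and the values of $\lambda$ and $\alpha$ are made up for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up separable data: class 1 near (2, 2), class 2 near (-2, -2)
X1 = np.array([[2.0, 2.5], [2.5, 2.0], [3.0, 3.0]])
X2 = np.array([[-2.0, -2.5], [-2.5, -2.0], [-3.0, -3.0]])
n1, n2, p = len(X1), len(X2), 2
lam, alpha = 0.01, 0.5            # lambda close to zero, as on the slide

# Variables: [w_plus (p), w_minus (p), b, y (n1), z (n2)], with w = w_plus - w_minus
obj = np.concatenate([lam * np.ones(2 * p), [0.0],
                      (1 - lam) * alpha * np.ones(n1),
                      (1 - lam) * (1 - alpha) * np.ones(n2)])

# x_i^T w >= b + 1 - y_i (class 1)  and  x_i^T w <= b - 1 + z_i (class 2)
A1 = np.hstack([-X1, X1, np.ones((n1, 1)), -np.eye(n1), np.zeros((n1, n2))])
A2 = np.hstack([X2, -X2, -np.ones((n2, 1)), np.zeros((n2, n1)), -np.eye(n2)])
A_ub = np.vstack([A1, A2])
b_ub = -np.ones(n1 + n2)
bounds = [(0, None)] * (2 * p) + [(None, None)] + [(0, None)] * (n1 + n2)

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w = res.x[:p] - res.x[p:2 * p]
b = res.x[2 * p]
print(res.success)                # LP solved; separable data gives zero slack
```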
S-Sparse Vectors
Model: as before, we have a predictor matrix $X \in \mathbb{R}^{n \times p}$. The measurement vector $y \in \mathbb{R}^n$ is given by
$y = X\beta + z$,
where $\beta \in \mathbb{R}^p$ is the parameter of interest and $z_1, \ldots, z_n$ are i.i.d. $N(0, \sigma^2)$. The vector $\beta$ is said to be $S$-sparse, where $S$ is a specified integer, if $\beta_j = 0$ for all except at most $S$ entries. Premise: the data is generated by a true but unknown $S$-sparse parameter vector $\beta$, and the measurements are corrupted by i.i.d. Gaussian noise.
The Dantzig Selector
Solution (the Dantzig selector): define the residual $r := y - X\beta$. The problem (Candes and Tao (2007)) is
$\min_\beta \| \beta \|_1$ s.t. $\| X^T r \|_\infty \leq c$,
where $c$ is some constant. This is equivalent to
$\min_\beta \| X^T r \|_\infty$ s.t. $\| \beta \|_1 \leq c$,
where $c$ is some (possibly different) constant. Clearly this is a linear programming problem.
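The first formulation can be posed directly as an LP, writing $\beta = u - v$ with $u, v \geq 0$. The sketch below (synthetic noiseless data, a made-up value of $c$) is my own illustration, not code from the talk:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
n, p = 40, 80
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[5, 20, 60]] = [3.0, -2.0, 1.5]   # 3-sparse truth
y = X @ beta_true                           # noiseless for this sketch
c = 1e-6                                    # constraint level (made-up choice)

# min ||beta||_1  s.t.  ||X^T (y - X beta)||_inf <= c,  beta = u - v
G = X.T @ X
Xty = X.T @ y
obj = np.ones(2 * p)
A_ub = np.vstack([np.hstack([-G,  G]),      #  X^T y - G(u - v) <= c
                  np.hstack([ G, -G])])     # -X^T y + G(u - v) <= c
b_ub = np.concatenate([c - Xty, c + Xty])
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
beta_hat = res.x[:p] - res.x[p:]
print(res.success)
```

Since the true $\beta$ is feasible here (zero residual), the optimal $\ell_1$ value can be no larger than $\| \beta_{\text{true}} \|_1$.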
Behavior of the Dantzig Algorithm: Noiseless Case
Suppose the matrix $X$ satisfies the uniform uncertainty principle (UUP), basically a statement about the maximum correlation between columns of $X$, and that $z = 0$. Then
$\min_\beta \| \beta \|_1$ s.t. $X\beta = y$
recovers the unknown $\beta$ exactly.
Behavior of the Dantzig Algorithm: Noisy Case
In the noisy case, we need to introduce a probability measure $P$ on the set of all $S$-sparse vectors; all conclusions are with respect to $P$. (Figure: the two coordinate axes in the $(\beta_1, \beta_2)$-plane, depicting the set of 1-sparse vectors $\beta$ in $\mathbb{R}^2$.) $P$ is a probability measure on the set of $S$-sparse vectors in $\mathbb{R}^p$.
Behavior of the Dantzig Algorithm: Noisy Case 2
There exist strictly increasing functions $f, g : \mathbb{R}_+ \to \mathbb{R}_+$ with $f(0) = g(0) = 0$ such that: for all $S$-sparse $\beta$ except those belonging to a set of volume $f(\sigma)$, the output $\hat{\beta}$ of the Dantzig selector satisfies $\| \hat{\beta} - \beta \|_2 \leq g(\sigma)$. Roughly speaking, for almost all true but unknown $\beta$, the output $\hat{\beta}$ of the Dantzig algorithm almost equals the true $\beta$. Explicit formulas are available for the functions $f(\cdot)$ and $g(\cdot)$.
Problem Formulation
We focus on two-class classification: $c_i \in \{-1, 1\}$. Without loss of generality, suppose $c_i = 1$ for $1 \leq i \leq n_1$ and $c_i = -1$ for $n_1 + 1 \leq i \leq n$, and define $n_2 = n - n_1$. Objective: given $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, and the class labels $c_i$, find a function $f : \mathbb{R}^p \to \mathbb{R}$ such that $f(x_i) > 0$ if $c_i = 1$ and $f(x_i) < 0$ if $c_i = -1$. Or, if this is not possible, misclassify relatively few samples.
The Lone Star Algorithm 1
Preliminary step: define $N_1 = \{1, \ldots, n_1\}$, $N_2 = \{n_1 + 1, \ldots, n\}$. For each $j = 1, \ldots, p$, compute the class means
$\mu_{j,l} = \frac{1}{n_l} \sum_{i \in N_l} x_{ij}$, $l = 1, 2$.
Eliminate all $j$ for which the difference $\mu_{j,1} - \mu_{j,2}$ is not statistically significant according to the Student t-test. Let $p$ again denote the reduced number of features. Choose integers $k_1, k_2$, the number of training samples for each run, such that $k_1 \leq n_1/2$, $k_2 \leq n_2/2$, $k_1 \approx k_2$.
The Lone Star Algorithm 2
1. Choose $k_1$ samples from $N_1$ and $k_2$ samples from $N_2$ at random. Run the $\ell_1$-norm SVM to compute the optimal $w$ on the reduced feature set. Repeat many times.
2. Each optimal $w$ has at most $k_1 + k_2$ nonzero entries, but in different locations. Let $k$ denote the number of nonzero entries for each run. Average all the optimal weights, and retain only the features with the $k$ highest average weights.
3. Repeat.
This is similar to RFE = Recursive Feature Elimination (Guyon et al., 2002). Algorithm: $\ell_1$-norm SVM with t-test and RFE, where SVM = Support Vector Machine and RFE = Recursive Feature Elimination. Acronym: $\ell_1$-StaR, pronounced "lone star".
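The steps above can be sketched as follows; this is my own illustrative reconstruction on synthetic data in which only feature 0 carries signal, with scikit-learn's $\ell_1$-penalized LinearSVC standing in for the $\ell_1$-norm SVM:

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.svm import LinearSVC

rng = np.random.default_rng(5)
n1 = n2 = 40
p = 200
X1 = rng.standard_normal((n1, p))
X1[:, 0] += 5.0                    # feature 0 is strongly informative
X2 = rng.standard_normal((n2, p))

# Preliminary step: t-test filter on the difference of class means
pvals = ttest_ind(X1, X2, axis=0).pvalue
keep = np.flatnonzero(pvals < 0.05)

# Steps 1-2: repeated l1-penalized linear SVMs on random subsamples, then average
k1, k2 = n1 // 2, n2 // 2
w_sum = np.zeros(keep.size)
for _ in range(20):
    i1 = rng.choice(n1, k1, replace=False)
    i2 = rng.choice(n2, k2, replace=False)
    Xtr = np.vstack([X1[np.ix_(i1, keep)], X2[np.ix_(i2, keep)]])
    ctr = np.concatenate([np.ones(k1), -np.ones(k2)])
    svm = LinearSVC(penalty="l1", dual=False, C=1.0).fit(Xtr, ctr)
    w_sum += np.abs(svm.coef_.ravel())

# Step 3 (one round): retain the features with the highest average weight
k = min(10, keep.size)
selected = keep[np.argsort(w_sum)[::-1][:k]]
print(selected)                    # the informative feature 0 survives
```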
The Lone Star Algorithm 3
The lone star algorithm has been applied to several data sets from various forms of cancer. Since it is based on an $\ell_1$-norm SVM, the optimal weight vector is guaranteed to have no more nonzero entries than the number of training samples. In all examples run thus far, the actual number of features used is many times smaller than the number of training samples! The question is: is this a general property, or just a fluke?
Regression and Classification with Mixed Measurements
Given data $\{(x_{ij}, y_i)\}$, suppose the data is of mixed type: for a fixed sample $i$, some $x_{ij} \in \mathbb{R}$ while other $x_{ij} \in \{0, 1\}$. How can we carry out regression and classification in this case?
Behavior of the Lone Star Algorithm
In all examples of cancer data on which the lone star algorithm has been tried ($\approx 20$), the number of final features selected is much less than the number of training samples. Can this be proved, or can situations be identified in which it can be expected to happen? What is the behavior of the Dantzig selector if there is no true but unknown parameter vector $\beta$? The current proof collapses irretrievably; does something else take its place?
Student t Test: Objective
Given two sets of samples $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_m)$, let
$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$, $\bar{y} = \frac{1}{m} \sum_{j=1}^m y_j$
denote the means of the two sets of samples. The Student t test can be used to determine whether the difference between the means is statistically significant.
Combining Features to Improve the t Statistic: Depiction
Given multiple measurements on two classes, we wish to combine them to increase the value of the t statistic (enhance discriminative ability). Depiction:
Class 1: an $n \times k$ array with rows $x_{11} \ldots x_{1k}$ through $x_{n1} \ldots x_{nk}$.
Class 2: an $m \times k$ array with rows $y_{11} \ldots y_{1k}$ through $y_{m1} \ldots y_{mk}$.
Weights: $\lambda_1, \ldots, \lambda_k$, one per column (feature).
Combining Features to Improve the t Statistic
Questions: given $k$ different sets of (real-valued) measurements $x_{1j}, \ldots, x_{nj}$ and $y_{1j}, \ldots, y_{mj}$, $j = 1, \ldots, k$, is it possible to form the linear combinations
$u_i = \sum_{j=1}^k x_{ij} \lambda_j$, $v_i = \sum_{j=1}^k y_{ij} \lambda_j$,
such that, for the right choice of the weights $\lambda$, we have
$t(u; v) > \max_j t(x_j; y_j)$?
(The combined statistic has greater discrimination than any individual statistic.) What is the optimal choice of the weight vector $\lambda$?
Student t Test: Details
Quick recap: let
$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$, $\bar{y} = \frac{1}{m} \sum_{j=1}^m y_j$
denote the two means, and define the pooled standard deviation
$S_P = \left[ \frac{1}{n+m-2} \left( \sum_{i=1}^n (x_i - \bar{x})^2 + \sum_{i=1}^m (y_i - \bar{y})^2 \right) \right]^{1/2}$.
Then the quantity
$t(x; y) = \left( \frac{1}{n} + \frac{1}{m} \right)^{-1/2} \frac{\bar{x} - \bar{y}}{S_P}$
satisfies the t distribution with $n + m - 2$ degrees of freedom.
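This recap can be checked numerically; the pooled statistic below matches SciPy's two-sample equal-variance t test:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(6)
x = rng.normal(1.0, 1.0, size=30)   # n = 30 samples, class 1
y = rng.normal(0.0, 1.0, size=25)   # m = 25 samples, class 2
n, m = x.size, y.size

# Pooled standard deviation and the t statistic, exactly as on the slide
S_P = np.sqrt((np.sum((x - x.mean())**2) + np.sum((y - y.mean())**2)) / (n + m - 2))
t = (x.mean() - y.mean()) / (S_P * np.sqrt(1/n + 1/m))

# Agrees with SciPy's two-sample equal-variance t test
print(np.isclose(t, ttest_ind(x, y, equal_var=True).statistic))  # True
```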
Optimal Choice of Weights
Define $X = [x_{ij}] \in \mathbb{R}^{n \times k}$, $Y = [y_{ij}] \in \mathbb{R}^{m \times k}$. For each index $j$, compute the associated means
$\bar{x}_j = \frac{1}{n} \sum_{i=1}^n x_{ij}$, $\bar{y}_j = \frac{1}{m} \sum_{i=1}^m y_{ij}$,
and the vectors and matrices
$\bar{x} = [\bar{x}_j] \in \mathbb{R}^k$, $\bar{y} = [\bar{y}_j] \in \mathbb{R}^k$, $c = \bar{x} - \bar{y} \in \mathbb{R}^k$, $C = \mathrm{Diag}(c_1, \ldots, c_k) \in \mathbb{R}^{k \times k}$.
Optimal Choice of Weights 2
Define the constants
$a = \left( \frac{1}{n} + \frac{1}{m} \right)^{1/2} \frac{n-1}{n+m-2}$, $b = \left( \frac{1}{n} + \frac{1}{m} \right)^{1/2} \frac{m-1}{n+m-2}$,
the covariance matrices
$\Sigma_X = (X - e_n \bar{x}^T)^T (X - e_n \bar{x}^T)$, $\Sigma_Y = (Y - e_m \bar{y}^T)^T (Y - e_m \bar{y}^T)$,
where $e_n$ is a column vector of $n$ ones, and finally
$M = C^{-1} (a \Sigma_X + b \Sigma_Y) C^{-1}$.
Then the optimal choice of $\lambda$ is
$\lambda = \frac{M^{-1} e_k}{e_k^T M^{-1} e_k}$.
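A direct transcription of these formulas on made-up data (the constants $a$ and $b$ follow the slide's formulas as reconstructed here, which I have not verified independently); by construction, the resulting weights sum to 1:

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, k = 30, 25, 4
X = rng.standard_normal((n, k)) + np.array([1.0, 0.5, 0.8, 0.3])  # class 1
Y = rng.standard_normal((m, k))                                    # class 2

xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
c = xbar - ybar
C_inv = np.linalg.inv(np.diag(c))
a = np.sqrt(1/n + 1/m) * (n - 1) / (n + m - 2)
b = np.sqrt(1/n + 1/m) * (m - 1) / (n + m - 2)
Sx = (X - xbar).T @ (X - xbar)     # scatter (unnormalized covariance) matrices
Sy = (Y - ybar).T @ (Y - ybar)
M = C_inv @ (a * Sx + b * Sy) @ C_inv

e = np.ones(k)
Minv_e = np.linalg.solve(M, e)
lam_vec = Minv_e / (e @ Minv_e)    # lambda = M^{-1} e / (e^T M^{-1} e)
print(lam_vec, lam_vec.sum())      # the weights sum to 1 by construction
```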
Application
Each measurement indicates some difference between the two classes. By combining measurements so as to optimize the t statistic, we can maximize the discriminatory ability, and also obtain a posterior probability of belonging to each of the two classes. Question: this work could have been done 70 or 80 years ago. Was it?
And Finally: A Confession
SIAM has asked me to write a book, and I have chosen a tentative title: Computational Cancer Biology: A Machine Learning Approach. What you have seen is an extended abstract of what I hope to cover. Feedback? Comments? Questions?