Compressed Sensing in Cancer Biology? (A Work in Progress)

Compressed Sensing in Cancer Biology? (A Work in Progress). M. Vidyasagar FRS, Cecil & Ida Green Chair, The University of Texas at Dallas. M.Vidyasagar@utdallas.edu, www.utdallas.edu/~m.vidyasagar. University of Cambridge, 23 November 2012.

A Cautionary Statement. Most talks are finished products: they cover completed research. This talk is an attempt to share our current thinking. It is like James Joyce's stream of consciousness: it represents work in progress. The questions that are currently occupying my mind are discussed. Six months from now, some or all of these questions might be deemed to be irrelevant!

Outline: Introduction; Compressed Sensing Techniques; The Lone Star Algorithm; Some Open Problems; A New Algorithm?

What is the Problem? Biologists are now able to generate massive amounts of data. Data forms are of different types: real numbers, integers, and Boolean values. Data forms are of variable quality and repeatability; different error models are needed for different data forms. What my students and I hope to contribute to cancer research: development of predictors (classifiers and regressors) based on the integration of multiple types of data. Along the way, we hope to understand and analyze the behavior of various algorithms.

Broad Themes of Current Research 1. Understand the nature of biological data: what is being measured, how is it being measured, and, most importantly, what are the potential sources and nature of error? Understand the types of questions to which biologists would like to have answers, e.g.: prognosis for lung cancer patients; effects of applying multiple drug combinations when we know only the effects of applying one drug at a time; identifying possible upstream genes when the gene of interest is not directly druggable; and so on.

Broad Themes of Current Research 2. And then: devise appropriate algorithms to integrate multiple types of data and make predictions; validate the algorithms on existing data where outcomes are known; make predictions about the outcomes of new experiments; most important, persuade biologists to undertake those experiments; use the outcomes of the new experiments to fine-tune or reject the algorithms; repeat.

An Abstract Problem Formulation. The data consists of labelled samples of the form (x_ij, c_i), where i = 1, ..., n, j = 1, ..., p, and p ≫ n. Here n is the number of samples on which experiments are conducted, and p is the number of entities that are measured for each sample. c_i is the class label; it depends only on the sample and not on the entities being measured. x_ij, the measurement, can be one of three types: a real number, a nonnegative integer, or a Boolean variable. c_i, the class label, can be a real number or one of k ≥ 2 integer values. Objective: construct a function f such that f(x_i) is a reasonably good predictor of c_i, where x_i = (x_ij, j = 1, ..., p).

An Abstract Problem Formulation 2. In classification problems one often wants a posterior probability of belonging to a class. Define S_k = {v ∈ R^k_+ : Σ_{i=1}^k v_i = 1}, the set of probability distributions on a set with k elements. If c_i ∈ {1, ..., k}, then instead of seeking a function f : x ↦ f(x) ∈ {1, ..., k}, one can ask that f(x) ∈ S_k. Then f(x) is a k-dimensional vector where f_l(x) = Pr{c(x) = l}, l = 1, ..., k.
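As an aside, not from the talk: off-the-shelf probabilistic classifiers return exactly this kind of simplex-valued output. A minimal scikit-learn sketch, with made-up data and k = 3 classes:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10))      # 150 samples, 10 measured entities
c = rng.integers(1, 4, size=150)        # class labels in {1, 2, 3}, i.e. k = 3

clf = LogisticRegression(max_iter=1000).fit(X, c)
probs = clf.predict_proba(X)            # each row is a vector in the simplex S_3
print(probs.shape, np.allclose(probs.sum(axis=1), 1.0))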

Some Terminology. Suppose x_i ∈ R^p. (Note that this may not be true!) If c_i ∈ R, then the problem is one of regression, or fitting a function to measured values. If c_i ∈ {1, ..., k}, then the problem is one of k-class classification. If k = 2 and c_i ∈ {-1, 1} (after relabeling), and if we find a function d : R^p → R and set f(x) = sign[d(x)], then d(·) is called the discriminant function. Classification is an easier problem than regression. Also, a discriminant function need not be unique. Often f (or d) is a linear function of x (greater tolerance to noisy measurements, simpler interpretation, etc.).

Conventional Least-Squares Regression. Define X = [x_ij] ∈ R^{n×p} and y = [c_i] ∈ R^n, and try to fit y with a linear function Xβ. Different ways of measuring the error of the fit lead to different methods. Standard least-squares regression is min_β ‖y - Xβ‖_2^2. If X^T X ∈ R^{p×p} has rank p, then the solution is β = (X^T X)^{-1} X^T y. This is the standard procedure when n > p (more measurements than parameters).
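As a minimal illustration, not taken from the talk, here is the normal-equations solution computed in NumPy for the n > p case, alongside the numerically preferable lstsq call; the simulated X, y, and dimensions are made-up.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                          # more measurements than parameters (n > p)
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Normal equations: beta = (X^T X)^{-1} X^T y, valid when X^T X has rank p
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent but better conditioned: least squares via SVD
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_ls, beta_lstsq))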

Ridge Regression. If p > n, then X^T X is singular. Ridge regression is min_β ‖y - Xβ‖_2^2 s.t. ‖β‖_2 ≤ a, or in Lagrangian formulation, J_ridge = min_β ‖y - Xβ‖_2^2 + λ‖β‖_2^2, where λ is the Lagrangian parameter.

Ridge Regression 2. The Hessian of J_ridge is X^T X + λI_p, which is always positive definite; so an optimal solution always exists and is unique (that is a good thing). In fact, β_ridge = (X^T X + λI_p)^{-1} X^T y. In general, all components of β_ridge will be nonzero, which is not a good thing. Why does this happen? Because the penalty term λ‖β‖_2^2 penalizes the norm of the coefficient vector β, not the number of nonzero components of β.
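A corresponding NumPy sketch of the ridge solution when p > n; the value of λ is an arbitrary illustrative choice, and the data are simulated.

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                         # p > n, so X^T X is singular and plain least squares fails
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

lam = 1.0                              # illustrative Lagrangian parameter
# beta_ridge = (X^T X + lam I_p)^{-1} X^T y; the Hessian X^T X + lam I_p is positive definite
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# In general every component is nonzero, which is the drawback noted above
print(np.count_nonzero(np.abs(beta_ridge) > 1e-12), "of", p, "components are nonzero")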

An NP-Hard Problem. Given β ∈ R^p, define its l_0-"norm" ‖β‖_0 as the number of nonzero components of β. Given a fixed integer k, the problem is min_β ‖y - Xβ‖_2^2 s.t. ‖β‖_0 ≤ k, or in words: find the best fit in the l_2-norm that uses k or fewer nonzero components. Bad news: this problem is NP-hard! So we need to try some other approach.

LASSO Regression. LASSO (or lasso) = least absolute shrinkage and selection operator. The problem formulation (Tibshirani (1996)) is min_β ‖y - Xβ‖_2^2 s.t. ‖β‖_1 ≤ a, or in Lagrangian formulation, J_lasso = min_β ‖y - Xβ‖_2^2 + λ‖β‖_1, where λ is the Lagrangian parameter.

LASSO Regression 2. Lasso regression is a quadratic programming problem, and hence very easy to solve. Unlike in ridge regression, in the lasso the optimal β_lasso has at most n nonzero components, where n is the number of measurements. General approach: start with some choice of the Lagrange multiplier λ, compute β_λ, and then let λ approach infinity. This is called the trajectory.
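A scikit-learn sketch of the above, with made-up data; the parameter alpha plays the role of the Lagrangian λ (up to scikit-learn's 1/(2n) scaling of the squared error), and lasso_path traces the trajectory of β_λ as the penalty varies.

import numpy as np
from sklearn.linear_model import Lasso, lasso_path

rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -3.0, 1.5, 4.0, -1.0]        # sparse ground truth
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# One fit at a single value of the penalty
beta_lasso = Lasso(alpha=0.1).fit(X, y).coef_
print(np.count_nonzero(beta_lasso))                # at most n nonzero components

# The trajectory: coefficients as the penalty parameter varies
alphas, coefs, _ = lasso_path(X, y)
print(coefs.shape)                                 # (p, number of penalty values)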

Elastic Net Regression. The elastic net formulation combines ridge and lasso regression. The problem formulation (Zou and Hastie (2005)) is min_β ‖y - Xβ‖_2^2 s.t. α‖β‖_2 + (1 - α)‖β‖_1 ≤ a, or in Lagrangian formulation, J_EN = min_β ‖y - Xβ‖_2^2 + λ[α‖β‖_2 + (1 - α)‖β‖_1], where λ is the Lagrangian parameter and α ∈ (0, 1) is another parameter. The elastic net solution also has at most n nonzero entries, and convergence to the optimum is supposedly faster.
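A scikit-learn sketch; note that scikit-learn's ElasticNet (following Zou and Hastie) mixes an l_1 term with a squared l_2 term, which differs slightly from the unsquared norm written above, and the values of alpha and l1_ratio below are illustrative only.

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -3.0, 1.5, 4.0, -1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# l1_ratio controls the mix between the l1 and (squared) l2 penalties
model = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)
print(np.count_nonzero(model.coef_))   # sparse, as with the lasso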

Support Vector Machines. The data is said to be linearly separable if there exists a weight vector w ∈ R^p and a bias b such that x_i^T w - b > 0 if c_i = 1, and x_i^T w - b < 0 if c_i = -1. The data might not be linearly separable. If the data is linearly separable, then there exist infinitely many choices of (w, b). Support Vector Machines (SVMs), due to Cortes and Vapnik (1995), give an elegant solution.

Linear SVM: Illustration. SVM (support vector machine): maximize the minimum distance to the separating hyperplane. Equivalent to: minimize ‖w‖ while achieving a gap of 1.

Linear SVM: Some Properties. It is a quadratic programming problem: very easy to solve, even for huge data sets. The optimal weight w is supported on just a few of the training vectors (the support vectors), so adding more vectors x_i often does not change the optimal choice of w. In general, all components of the optimal w are nonzero: OK if n ≫ p, but not if p ≫ n. Linear separability is not an issue if p ≥ n - 1.
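A small scikit-learn sketch of the traditional (l_2-penalized) linear SVM on simulated separable data, illustrating the point that in general every component of the optimal w is nonzero; the data and parameters are made-up.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
w_true = rng.standard_normal(p)
c = np.sign(X @ w_true)                        # linearly separable labels in {-1, +1}

# Traditional linear SVM: a quadratic program under the hood
svm = LinearSVC(C=1.0, max_iter=10000).fit(X, c)
print(np.count_nonzero(svm.coef_), "of", p, "weight components are nonzero")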

Generalized SVMs. Due to Bradley and Mangasarian (1998). Measure distances in feature space using some norm, and measure distances in weight space using the dual norm ‖w‖_d = max_{‖x‖ ≤ 1} x^T w. In the traditional SVM the l_2-norm is used, which is its own dual; in this case, in general all components of the optimal w are nonzero. If we use the l_1-norm on x, ‖x‖_1 = Σ_i |x_i|, together with the l_∞-norm on w, ‖w‖_∞ = max_i |w_i|, then the optimal w has at most m nonzero entries (m = number of samples). What if the data is not linearly separable? Settle for misclassifying as few samples as possible.

Modified l_1-Norm SVM. To trade off the penalties for false positives and false negatives, incorporate ideas from Veropoulos et al. (1999). Choose a constant λ close to zero and α ∈ (0, 1). Let m_1 = |M_1| and m_2 = |M_2| denote the sizes of the two classes, and let e_k denote a column vector of k ones. Then solve min_{w,b,y,z} (1 - λ)[α e_{n_1}^T y + (1 - α) e_{n_2}^T z] + λ‖w‖_1, subject to x_i^T w ≥ b + 1 - y_i for 1 ≤ i ≤ n_1, x_i^T w ≤ b - 1 + z_i for n_1 + 1 ≤ i ≤ n, and y ≥ 0_{m_1}, z ≥ 0_{m_2}. If α > 0.5, more emphasis is placed on correctly classifying the samples in M_1; the opposite holds if α < 0.5.
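As a sanity check on the formulation, here is a minimal cvxpy sketch of the linear program above; the data, λ, and α are made-up, and taking the penalty to be ‖w‖_1 (consistent with the name "l_1-norm SVM") is an assumption.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p = 30, 20, 100
X1 = rng.standard_normal((n1, p)) + 0.5        # class M_1 samples
X2 = rng.standard_normal((n2, p)) - 0.5        # class M_2 samples

lam, alpha = 0.05, 0.7                         # illustrative choices of lambda and alpha

w = cp.Variable(p)
b = cp.Variable()
y = cp.Variable(n1, nonneg=True)               # slacks for class M_1
z = cp.Variable(n2, nonneg=True)               # slacks for class M_2

objective = cp.Minimize((1 - lam) * (alpha * cp.sum(y) + (1 - alpha) * cp.sum(z))
                        + lam * cp.norm1(w))
constraints = [X1 @ w >= b + 1 - y,            # class M_1 on the positive side
               X2 @ w <= b - 1 + z]            # class M_2 on the negative side
cp.Problem(objective, constraints).solve()

print(np.count_nonzero(np.abs(w.value) > 1e-6), "nonzero weights")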

S-Sparse Vectors. Model: As before we have a predictor matrix X ∈ R^{n×p}. The measurement vector y ∈ R^n is given by y = Xβ + z, where β ∈ R^p is the parameter of interest and z_1, ..., z_n are i.i.d. N(0, σ²). The vector β is said to be S-sparse, where S is a specified integer, if β_j = 0 for all except at most S entries. Premise: the data is generated by a true but unknown S-sparse parameter vector β, and the measurements are corrupted by i.i.d. Gaussian noise.

The Dantzig Selector. Solution (the Dantzig selector): define r := y - Xβ to be the residual. The problem (Candes and Tao (2007)) is: min_β ‖β‖_1 s.t. ‖X^T r‖_∞ ≤ c, where c is some constant. This is equivalent to: min_β ‖X^T r‖_∞ s.t. ‖β‖_1 ≤ c, where c is some (possibly different) constant. Clearly this is a linear programming problem.
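A minimal cvxpy sketch of the Dantzig selector as stated above; the dimensions, the sparsity level, and the choice of the constant c are illustrative assumptions, not values from the talk.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, p, S, sigma = 80, 200, 5, 0.1
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[rng.choice(p, S, replace=False)] = rng.standard_normal(S)   # S-sparse truth
y = X @ beta_true + sigma * rng.standard_normal(n)

# Roughly sigma * sqrt(2 log p), rescaled because the columns of X are not unit-norm
c = sigma * np.sqrt(2 * np.log(p)) * np.sqrt(n)

beta = cp.Variable(p)
residual = y - X @ beta
cp.Problem(cp.Minimize(cp.norm1(beta)),
           [cp.norm(X.T @ residual, 'inf') <= c]).solve()

print(np.count_nonzero(np.abs(beta.value) > 1e-6), "nonzero entries recovered")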

Behavior of the Dantzig Algorithm: Noiseless Case. Suppose the matrix X satisfies the uniform uncertainty principle (UUP), basically a statement about the maximum correlation between columns of X, and that z = 0. Then min_β ‖β‖_1 s.t. Xβ = y recovers the unknown β exactly.
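The exact-recovery statement can be checked numerically; a small sketch with a random Gaussian X (which satisfies UUP-type conditions with high probability) and made-up dimensions:

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, p, S = 80, 200, 5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[rng.choice(p, S, replace=False)] = rng.standard_normal(S)
y = X @ beta_true                              # noiseless measurements (z = 0)

beta = cp.Variable(p)
cp.Problem(cp.Minimize(cp.norm1(beta)), [X @ beta == y]).solve()

print(np.allclose(beta.value, beta_true, atol=1e-4))   # exact recovery, up to solver tolerance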

Behavior of the Dantzig Algorithm: Noisy Case. In the noisy case, we need to introduce a probability measure P on the set of all S-sparse vectors; all conclusions are with respect to P. [Figure: the red lines depict the set of 1-sparse vectors β in R^2, drawn in the (β_1, β_2) plane; P is a probability measure on the set of S-sparse vectors in R^p.]

Behavior of the Dantzig Algorithm: Noisy Case 2. There exist strictly increasing functions f, g : R_+ → R_+ with f(0) = g(0) = 0 such that, for all S-sparse β except those belonging to a set of volume f(σ), the output β̂ of the Dantzig selector satisfies ‖β̂ - β‖_2 ≤ g(σ). Roughly speaking, for almost all true but unknown β, the output β̂ of the Dantzig algorithm almost equals the true β. Explicit formulas are available for the functions f(·) and g(·).

Problem Formulation. We focus on two-class classification: c_i ∈ {-1, 1}. Without loss of generality, suppose c_i = 1 for 1 ≤ i ≤ n_1 and c_i = -1 for n_1 + 1 ≤ i ≤ n, and define n_2 = n - n_1. Objective: given x_i ∈ R^p, i = 1, ..., n, and the class labels c_i, find a function f : R^p → R such that f(x_i) > 0 if c_i = 1 and f(x_i) < 0 if c_i = -1, or, if this is not possible, one that misclassifies relatively few samples.

The Lone Star Algorithm 1. Preliminary step: define N_1 = {1, ..., n_1}, N_2 = {n_1 + 1, ..., n}. For all j = 1, ..., p, compute the class means μ_{j,l} = (1/n_l) Σ_{i ∈ N_l} x_{i,j}, l = 1, 2. Eliminate all j for which the difference μ_{j,1} - μ_{j,2} is not statistically significant, using the Student t-test. Let p again denote the reduced number of features. Choose integers k_1, k_2, the numbers of training samples for each run, such that k_1 ≈ n_1/2, k_2 ≈ n_2/2, and k_1 ≈ k_2.

The Lone Star Algorithm 2. (1) Choose k_1 samples from N_1 and k_2 samples from N_2 at random, and run the l_1-norm SVM to compute an optimal w on the reduced feature set; repeat many times. (2) Each optimal w has at most k_1 + k_2 nonzero entries, but in different locations; let k denote the number of nonzero entries for each run. Average all the optimal weights, and retain only the features with the k highest weights. (3) Repeat. This is similar to RFE = Recursive Feature Elimination (Guyon et al., 2002). Algorithm: l_1-norm SVM with t-test and RFE, where SVM = Support Vector Machine and RFE = Recursive Feature Elimination. Acronym: l_1-StaR, pronounced "lone star".
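To make the steps concrete, here is a rough Python sketch of one elimination round. This is not the authors' implementation: the t-test threshold, the number of runs, the helper name lone_star_sketch, and the use of scikit-learn's l_1-penalized LinearSVC in place of the exact l_1-norm SVM linear program are all assumptions made for illustration.

import numpy as np
from scipy.stats import ttest_ind
from sklearn.svm import LinearSVC

def lone_star_sketch(X1, X2, k1, k2, k, n_runs=100, p_thresh=0.05, seed=0):
    """One elimination round, in the spirit of the algorithm described above."""
    rng = np.random.default_rng(seed)

    # Preliminary step: keep only features whose class means differ significantly (Student t-test)
    _, pvals = ttest_ind(X1, X2, axis=0)
    keep = np.where(pvals < p_thresh)[0]

    weights = np.zeros(len(keep))
    for _ in range(n_runs):
        # Draw k1 and k2 training samples at random from the two classes
        i1 = rng.choice(len(X1), k1, replace=False)
        i2 = rng.choice(len(X2), k2, replace=False)
        Xtr = np.vstack([X1[i1][:, keep], X2[i2][:, keep]])
        ytr = np.concatenate([np.ones(k1), -np.ones(k2)])

        # l1-penalized linear SVM stands in for the l1-norm SVM of the talk
        clf = LinearSVC(penalty="l1", dual=False, C=1.0).fit(Xtr, ytr)
        weights += np.abs(clf.coef_.ravel())

    # Average the weights over the runs and retain the k highest-weighted features (RFE-style)
    weights /= n_runs
    return keep[np.argsort(weights)[::-1][:k]]

Calling such a routine repeatedly with a shrinking k would mimic the recursive elimination loop of step (3).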

The Lone Star Algorithm 3. The lone star algorithm has been applied to several data sets from various forms of cancer. Since it is based on an l_1-norm SVM, the optimal weight vector is guaranteed to have no more nonzero entries than the number of training samples. In all examples run thus far, the actual number of features used is many times smaller than the number of training samples! The question is: is this a general property, or just a fluke?

Regression and Classification with Mixed Measurements. Given data {(x_ij, y_i)}, suppose the data is of mixed type, so that for a fixed sample i, some x_ij ∈ R while other x_ij ∈ {0, 1}. How can we carry out regression and classification in this case?

Behavior of the Lone Star Algorithm. In all examples of cancer data on which the lone star algorithm has been tried (about 20), the number of final features selected is much less than the number of training samples. Can this be proved, or can situations be identified in which it can be expected to happen? What is the behavior of the Dantzig selector if there is no true but unknown parameter vector β? The current proof collapses irretrievably; does something else take its place?

Student t Test: Objective. Given two sets of samples x = (x_1, ..., x_n) and y = (y_1, ..., y_m), let x̄ = (1/n) Σ_{i=1}^n x_i and ȳ = (1/m) Σ_{j=1}^m y_j denote the means of the two sets of samples. The Student t test can be used to determine whether the difference between the means is statistically significant.

Combining Features to Improve the t Statistic: Depiction. Given multiple measurements on two classes, we wish to combine them to increase the value of the t statistic (enhance discriminative ability). [Table: Class 1 measurements x_11, ..., x_1k through x_n1, ..., x_nk; Class 2 measurements y_11, ..., y_1k through y_m1, ..., y_mk; combining weights λ_1, ..., λ_k.]

Combining Features to Improve the t Statistic. Questions: Given k different sets of (real-valued) measurements x_{1j}, ..., x_{nj} and y_{1j}, ..., y_{mj}, j = 1, ..., k, is it possible to form linear combination statistics u_i = Σ_{j=1}^k x_{ij} λ_j and v_i = Σ_{j=1}^k y_{ij} λ_j such that, for the right choice of the weights λ, we have t(u; v) > max_j t(x_j; y_j)? (That is, the combined statistic has greater discrimination than any individual statistic.) What is the optimal choice of the weight vector λ?

Student t Test: Details. Quick recap: define x̄ = (1/n) Σ_{i=1}^n x_i and ȳ = (1/m) Σ_{j=1}^m y_j, the two means, and the pooled standard deviation S_P = [ (1/(n + m - 2)) ( Σ_{i=1}^n (x_i - x̄)^2 + Σ_{i=1}^m (y_i - ȳ)^2 ) ]^{1/2}. Then the quantity t(x; y) = (1/n + 1/m)^{-1/2} (x̄ - ȳ) / S_P satisfies the t distribution with n + m - 2 degrees of freedom.
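A quick numerical check of this formula against scipy's equal-variance two-sample test; the sample sizes and data below are made-up.

import numpy as np
from scipy.stats import ttest_ind

def pooled_t(x, y):
    """Two-sample t statistic with pooled standard deviation, as defined above."""
    n, m = len(x), len(y)
    sp = np.sqrt((np.sum((x - x.mean())**2) + np.sum((y - y.mean())**2)) / (n + m - 2))
    return (x.mean() - y.mean()) / (sp * np.sqrt(1.0 / n + 1.0 / m))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=40)
y = rng.normal(0.5, 1.0, size=30)

print(pooled_t(x, y))
print(ttest_ind(x, y).statistic)   # matches the equal-variance t statistic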

Optimal Choice of Weights. Define X = [x_ij] ∈ R^{n×k}, Y = [y_ij] ∈ R^{m×k}. For each index j, compute the associated means x̄_j = (1/n) Σ_{i=1}^n x_ij and ȳ_j = (1/m) Σ_{i=1}^m y_ij, and the vectors and matrices x̄ = [x̄_j] ∈ R^k, ȳ = [ȳ_j] ∈ R^k, c = x̄ - ȳ ∈ R^k, C = Diag(c_1, ..., c_k) ∈ R^{k×k}.

Optimal Choice of Weights 2. Define the constants a = (1/n + 1/m)^{1/2} (n - 1)/(n + m - 2) and b = (1/n + 1/m)^{1/2} (m - 1)/(n + m - 2), the scatter matrices Σ_X = (X - e_n x̄^T)^T (X - e_n x̄^T) and Σ_Y = (Y - e_m ȳ^T)^T (Y - e_m ȳ^T), where e_n is a column vector of n ones, and finally M = C^{-1} (a Σ_X + b Σ_Y) C^{-1}. Then the optimal choice of λ is λ = (1 / (e_k^T M^{-1} e_k)) e_k^T M^{-1}.
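A direct NumPy transcription of these formulas; the reading of Σ_X, Σ_Y as k×k scatter matrices and the sign of the 1/2 exponent in a and b follow the reconstruction above and should be treated as assumptions, and the data are made-up.

import numpy as np

def optimal_combination_weights(X, Y):
    """Weights lambda for combining k measurements, following the formulas above.
    X: n x k measurements for class 1; Y: m x k measurements for class 2."""
    n, k = X.shape
    m = Y.shape[0]

    xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
    c = xbar - ybar
    Cinv = np.diag(1.0 / c)                        # C^{-1} with C = Diag(c)

    a = np.sqrt(1.0 / n + 1.0 / m) * (n - 1) / (n + m - 2)
    b = np.sqrt(1.0 / n + 1.0 / m) * (m - 1) / (n + m - 2)

    Sigma_X = (X - xbar).T @ (X - xbar)            # k x k scatter matrices
    Sigma_Y = (Y - ybar).T @ (Y - ybar)

    M = Cinv @ (a * Sigma_X + b * Sigma_Y) @ Cinv
    Minv_e = np.linalg.solve(M, np.ones(k))
    return Minv_e / Minv_e.sum()                   # lambda = M^{-1} e_k / (e_k^T M^{-1} e_k)

# Hypothetical example: two classes, k = 4 measurements per sample
rng = np.random.default_rng(0)
X = rng.normal(0.3, 1.0, size=(40, 4))
Y = rng.normal(0.0, 1.0, size=(30, 4))
lam = optimal_combination_weights(X, Y)
u, v = X @ lam, Y @ lam                            # combined statistics for the two classes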

Application. Each measurement indicates some difference between the two classes. By combining measurements so as to optimize the t statistic, we can maximize the discriminatory ability, and also get a posterior probability of belonging to the two classes. Question: this work could have been done 70 or 80 years ago. Was it?

And Finally: A Confession. SIAM has asked me to write a book, and I have chosen a tentative title: Computational Cancer Biology: A Machine Learning Approach. What you have seen is an extended abstract of what I hope to cover. Feedback? Comments? Questions?