Sparse Proteomics Analysis


Technische Universität Berlin
Fakultät II, Institut für Mathematik

Sparse Proteomics Analysis
Toward a Mathematical Foundation of Feature Selection and Disease Classification

Master's thesis (Masterarbeit) for the degree of Master of Science in Mathematics
submitted by Martin Genzel (Matr.-Nr. )
First examiner (Erstgutachterin): Prof. Dr. rer. nat. habil. Gitta Kutyniok
Second examiner (Zweitgutachter): Prof. Dr. rer. nat. habil. Reinhold Schneider

Declaration of Independent Work (Erklärung der Selbstständigkeit)

I hereby declare that I have produced the present work independently and by my own hand, without unauthorized outside help, and exclusively using the sources and aids listed.

Berlin, June 20
Martin Genzel

Acknowledgments

First of all, I would like to thank my supervisor Prof. Gitta Kutyniok, who offered me the opportunity to write my Master's thesis on this interesting and advanced topic in applied mathematics. In particular, I am deeply grateful that she also gave me the chance to collaborate with several experts in this field, which has inspired this work substantially. I would also like to thank Prof. Tim Conrad for providing me with real-world data sets and several images which are used here. Moreover, he has helped me to understand the biological background of proteomics research and related areas. Furthermore, I am very thankful for the fruitful discussions with Prof. Roman Vershynin and Prof. Jan Vybíral, which gave me further insights and were actually the inspiration for some new ideas. My special thanks also go to my colleagues Anton Kolleck, Philipp Petersen, and Jackie Ma, who were always available for discussing problems with me. Last but not least, I especially would like to thank my father for spending his precious time on proofreading my work, which became quite long in the end.


Contents

1 Introduction
  1.1 Motivation
  1.2 Feature Selection from High-Dimensional Data
  1.3 Main Results and Related Work
  1.4 Structure of this Thesis
  1.5 Notation
2 Biological Background of Proteomics Data
  2.1 What is Proteomics?
  2.2 Data Acquisition via Mass Spectrometry
  2.3 Detection of Disease Fingerprints
3 Toward Sparse Proteomics Analysis (SPA)
  3.1 Transfer to a Mathematical Setting
  3.2 SPA at a Glance
    3.2.1 SPA-Normalize: Rescaling the Data
    3.2.2 SPA-Scattering: Robustification of the Data
    3.2.3 SPA-Standardize: Standardization by Weighted Interpolation
    3.2.4 SPA-Select: Sparse Feature Selection
    3.2.5 SPA-Sparsify: Detection of the Connected Components of ω
    3.2.6 SPA-Recover and SPA-Reduce: Feature Recovery from Scattering Coefficients and Dimension Reduction
4 Theory for SPA
  4.1 Statistical Models for Proteomics Data
    4.1.1 A Correlation-Based Forward Model
    4.1.2 Discussion and Coherence Between Features
    4.1.3 A Simplified Backward Model
  4.2 Robustification via the Scattering Transform
    4.2.1 The Scattering Transform
    4.2.2 Application to Proteomics Data
  4.3 Feature Selection via Robust 1-Bit Compressed Sensing
    4.3.1 Why Using (1-Bit) Compressed Sensing?
    4.3.2 Analysis of (Rob 1-Bit)
    4.3.3 Statistical Analysis of the Proteomics Model
    4.3.4 Solution Paths and Sparsity
  4.4 Feature Selection via ℓ1-SVM and the LASSO
    4.4.1 Geometric Intuition Behind the Squared and Hinge Loss
    4.4.2 Piecewise-Linear Solution Paths
    4.4.3 Sparsity and Coherent Features
5 Numerical Experiments
  5.1 Feature Selection for Simulated Data
    5.1.1 Data Generation
    5.1.2 Performed Experiments
    5.1.3 Discussion
  5.2 Classification Performance for Real-World Data
    5.2.1 Data Sets
    5.2.2 Performed Experiments
    5.2.3 Discussion
6 Extensions and Further Approaches
  6.1 Using Multiple Layers of the Scattering Transform
  6.2 Feature Space Maps and Kernel Learning
  6.3 Optimal Interpolation Weights
  6.4 Additional Penalties
    6.4.1 Elastic Net
    6.4.2 Fused Penalty
    6.4.3 The Adaptive LASSO and Weighted Constraints
  6.5 Multi-Label Classification
  6.6 Detection of Feature Atoms
  6.7 Further Approaches for Feature Selection
7 Conclusions and Outlook
Appendices
A Mathematical Fundamentals
  A.1 The Fourier Transform
  A.2 Wavelet Analysis
    A.2.1 The Short-Time Fourier Transform
    A.2.2 The (Complex) Wavelet Transform
  A.3 Basic Notions from Probability Theory and Statistics
  A.4 Convex Analysis
  A.5 Compressed Sensing
    A.5.1 Classical Compressed Sensing
    A.5.2 1-Bit Compressed Sensing
  A.6 Machine Learning
B Implementation of SPA
C Zusammenfassung in deutscher Sprache (Summary in German)
Bibliography

1 Introduction

1.1 Motivation

Tumor diseases, such as cancer, rank among the most frequent causes of death in western countries (cf. [19, 29]). During the last decades, it has turned out that many underlying pathological mechanisms are manifested at the stage of protein activities. Therefore, in order to ultimately improve clinical treatment options and diagnostics, it is of fundamental importance to understand the protein structures in the human body as well as their interactions. This challenge can be seen as the origin of proteomics research, which is the large-scale study of the human proteome, that is, the entire set of an individual's proteins at a certain point of time. Compared to the human genome (containing about 20,000 genes), a proteome is significantly more complex, involving millions of different molecule types. Proteomics data is therefore usually extremely high-dimensional, and at first sight, the search for disease-relevant features seems to be a hopeless task. But in many practical situations, the number of molecular structures which are directly related to a specific disease is actually very small. Such a collection of characteristic features is called a disease fingerprint. Besides further insights into pathological mechanisms, the successful detection of a disease fingerprint could also lead to the development of novel biomarkers, which allow for more reliable early diagnosis of diseases (see Figure 1.1).

1.2 Feature Selection from High-Dimensional Data

The previous paragraph already provides strong evidence that the search for a disease fingerprint is naturally related to the fields of high-dimensional data analysis and sparse signal processing. However, a proteome, as a set of proteins, is defined in a rather abstract sense. A widely-used approach to represent a proteome in practice is the method of mass spectrometry (MS). This technique allows us to determine the abundance of proteins in a certain sample (blood, urine, etc.) by a so-called mass spectrum; Figure 1.2 shows a typical example. Every peak of a mass spectrum can be identified with a specific type of molecule, and the corresponding amplitude is proportional to its concentration in the sample. In view of the above problem, the major goal is now to detect those peaks which are strongly correlated with a certain disease. This set of features eventually forms a potential candidate for a disease fingerprint.

[Figure 1.1: bar charts (a) Current and (b) With early diagnosis, showing the diagnosis stage distribution and the 5-year survival rate for stages I-IV.]

Figure 1.1: Illustration of the benefit of early diagnosis: The red bars indicate the (average) chance of a patient's survival after 5 years when a cancer disease was diagnosed at the corresponding stage (I-IV, where IV is the final stage). Currently, in (a), the diagnosis mostly does not happen before stage III, implying that the chance of survival is already relatively low (green bars). The ultimate goal is to achieve a situation as in (b) where a disease is diagnosed much earlier. In this way, the total number of successful recoveries could be significantly increased, even without any improvement of the actual treatment options.

Figure 1.2: An example of a mass spectrum (taken from the data set of Section 5.2). The (molecular) mass is plotted against the number of counted ions. Each high-amplitude peak can be identified with a specific type of molecule in the sample. The mass value at which a peak is centered corresponds to its molecular mass.

Let us make this idea a bit more precise now: Suppose that we are given sample pairs from $n$ patients, $(x_1, y_1), \dots, (x_n, y_n)$, where $x_k \in \mathbb{R}^d$ denotes the mass spectrum of the $k$-th patient and $y_k \in \{-1, +1\}$ indicates his/her health status (healthy or diseased). Using this training set, we ultimately intend to learn a feature vector $\omega \in \mathbb{R}^d$ whose support shall correspond to those positions in the spectra which allow for a distinction between the groups of healthy and diseased patients. In order to satisfy the wish for a small disease fingerprint, $\omega$ should be sparse, i.e., the number of non-zero entries should be significantly smaller than the dimension $d$; in addition, these non-zero entries should also be interpretable so that they may be identified with proteins later on. Such a process of feature selection is often a difficult task in practice. Indeed, real-world mass spectra are extremely high-dimensional ($d$ could range from $10^4$ to $10^8$), affected by strong baseline noise, and the peak amplitudes usually have a large variance. Thus, metaphorically speaking, we are actually trying to find the needle in the haystack.

1.3 Main Results and Related Work

For the reasons mentioned above, the goal of this thesis is to develop Sparse Proteomics Analysis (SPA), a generic method for extracting biologically relevant information from high-dimensional and noisy data sets. This framework should not be considered as a single algorithm, but rather as a sophisticated combination of tools from various areas, most notably machine learning, compressed sensing, statistics, and harmonic analysis. The first version of SPA was introduced in [21], which particularly forms the basis of this work. However, there are certain limitations in this original approach, asking for an improvement of SPA. The following aspects highlight the major extensions and contributions that are presented in this thesis:

Scattering transform: This framework is originally motivated by problems from audio- and image-processing (see [5, 12, 57]). The basic idea here is to compute cascades of wavelet transforms (combined with the non-linear complex modulus) in order to generate a signal representation which is translation invariant and robust against diffeomorphic deformations. Interestingly, it will turn out that this approach is useful for MS-data as well: Applying the scattering transform, we will be able to significantly reduce the baseline noise and small deviations of the peak positions.

Standardization: In machine learning, it is common practice to standardize the input data such that every feature variable has (empirical) mean 0 and standard deviation 1. But if only a few samples are available compared to the data dimension, the feature selection could become unstable. Moreover, MS-data has a very characteristic functional structure which would be completely destroyed by a standardization. For this reason, we will introduce the novel approach of weighted interpolation, which allows us to interpolate between the data set and its standardized version in an adaptive manner.

Footnote: The support of $\omega = (\omega_1, \dots, \omega_d) \in \mathbb{R}^d$ is defined by the set $\mathrm{supp}(\omega) := \{l \mid \omega_l \neq 0\} \subseteq \{1, \dots, d\}$.

Feature selection: The authors of [21] propose a feature selection algorithm which is based on the ideas of 1-bit compressed sensing. This strategy might appear somewhat unusual in our setting, but the corresponding classification and regression problems of machine learning are remarkably similar to those of compressed sensing. Since only little effort has been devoted so far to investigating this relationship, it could be particularly inspiring to consider approaches from compressed sensing as well. However, it will turn out that the method of [21] is actually too simple for certain scenarios. Therefore, SPA additionally incorporates classical methods for sparse feature selection, such as the $\ell_1$-SVM ([90]) and the LASSO ([76]).

In the machine learning community, there exists the widespread opinion that the issues of feature selection and (disease) classification are basically equivalent. Although there are obvious similarities, this statement is not fully correct. Indeed, a feature vector $\omega \in \mathbb{R}^d$ that perfectly classifies a training set $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, +1\}$ by

    $y_k = \mathrm{sign}(\langle x_k, \omega \rangle), \quad k = 1, \dots, n,$   (1.1)

is not necessarily an appropriate candidate for a disease fingerprint; for instance, $\omega$ could contain too many redundant features. Therefore, it is often useful to incorporate some prior knowledge into the classification procedure, e.g., sparsity. In this way, the classification accuracy on the training data might become slightly worse, but on the other hand, the generalization performance of the classifier improves and its entries can be interpreted more easily. But even with additional constraints, the simple classification model of (1.1) does not take into account how the samples and the features-of-interest were actually generated. Thus, a successful detection of fingerprints does not solely depend on the classification accuracy but also on the underlying data model. For this reason, we will primarily focus on the more general perspective of feature extraction, and not on disease classification, which should rather be seen as an instrument to realize a selection process.

The issue of the previous paragraph is closely related to the question of data modeling. Almost all approaches in the literature, including that of SPA, are based on backward models, such as (1.1), which merely describe how an observation $y_k$ can be extracted from a datum $x_k$. But since this does not involve the underlying distribution of the samples, a theoretical analysis is entirely missing in most cases. Another major goal of this work is therefore to make a first step toward a mathematical foundation of feature selection. For this purpose, we will introduce a linear forward model for mass spectra, which precisely describes how the input data is generated. A statistical analysis will then show that SPA is indeed able to recover all relevant features. In this context, we will particularly see that the above extensions have a beneficial impact on the performance of SPA. Finally, several numerical experiments (for both simulated and real-world data) will verify that our theoretical findings also apply to practical situations.

Although this thesis focuses on the specific case of MS-data, the individual steps of SPA are designed to work in a more general setting. This particularly coincides with the fact that our rigorous analysis actually relies only on a few structural properties of the data model.

Compared to typical domain-based approaches that are highly adapted to the situation of MS-data (see [1, 20, 40, 77] for example), SPA might indeed perform slightly worse; however, our framework is much more flexible, and perhaps with minor modifications it applies to different types of data, which may not even come from biology. Furthermore, we are rather aiming for a proof of concept, making a perfect tuning of our method unnecessary anyway. At this point, it should be clearly emphasized that SPA is just one of many approaches for feature selection from proteomics data. A brief overview of recent methods from the literature is given in Section 6.7.

1.4 Structure of this Thesis

Chapter 2 provides more details on the biological background of SPA. However, the language of this part will still be rather informal. Subsequently, the framework of SPA is introduced in Chapter 3. Some intuition behind the single algorithms is already given there, but a detailed discussion is postponed to the next chapter. In fact, Chapter 4 may be seen as the heart of this thesis and develops a theory for SPA. Here, we will especially define an abstract model for MS-data and analyze how our method performs in this situation. The experimental results are then presented in Chapter 5, and finally, we will consider various extensions of SPA in Chapter 6. This structure particularly implies that the discussion is distributed over several chapters. Thus, the thesis is written in such a way that the reader may also jump from chapter to chapter in order to read about different aspects of a certain topic, e.g., the scattering transform or feature selection.

1.5 Notation

Before moving on to the next chapter, it seems reasonable to introduce some general notation that will frequently occur in this work:

Numbers and sets: The sets of numbers are denoted as usual by $\mathbb{N}$, $\mathbb{Z}$, $\mathbb{Q}$, $\mathbb{R}$, and $\mathbb{C}$, where 0 is not included in $\mathbb{N}$. Thus, we put $\mathbb{N}_0 := \mathbb{N} \cup \{0\}$. For $t \in \mathbb{R}$, we define the following functions:

    Floor: $\lfloor t \rfloor := \max\{k \in \mathbb{Z} \mid k \leq t\}$,
    Ceiling: $\lceil t \rceil := \min\{k \in \mathbb{Z} \mid k \geq t\}$,
    Signum: $\mathrm{sign}(t) := +1$ if $t > 0$, $-1$ if $t < 0$, and $0$ if $t = 0$,
    Positive part: $[t]_+ := \max\{t, 0\}$,
    Negative part: $[t]_- := \min\{t, 0\}$.

If $z \in \mathbb{C}$ is a complex number, we denote the real part by $\mathrm{Re}(z)$, the imaginary part by $\mathrm{Im}(z)$, and its complex conjugate by $\bar{z}$. Moreover, the complex modulus (or absolute value) of $z$ is given by $|z| = \sqrt{\mathrm{Re}^2(z) + \mathrm{Im}^2(z)}$. The cardinality of an arbitrary (finite) set $\Omega$ is denoted by $\#\Omega$. Furthermore, we will often deal with index sets of natural numbers. In this case, it is beneficial to use the compact notation $[n] := \{1, \dots, n\}$, where $n \in \mathbb{N}$.

Euclidean vector spaces: Let $d \in \mathbb{N}$. Vectors of $\mathbb{R}^d$ are always written in bold face, e.g., $x, y \in \mathbb{R}^d$. The entries of a vector are then denoted by the same letter with a subindex: for $x \in \mathbb{R}^d$, we write $x = (x_1, \dots, x_d)$. Note, however, that $x$ is considered as a column vector. Thus, the expression $[x_1 \ \dots \ x_n] \in \mathbb{R}^{d,n}$ is actually a $(d \times n)$-matrix with columns $x_1, \dots, x_n \in \mathbb{R}^d$. Finally, the transposes of a vector $x \in \mathbb{R}^d$ and a matrix $A \in \mathbb{R}^{d,n}$ are denoted by $x^\top$ and $A^\top$, respectively.

Now, let $x \in \mathbb{R}^d$ and $p \in (0, \infty) \cup \{\infty\}$. Then, we define the $\ell_p$-(quasi-)norm by

    $\|x\|_p := \Big( \sum_{l=1}^d |x_l|^p \Big)^{1/p}$ for $0 < p < \infty$, and $\|x\|_\infty := \max_{l \in [d]} |x_l|$.

The associated (closed) $\ell_p$-ball is given by $B_p^d := \{x \in \mathbb{R}^d \mid \|x\|_p \leq 1\}$. In the special case of $p = 2$, we additionally have the Euclidean scalar product

    $\langle x, x' \rangle := x^\top x' = \sum_{l=1}^d x_l x'_l, \quad x, x' \in \mathbb{R}^d.$

The orthogonal complement of a subset $V \subseteq \mathbb{R}^d$ is then defined by $V^\perp := \{x \in \mathbb{R}^d \mid \langle x, x' \rangle = 0 \text{ for all } x' \in V\}$. Finally, we introduce the support of $x$ by $\mathrm{supp}(x) := \{l \in [d] \mid x_l \neq 0\}$ and the associated $\ell_0$-norm $\|x\|_0 := \#\,\mathrm{supp}(x)$, which simply counts the number of non-zero entries.

Footnote 1: In the following, the letter $d$ will always play the role of the vector space dimension.
Footnote 2: Note that $\|\cdot\|_0$ is not even a quasi-norm, but we have $\|x\|_0 = \lim_{p \to 0} \|x\|_p^p$ for all $x \in \mathbb{R}^d$.
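As a small worked example of these notions (the numbers are chosen purely for illustration): for $x = (3, 0, -4, 0) \in \mathbb{R}^4$ we have $\mathrm{supp}(x) = \{1, 3\}$, $\|x\|_0 = 2$, $\|x\|_1 = 3 + 4 = 7$, $\|x\|_2 = \sqrt{9 + 16} = 5$, and $\|x\|_\infty = 4$; in particular, $\|x\|_\infty \leq \|x\|_2 \leq \|x\|_1$, which reflects the general ordering of the $\ell_p$-norms.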

Function spaces: Let $\Omega \subseteq \mathbb{R}^d$ be an open set and $p \in [1, \infty) \cup \{\infty\}$. Similarly as above, we can define the $L_p$-norm for a measurable function $f : \Omega \to \mathbb{C}$ by

    $\|f\|_p := \Big( \int_\Omega |f(x)|^p \, dx \Big)^{1/p}$ for $1 \leq p < \infty$, and $\|f\|_\infty := \inf_{\Omega_0 \subseteq \Omega,\, \mathrm{vol}(\Omega_0) = 0} \ \sup_{x \in \Omega \setminus \Omega_0} |f(x)|$ for $p = \infty$,

as well as the associated Lebesgue function spaces $L_p(\Omega) := \{g : \Omega \to \mathbb{C} \text{ measurable} \mid \|g\|_p < \infty\}$. If $p = 2$, there again exists a scalar product

    $\langle f, g \rangle := \langle f, g \rangle_{L_2(\Omega)} := \int_\Omega f(x) \overline{g(x)} \, dx, \quad f, g \in L_2(\Omega).$

Furthermore, we will also consider the classical spaces of smooth functions

    $C^0(\Omega) := \{f : \Omega \to \mathbb{C} \mid f \text{ is continuous on } \Omega\}$,
    $C^r(\Omega) := \{f : \Omega \to \mathbb{C} \mid f \text{ is } r\text{-times continuously differentiable on } \Omega\}$ for $r \in \mathbb{N}$,

and $C^\infty(\Omega) := \bigcap_{r \in \mathbb{N}} C^r(\Omega)$. Finally, the support of a function $f : \Omega \to \mathbb{C}$ is defined by the closed set $\mathrm{supp}(f) := \overline{\{x \in \Omega \mid f(x) \neq 0\}} \subseteq \mathbb{R}^d$.

Less frequently-used notations will be introduced when they are needed. It is furthermore assumed that the reader is familiar with the fundamentals of linear algebra, analysis, and probability theory. An introduction to more advanced concepts, which are required for the following chapters, is given in Appendix A.


2 Biological Background of Proteomics Data

This chapter gives a brief overview of the medical background of SPA. Since the present work aims at the mathematical concepts behind fingerprint detection and disease classification, we will confine ourselves to a description of the basic facts about the underlying biological, chemical, and physical principles here. This should be sufficient to understand the purpose of our mathematical framework; for the interested reader, several references to the literature are given, providing much more background knowledge.

2.1 What is Proteomics?

Almost all biological processes in the human body are controlled by proteins and their mutual interactions. Especially the underlying pathological mechanisms of diseases are often related to specific molecular structures. The detection of such characteristic biomarkers may therefore allow for a better understanding of a certain disease, which could ultimately lead to an improvement of early diagnosis and clinical treatment options. The entire set of proteins that an individual can produce at a certain point of time is called a proteome; in other words, a proteome may be regarded as a snapshot of the current state of all proteins. While the number of genes in the human body is relatively small (about 20,000, cf. [47]), the number of possible molecular structures of proteins which these genes can encode is significantly higher. This enormous complexity has particularly stressed the need for novel methods to map the human proteome and can be seen as the origin of proteomics research, which is the large-scale study of proteins and their dynamic interactions (cf. [20]).

Besides proteomics, there are numerous branches in biology that also deal with so-called omics-data, e.g., genomics, metabolomics, or transcriptomics. The starting point for these research topics was the Human Genome Project, whose successful completion in 2001 led to the complete identification of the genome, which is, similar to a proteome, the collection of all genes in the human body ([51]). However, the genome rather forms a construction plan for an individual, whereas the actual cellular mechanisms (like those of cancer tumors) are mostly manifested at the level of protein activity ([44]). Inspired by the success of the Human Genome Project, scientists have started to set up similar collaborations in the field of proteomics, for instance, the Human Proteome

[Figure 2.1: schematic of MALDI-TOF MS: laser, sample, electric field, detector, and the resulting intensity (cts) vs. mass (m/z) plot.]

Figure 2.1: The workflow of MALDI-TOF MS: In a first step, the MALDI (matrix-assisted laser desorption/ionization), the molecules in the sample are ionized by a laser beam. The resulting ions are then accelerated within an electric field until they hit the detector. In this step, the time-of-flight (TOF) is measured, which is proportional to the square root of the mass-to-charge ratio (m/z). This finally results in a plot that counts the total number of ions (intensity) for every m/z-value. For a demonstrative animation of this concept, see also [81].

Organization. Although a proteome is much more difficult to understand than a genome, there has been remarkable progress in the last decade. One major goal of the proteomics community was (and still is) to develop a draft map of the proteome which describes how protein structures are encoded by genes. In fact, the very recent work of [47] announces that a great part (about 84%) of all genetically-encoded proteins has already been identified. This catalog of the human proteome was created by using high-resolution mass spectrometry. In the next section, we will introduce this popular technique of data acquisition and sketch how it allows us to capture a proteome in the form of a (digital) vector. For further information on omics-data, the reader is referred to [20] and [14], where the latter contains an extensive glossary with hundreds of additional references on this topic.

2.2 Data Acquisition via Mass Spectrometry

Very roughly speaking, the method of mass spectrometry (MS) generates a snapshot of the proteome (or at least a part of it) by plotting the abundance of molecular ions, contained in a certain sample, against their corresponding mass-to-charge ratio. As an appropriate biological sample, one could use, for instance, blood serum or other body fluids, such as urine (cf. [20]). There are several types of mass spectrometry, each one coming along with characteristic advantages and drawbacks depending on the specific application. The real-world data sets considered in this thesis (see Section 5.2) are acquired from blood serum using MALDI-TOF MS, and therefore, we merely focus on this particular technique here. The schematic workflow of a MALDI-TOF MS-machine is illustrated in Figure 2.1.

Footnote: More precisely, the structure of proteins encoded by 17,294 genes has been uncovered, where the total number of protein-encoding genes is 20,687.

Figure 2.2: Plot of a real-world mass spectrum (taken from the data set of Section 5.2). The magnification shows the typical shape of a single peak.

The one-dimensional output of an MS-machine is called a mass spectrum. As already depicted in Figure 2.1, the values on the x-axis refer to the mass-to-charge ratio (m/z-value), which carries the information about the mass, and the y-axis contains the corresponding ion counts (intensities). Thus, we can consider a mass spectrum as a function that relates a mass value m to the abundance of those molecules within the sample having an atomic mass equal to m. A typical example of a mass spectrum is shown in Figure 2.2. The most dominant features are obviously the peaks that occur at different positions in the spectrum. Each of these characteristic amplitudes is centered at a specific mass (m/z-value) and can be assigned to some protein type contained in the sample. Its intensity, i.e., the height of a peak, is proportional to the molecular concentration, which finally allows us to compare the (relative) abundance of different proteins. Thus, as already mentioned in Section 2.1, a mass spectrum can indeed be seen as a draft map of the proteome. Note that in this context, it is basically equivalent to speak of proteins or peaks. But in the following, the latter term will be used, since we aim for a mathematical description of a proteome.

The magnification of Figure 2.2 reveals that a peak is not simply a sharp spike but rather forms a narrow Gaussian density function.

Footnote 1: As a precise definition of the mass-to-charge ratio, the IUPAC Gold Book [60] states: "The abbreviation m/z is used to denote the dimensionless quantity formed by dividing the mass number of an ion by its charge number." Since these details from physics are only of minor importance for our mathematical framework, we will mostly refer to m/z-values as the mass of an ion or a protein molecule.
Footnote 2: When speaking of a peak, we do not only mean the maximal amplitude but the entire peak function, which has a certain spatial extent. The center (or position) of a peak then refers to the mass value at which the maximal amplitude is attained.
Footnote 3: Formally, this means that a peak is described by a map $x \mapsto I \exp(-(x - c)^2 / \beta^2)$, where $c$ is the (mass) center and the intensity $I > 0$ is relatively large compared to the width parameter $\beta > 0$.
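To make the Gaussian peak model of Footnote 3 concrete, the following short Python sketch (purely illustrative and not part of the SPA implementation; all peak positions, intensities, and widths are made-up values) generates a toy spectrum as a sum of Gaussian-shaped peaks plus baseline noise:

    import numpy as np

    d = 2000                                   # number of sampling points (real spectra: ~40,000)
    mass_axis = np.linspace(1000, 12000, d)    # hypothetical m/z grid

    # Hypothetical peaks: (center c, intensity I, width beta), cf. Footnote 3
    peaks = [(2500, 80.0, 15.0), (4700, 35.0, 20.0), (9100, 10.0, 25.0)]

    spectrum = np.zeros(d)
    for c, I, beta in peaks:
        spectrum += I * np.exp(-((mass_axis - c) ** 2) / beta ** 2)

    # Additive baseline noise, clipped so that intensities stay non-negative
    rng = np.random.default_rng(0)
    spectrum = np.maximum(spectrum + 0.5 * rng.standard_normal(d), 0.0)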

However, the assumption of a Gaussian shape is often only an approximation of the truth; one can easily imagine more complicated situations, for instance, when a protein occurs with different isotopes, leading to a peak asymmetry. In order to keep our analysis simple and non-technical, we will only consider the Gaussian case, but the arguments should easily generalize to more complex structures.

Unfortunately, some biologically relevant molecules, like hormones, occur only with a very low concentration in MS-samples. In fact, these minor peaks might get buried by the baseline, which is usually affected by strong noise (cf. [20]). Hence, reducing the noise and robustly detecting small-amplitude features are major challenges in biomarker identification. In practice, a mass spectrum forms a discrete object, that is, a vector $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ whose indices correspond to the mass and whose entries correspond to the intensities. The density of this sampling clearly depends on the resolution of the MS-machine; the real-world mass spectra which we will consider in Chapter 5 contain about 40,000 entries. Since this is significantly larger than the number of available samples (about 150), we are actually in the setting of high-dimensional data analysis. More details on MS and its applications (to proteomics) can be found in [20]. For a more extensive work on this topic, one may also consider [39], which provides an overview of proteomics sample preparation as well as modern methods for data acquisition.

2.3 Detection of Disease Fingerprints

Let us now consider how MS-data can help to classify samples from diseased and healthy patients and to detect biomarkers that are related to the corresponding disease. In particular, we shall discuss some difficulties which may arise in the situation of real-world data. Note that the key aspects of this section are still presented in a rather informal manner, whereas a more rigorous treatment of the problem is postponed to the next chapters.

As already emphasized in the previous sections, many disease types can be identified by the absence or occurrence of certain proteins in the proteome. To illustrate this fact, let us consider the toy model of Figure 2.3. Here, we compare two blood samples, one from a healthy individual and another from an individual suffering from a certain disease. To keep the argumentation as simple as possible, there are only four different protein types involved (green, red, blue, orange). The concentrations of the red and blue proteins are almost equal for both samples; the green one is more strongly represented in the healthy proteome, whereas the orange one does not occur at all there. Thus, the (arithmetic) difference between both spectra yields a disease fingerprint, indicating which positions in the spectra are apparently correlated with the disease (third step of Figure 2.3). Note that we are merely speaking of correlation here, because a causal dependence does not necessarily have to exist. For example, a peak could become extremely relevant when one group of patients is treated with a specific drug which does not affect the origin of the disease.

Footnote: Isotopes are atoms whose nuclei have the same number of protons but differ in their neutron count.
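Continuing the toy sketch from Section 2.2 (again purely illustrative and not the method proposed later in this thesis), the "comparing" step of Figure 2.3 can be mimicked by subtracting the group means and thresholding the result; all numbers, including the threshold, are made up:

    import numpy as np

    rng = np.random.default_rng(1)
    d = 2000
    # Toy data: rows are spectra; in the diseased group an extra peak is added
    # around index 1500 (an arbitrary illustrative position).
    base = 10.0 * np.exp(-((np.arange(d) - 600) ** 2) / 50.0 ** 2)
    X_healthy = base + 0.5 * rng.standard_normal((100, d))
    X_diseased = (base + 3.0 * np.exp(-((np.arange(d) - 1500) ** 2) / 30.0 ** 2)
                  + 0.5 * rng.standard_normal((100, d)))

    # "Comparing" step of Figure 2.3: difference of the group means
    naive_fingerprint = X_diseased.mean(axis=0) - X_healthy.mean(axis=0)
    candidate_positions = np.flatnonzero(np.abs(naive_fingerprint) > 1.0)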

[Figure 2.3: workflow from blood samples (healthy and diseased individual) via MS to mass spectra (intensity against mass) and the feature-selection step yielding a disease fingerprint.]

Figure 2.3: A toy example of MS-data from a healthy (top) and diseased (bottom) patient, based on an illustration in [21]. In practice, the number of samples per group is of course greater than one and the situation becomes more complicated.

The exercise of finding the significant differences between two sample groups is a typical classification problem. In fact, we could easily use the fingerprint set of Figure 2.3 to classify spectra of unknown patients by determining the abundance of the green or orange protein. In general machine learning theory, one usually takes a more geometric perspective and considers each spectrum as a point in a high-dimensional Euclidean vector space. A disease fingerprint is then also represented by a vector $\omega = (\omega_1, \dots, \omega_d)$, acting as the normal of a hyperplane that allows for an appropriate separation of both groups. This important idea actually forms the basis of our classification model for SPA and will be extensively studied in Chapters 3 and 4. On the other hand, the interpretability of a disease fingerprint also plays a central role in medical research. Ideally, every non-zero entry of a fingerprint vector $\omega$ should correspond to a molecule that is directly correlated with a certain type of disease ([21]).

To emphasize the importance of the relationship between separability and interpretability, we shall consider another, more sophisticated example, which is shown in Figure 2.4. The magnification already suggests the relevant biomarkers, but the situation is not as clear as in Figure 2.3 anymore: Especially the shape of the large peaks is disturbed (on average), so that one might suspect a good separability at these positions as well. Furthermore, it is a bit inappropriate to speak of peaks in this context because the peak centers might slightly deviate from spectrum to spectrum, or sometimes, a peak is even completely buried by the noise. For this reason, we will usually prefer to speak of a feature, which simply refers to some position in a spectrum (i.e., an index of a data vector $x = (x_1, \dots, x_d)$). On the other hand, the word peak really means a Gaussian-shaped function contained in an individual spectrum.

The example of Figure 2.4 shows that exclusively striving for the best geometric separation entails the risk of selecting features which are only affected by noise. In this way, we could indeed obtain a perfect classification result for the given training samples, but an unknown spectrum could easily be misclassified due to the randomness of the noise.

Footnote: A formal definition of hyperplanes and the separation of sets is given in Appendix A.4.

Figure 2.4: Illustration of simulated data sets (taken from SG-NX0.1 in [21]). The underlying mass spectra were generated as follows: A sample from the healthy group (red) is obtained by adding low-amplitude noise to a fixed base spectrum. The diseased samples (blue) are generated similarly, except that three additional peaks (marked by the green bullets in the right magnification) are artificially inserted. These are exactly the biomarkers that we would like to find in the end. The actual plots were created by taking the average spectra for both groups separately (100 samples per group). The left magnification shows some large, noisy peaks and the right one the three biomarkers, forming the desired disease fingerprint.

In other words, a rather complex classifier (with many non-zero entries) usually leads to a higher generalization error. This trade-off between the training error and model complexity is closely related to the well-known principle of Occam's razor, which states that the simplest explanation for the observations is probably the best one (see also Appendix A.6). As a consequence, we shall especially aim for a sparse fingerprint vector; this goal is particularly consistent with the fact that in many practical scenarios, only very few molecular structures are significantly correlated with a specific disease. Thus, we can summarize our main objective: We aim for a very small fingerprint set that is interpretable (features corresponding to real molecules) and highly correlated with the health status of the samples (implying a good classification performance).

Before formalizing the above setting, we should first discuss some difficulties that typically arise in feature selection from real-world data. In fact, the following three issues were the main motivation for extending the original SPA-approach of [21] by some additional steps:

Footnote: Here, sparse means that the number of non-zero entries is much smaller than the dimension of the ambient space.

(C1) Reproducibility of the data: The training samples are usually acquired by the same MS-machine, which has specific properties and calibration. In practice, it can be very difficult to reproduce the same measuring conditions with other mass spectrometers. This is a particular reason why it is so important to extract biomarkers from a disease classifier: such interpretable information is then independent of the actual measurement process. For more details on the reproducibility of MS-measurements, see [3, 84] for example.

(C2) Small shifts of the peak positions: When sliding through the single mass spectra of a training data set, one often notices small deviations of the peak centers. The corruption of the average peaks in Figure 2.4, for instance, was produced by small random shifts of the entire spectra. It was already pointed out above that such a kind of noise might feign the relevance of unimportant features. One possible reason for this phenomenon in practice could be the workflow of a clinical study: Usually, the diseased patients are chosen carefully in advance and the corresponding mass spectra are then generated consecutively. On the other hand, the potential reservoir of healthy test subjects is much larger. Hence, some time might elapse between the examinations, leading to slightly different calibrations of the mass spectrometer used. Later in this thesis, we will see that the resulting peak shifts can be eliminated by applying the scattering transform.

(C3) Detection of very small features: One typical difficulty of feature selection is that the amplitude of relevant biomarkers could be very small compared to some other, uncorrelated features. Therefore, a tiny feature might sometimes contribute only little to the separability of the samples and could eventually be missed by the selection process. Once more, the example of Figure 2.4 serves as an illustration of this problem. A common approach is to standardize all features (separately) in order to make them equally important. But such a strategy would not take into account the characteristic structure of a mass spectrum, and the influence of the baseline noise would become even more dramatic. Hence, SPA particularly incorporates a weighted interpolation which allows us to control the standardization procedure.

Footnote: Here, standardizing means that every feature variable is centered and its standard deviation is normalized to 1.


3 Toward Sparse Proteomics Analysis (SPA)

The previous chapter was written in a rather informal style in order to point out the main difficulties and purposes of feature selection from MS-data. Now, we transfer this biological background to a mathematical model and present the details of the SPA-framework. This chapter is substantially based on the work of [21], where SPA was originally introduced. But although we keep the name Sparse Proteomics Analysis, the original version of SPA is vastly extended here; this especially concerns the steps of SPA-Scattering and SPA-Standardize in Section 3.2.

3.1 Transfer to a Mathematical Setting

It was already pointed out in the introduction of this work that our major task is to learn a disease fingerprint by using a set of sample data (mass spectra) where the health status is known in advance; from the viewpoint of machine learning, this means that we are in the setting of supervised learning. Thus, let us assume that we are given $n$ samples which are always indexed by $k \in [n]$. The group labels (or disease labels) are denoted by $y_k \in \{-1, +1\}$, i.e., $y_k = +1$ if the $k$-th patient is healthy and $y_k = -1$ if the patient is diseased. In particular, we will also speak of the healthy group $G_+ := \{k \in [n] \mid y_k = +1\}$ and the diseased group $G_- := \{k \in [n] \mid y_k = -1\}$. The mass spectrum of the $k$-th patient is expressed by a vector $x_k = (x_{k,1}, \dots, x_{k,d}) \in \mathbb{R}^d$. More precisely, the entry $x_{k,l}$ corresponds to the intensity of the $k$-th spectrum at the position $l \in [d]$, which is proportional to a certain mass (m/z-value). But we shall not care about this physical correspondence anymore, since it is meaningless for the mathematical theory. Instead, we simply speak of a feature (or feature variable) when referring to an index $l \in [d]$. To summarize, the input of SPA is a set of training pairs $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, +1\}$.

In some situations, it is more convenient to use a matrix notation for the sample data. Hence, we also introduce the matrix $X := [x_1 \ \dots \ x_n]^\top \in \mathbb{R}^{n,d}$,

Footnote 1: A brief introduction to the problem of supervised machine learning is given in Appendix A.6.
Footnote 2: In particular, we will not always explicitly mention that the index $k$ is an element of $[n]$ and that it refers to a certain sample.
Footnote 3: Similarly to the index of the samples, we will always address the entries of $x_k$ by the index variable $l \in [d]$.

which contains all mass spectra as rows, as well as the label vector $y := (y_1, \dots, y_n) \in \{-1, +1\}^n$. Table 3.1 provides a compact overview of all notations that we have just introduced.

Table 3.1: An overview of the notation that we use for SPA.
  Number of samples (patients):  $n$; the corresponding index is $k \in [n]$
  Health status of the $k$-th patient:  $y_k \in \{-1, +1\}$  (vector notation: $y = (y_1, \dots, y_n) \in \{-1, +1\}^n$)
  Dimension of a single mass spectrum:  $d$; the corresponding index is $l \in [d]$
  Mass spectrum of the $k$-th patient:  $x_k = (x_{k,1}, \dots, x_{k,d}) \in \mathbb{R}^d$  (matrix notation: $X := [x_1 \ \dots \ x_n]^\top \in \mathbb{R}^{n,d}$)
  Feature vector (fingerprint):  $\omega = (\omega_1, \dots, \omega_d) \in \mathbb{R}^d$

The actual selection of features is incorporated by a feature vector $\omega = (\omega_1, \dots, \omega_d) \in \mathbb{R}^d$. Ideally, the support of $\omega$ shall describe an appropriate disease fingerprint and the corresponding non-zero entries indicate the relevance of the single features. In machine learning theory, $\omega$ plays the role of a linear classifier, i.e., the associated decision hyperplane (with some intercept $\omega_0 \in \mathbb{R}$)

    $H(\omega, \omega_0) = \{x \in \mathbb{R}^d \mid \langle x, \omega \rangle + \omega_0 = 0\}$

should appropriately separate the point sets $\{x_k\}_{k \in G_+}$ and $\{x_k\}_{k \in G_-}$. This task can be mathematically expressed by asking for

    $y_k = \mathrm{sign}(\langle x_k, \omega \rangle + \omega_0)$   (3.1)

for (almost) all $k \in [n]$. Geometrically, the term $\langle x_k, \omega \rangle + \omega_0$ measures (up to a constant) the signed distance of $x_k$ to the hyperplane $H(\omega, \omega_0)$, and by taking the sign, we can determine on which side of $H(\omega, \omega_0)$ the point $x_k$ lies; see Figure 3.1 for an illustration. According to the demand for a small disease fingerprint, the major purpose of feature selection in SPA is therefore to learn a sparse feature vector $\omega$ which satisfies the model equation (3.1) for many $k \in [n]$. We will return to this point when discussing the step of SPA-Select.

Remark 3.1
(1) The role of the intercept $\omega_0$: Later on, we will perform a centering, i.e., we subtract the average spectrum from each data vector:

    $x_k \mapsto x_k - \frac{1}{n} \sum_{k'=1}^{n} x_{k'}, \quad k \in [n],$

see SPA-Standardize for details. Such a standardization often allows us to determine the intercept $\omega_0$ more easily: In [40, Exercise 3.5] for example, it is shown that $\omega_0$ can be

Footnote: More details on hyperplanes and the separation of sets are given in Appendix A.4.

Figure 3.1: Illustration of a separating hyperplane in two dimensions ($d = 2$). A sample $(x_k, y_k)$ is said to lie on the correct side of the hyperplane if (3.1) is satisfied, or equivalently, $y_k \langle x_k, \omega \rangle > 0$. Note that there might be some outliers on the wrong side of the plane.

estimated by $\bar{y} := \frac{1}{n} \sum_{k=1}^{n} y_k$, which is independent of $X$. We will usually assume that the groups have equal size, i.e., $\#G_+ = \#G_-$, implying that $\omega_0 \approx \bar{y} = 0$. Therefore, $\omega_0$ is mostly omitted in our analysis by setting it to 0.

(2) Applicability of the model: Although the feature selection process is inspired by (3.1), it is not clear whether such a simple model is really able to tackle all practical situations. The exercise of finding an appropriate classifier $\omega$ for (3.1) is a typical example of a backward model (or discriminative model). In general, learning a backward model means that we would like to find a good explanation for the observation $y_k$, given a datum $x_k$. Thus, $\omega$ can be seen as a filter which extracts the relevant features from $x_k$ in order to predict the health status $y_k$. However, such a model does not describe how the training pairs $(x_1, y_1), \dots, (x_n, y_n)$ are actually generated, and probably the sample data are drawn from a much more complicated probability distribution. This poses the risk of misinterpreting a feature vector. We shall discuss this important issue in Chapter 4 again, where we also explain why it still makes sense to assume a simplified model like (3.1).
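As an illustration of the classification rule (3.1), the following minimal Python sketch checks how many training samples satisfy the model equation after centering, with the intercept estimated by the mean label; the toy data and the candidate vector omega are arbitrary and not produced by SPA:

    import numpy as np

    def classification_accuracy(X, y, omega):
        """Fraction of samples with y_k = sign(<x_k, omega> + omega_0),
        where the data is centered and omega_0 is estimated by the mean label."""
        X_centered = X - X.mean(axis=0)          # subtract the average spectrum
        omega_0 = y.mean()                       # intercept estimate (= 0 for balanced groups)
        predictions = np.sign(X_centered @ omega + omega_0)
        return np.mean(predictions == y)

    # Toy usage with made-up data: 10 samples of dimension 5, labels driven by feature 2 only
    rng = np.random.default_rng(0)
    X = rng.standard_normal((10, 5))
    y = np.sign(X[:, 2])
    omega = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # sparse classifier supported on feature 2
    print(classification_accuracy(X, y, omega))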

3.2 SPA at a Glance

We are now prepared to provide the details of SPA. The following algorithm gives an overview of all performed steps; the explicit computations are deferred to the subsequent subsections.

Algorithm 3.1: Sparse Proteomics Analysis (SPA)
Input: MS-data samples with labels, $\{(x_k, y_k)\}_{k \in [n]} \in (\mathbb{R}^d \times \{-1, +1\})^n$
Output: Feature vector $\omega \in \mathbb{R}^d$
Preprocessing
  1 SPA-Normalize: Normalize the spectra by scalar factors to make them comparable.
  2 SPA-Scattering: Compute scattering coefficients to make the data robust against noise.
  3 SPA-Standardize: Center the data and perform a weighted standardization.
Sparse Feature Selection
  4 SPA-Select: Perform a feature selection method on the preprocessed data to obtain a sparse feature vector $\omega$.
Postprocessing
  5 SPA-Sparsify: Detect the connected components of $\omega$ to sparsify even further.
  6 SPA-Recover (optional): Correct the subsampling rate of SPA-Scattering to obtain a feature vector in the raw-data space.
    SPA-Reduce (optional): Reduce the dimension of the data by projecting onto the space of features that were selected by $\omega$.

In order to keep the following descriptions short and clear, the explicit computations are summarized in algorithmic frames. Moreover, there will be some intuitive justification given for each individual step, which should essentially be enough for a basic understanding of SPA. A deeper (theoretical and numerical) analysis is then postponed to Chapters 4 and 5. Finally, several aspects of the implementation of SPA are discussed in Appendix B.
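To fix the data flow of Algorithm 3.1, here is a minimal Python skeleton of the pipeline; this is a structural sketch only, not the implementation used in the thesis (which is described in Appendix B), and the individual steps are left as placeholders to be filled in with the computations of the following subsections:

    import numpy as np

    def spa(X, y, normalize, scattering, standardize, select, sparsify):
        """Structural sketch of Algorithm 3.1: X has shape (n, d), y entries in {-1, +1}.
        Each argument after y is a callable implementing the corresponding SPA step."""
        X_norm = normalize(X)                # SPA-Normalize: rescale each spectrum
        X_scat = scattering(X_norm)          # SPA-Scattering: robust representation
        X_std = standardize(X_scat, y)       # SPA-Standardize: weighted standardization
        omega = select(X_std, y)             # SPA-Select: sparse feature selection
        omega = sparsify(omega)              # SPA-Sparsify: keep connected components
        return omega                         # optional: SPA-Recover / SPA-Reduce afterwards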

3.2.1 SPA-Normalize: Rescaling the Data

Recalling the detection process of a MALDI-TOF MS-machine from Section 2.2, we observe that the absolute amount of molecules which hit the detector depends on the total measurement time. Thus, the peak amplitudes may not be directly comparable for different spectra, for instance, when the measurement time was not the same for all samples. Therefore, to achieve a global comparability, an appropriate rescaling of the data is sometimes necessary. This leads to the following step:

Algorithm 3.2: SPA-Normalize
Input: Raw mass spectra $X = [x_1 \ \dots \ x_n]^\top \in \mathbb{R}^{n,d}$; scaling factors $\lambda_1, \dots, \lambda_n > 0$
Output: Normalized spectra $X^{\mathrm{norm}} = [x_1^{\mathrm{norm}} \ \dots \ x_n^{\mathrm{norm}}]^\top \in \mathbb{R}^{n,d}$
Compute: $x_k^{\mathrm{norm}} := \lambda_k \, x_k$ for all $k \in [n]$.

The choice of the scaling factors $\lambda_1, \dots, \lambda_n > 0$ highly depends on the way in which the raw data were acquired. As we have already argued for the case of MALDI-TOF MS above, it is quite natural to enforce the same total ion count for every spectrum. Mathematically, this means nothing else but taking the sum of all intensities, i.e., the $\ell_1$-norm of a spectrum. Thus, we choose

    $\lambda_k = \frac{1}{\|x_k\|_1} = \frac{1}{\sum_{l=1}^d x_{k,l}}, \quad k \in [n].$

Figure 3.2: (a) Example of a raw mass spectrum, where the baseline is indicated by the red line. (b) Corresponding spectrum after top-hat filtering.

In the literature (cf. [70]), this simple $\ell_1$-normalization is usually known as rescaling by the average area under the curve (AUC) or total ion current (TIC).

Remark 3.2 (Baseline Correction) A raw mass spectrum is typically affected by a non-trivial baseline, such as in Figure 3.2(a). Since these baselines may significantly differ from spectrum to spectrum, it is indispensable to perform some kind of correction. For instance, one could detect the baseline of each spectrum by applying the so-called top-hat operator ([21, 70]), which is based on the concept of morphological filters ([72]). This estimated baseline is then subtracted from the corresponding spectrum (Figure 3.2(b)), so that we obtain a corrected sample set in the end. However, the (real-world) data considered in Chapter 5 was already treated by an appropriate baseline correction, and consequently, we do not have to pay attention to this challenge anymore.
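For illustration, the two preprocessing ideas above can be sketched in a few lines of Python; this is only one possible realization (the structuring-element size of the top-hat filter is a made-up value, and SciPy's morphological white top-hat is used as a stand-in for the operator of [21, 70]):

    import numpy as np
    from scipy.ndimage import white_tophat

    def tic_normalize(X):
        """SPA-Normalize with lambda_k = 1 / ||x_k||_1 (total ion current rescaling)."""
        return X / np.sum(X, axis=1, keepdims=True)

    def tophat_baseline_correction(X, window=200):
        """Morphological top-hat filtering applied to each spectrum (row of X);
        the window size is an arbitrary illustrative choice."""
        return np.array([white_tophat(x, size=window) for x in X])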

3.2.2 SPA-Scattering: Robustification of the Data

As already stated in challenge (C2) in Section 2.3, MS-data is usually not only affected by systematic noise on the baseline but also by some perturbation on the mass axis, i.e., the actual peak positions might slightly deviate from spectrum to spectrum. To achieve a robust detection of fingerprints, it is therefore advisable to first transform the data in an appropriate way. The approach that we will take in this work is based on the so-called scattering transform. This framework was introduced by S. Mallat in [57] and makes use of the complex wavelet transform. Although originally designed for applications in audio- and image-processing (cf. [5, 12]), the underlying ideas of the scattering transform surprisingly apply to MS-data as well. In order to keep this section clear, a detailed introduction to the required wavelet theory is postponed to Appendix A.2.

To get some impression of what the step of SPA-Scattering is actually doing, let us first assume that we are able to measure with infinitely high precision, i.e., the acquired spectrum is a (compactly supported and bounded) function $f : \mathbb{R} \to \mathbb{R}_{\geq 0}$. Furthermore, let $\psi \in L_1(\mathbb{R}) \cap L_2(\mathbb{R})$ be some (complex-valued) wavelet filter and $\phi \in L_1(\mathbb{R}) \cap L_2(\mathbb{R})$ an associated low-pass filter. Given a window scale $J \in \mathbb{Z}$ and a wavelet scale $j \in \mathbb{Z}$ with $j \leq J$, the scattering coefficients of $f$, with respect to $j$ and $J$, are defined by

    $[S_J[j]f](t) := \big( |f * \psi_j| * \phi_J \big)(t) = \int_{\mathbb{R}} \Big| \int_{\mathbb{R}} f(u) \, 2^{-j} \psi(2^{-j}(v - u)) \, du \Big| \, 2^{-J} \phi(2^{-J}(t - v)) \, dv, \quad t \in \mathbb{R}.$   (3.2)

This definition basically consists of two stages: First, by $f * \psi_j$, the (complex) wavelet coefficients of $f$ at scale $j$ are computed. For MS-input data, a good choice of $j$ highly depends on the width of the peaks in $f$. In fact, our major goal is to extract the characteristic peak structure from $f$; but at the same time, we would like to filter out finer details which are supported at much higher frequencies, e.g., the baseline noise. In this way, a mass spectrum is smoothed, whereas its essential structural features are preserved. The purpose of the second step in (3.2), i.e., applying the complex modulus and a convolution with $\phi_J$, is rather to produce a certain invariance against (small) peak shifts. Indeed, by such a windowed averaging, the resulting function becomes more robust against (local) translations. When choosing $J$ increasingly larger, the averaging effect becomes stronger, and in the limit $J \to \infty$, we would basically compute the $L_1$-norm of $f * \psi_j$, which is completely translation invariant. But obviously, important information of the spectrum might get lost for very large values of $J$, and therefore, the choice of the window scale again requires particular attention. Finally, it should be emphasized that the role of the complex modulus in (3.2) is crucial: If we omitted this step, the scattering coefficients would be almost zero due to the vanishing-moment property of $\psi$. This important observation will also be discussed in Section 4.2.

Footnote: Since $f(t)$ physically represents an intensity for every $t \in \mathbb{R}$, we can assume that its value is non-negative.

For practical applications, the computation of (3.2) clearly needs to be discretized (cf. [10]). This leads to the step of SPA-Scattering, which looks slightly more technical:

Algorithm 3.3: SPA-Scattering
Input: Normalized spectra $X^{\mathrm{norm}} = [x_1^{\mathrm{norm}} \ \dots \ x_n^{\mathrm{norm}}]^\top \in \mathbb{R}^{n,d}$; appropriate wavelet filter $\psi \in L_2(\mathbb{R})$ and low-pass filter $\phi \in L_2(\mathbb{R})$; window scale $J \in \mathbb{N}_0$ and wavelet scale $j \in \mathbb{N}_0$ with $j \leq J$
Output: Transformed spectra $X^{\mathrm{scat}} = [x_1^{\mathrm{scat}} \ \dots \ x_n^{\mathrm{scat}}]^\top \in \mathbb{R}^{n,d'}$, where $d' = d / 2^J$
Compute:
  1 For every $k \in [n]$, compute the wavelet coefficients by discrete convolution:

      $x_{k,l+1}^{\mathrm{wav}} := [x_k^{\mathrm{norm}} * \psi_j](l) := \sum_{l'=0}^{d-1} x_{k,l'+1}^{\mathrm{norm}} \, \psi_j(l - l') \in \mathbb{C}, \quad l = 0, \dots, d - 1.$

  2 Apply windowed averaging to the complex modulus of the wavelet coefficients and subsample at a rate of $2^J$:

      $x_{k,l+1}^{\mathrm{scat}} := \big[\,|x_k^{\mathrm{wav}}| * \phi_J\big](2^J l) := \sum_{l'=0}^{d-1} |x_{k,l'+1}^{\mathrm{wav}}| \, \phi_J(2^J l - l'), \quad l = 0, \dots, d/2^J - 1.$   (3.3)

Up to now, it is not clear what "appropriate wavelet filter" actually means. Rather technical admissibility conditions are presented in [12] and [57], but we omit such a definition here because the only choice of $\psi$ considered in this thesis is the Morlet wavelet, which is given by

    $\psi^{\mathrm{Mor}}(t) := \pi^{-1/4} \big( e^{i\xi_0 t} - e^{-\xi_0^2/2} \big) \exp(-t^2/2), \quad t \in \mathbb{R},$   (3.4)

where $\xi_0 > 0$ is fixed. In practice, one usually chooses $\xi_0 = 5$, making the term $e^{-\xi_0^2/2}$ in (3.4) negligible (cf. [23]). Thus, $\psi^{\mathrm{Mor}}$ is indeed very close to the classical complex Gabor wavelet (see (A.6) in Appendix A.2.2). Figure 3.3 shows the real and imaginary part of a Morlet wavelet. The associated low-pass filter is usually chosen to be a Gaussian window of the form

    $\phi^{\mathrm{Gauss}}(t) := \frac{1}{\sqrt{2\pi\beta_0^2}} \exp\big(-t^2/(2\beta_0^2)\big), \quad t \in \mathbb{R},$   (3.5)

with $\beta_0 > 0$ ([12]). The resulting data vectors $x_1^{\mathrm{scat}}, \dots, x_n^{\mathrm{scat}} \in \mathbb{R}^{d'}$ from SPA-Scattering have a lower dimension, namely $d' = d/2^J \leq d$. This is due to the downsampling at the critical sample rate of $2^J$, which takes place in (3.3). In the context of machine learning, SPA-Scattering can also be seen as a feature map which transforms points of $\mathbb{R}^d$ into the feature space of scattering coefficients, and this is precisely the space in which we will

work from now on. However, one should be aware of the fact that this mapping involves a dimension reduction, rather than a lifting into a higher- or even infinite-dimensional space. Figure 3.4 already indicates that working in the scattering space is qualitatively similar to working with the raw data. Indeed, the characteristic structure is preserved, and therefore, it still makes sense to speak of spectra and peaks. This observation will be discussed again in Section 4.2, where we will introduce the general concept of the scattering transform and study its impact on MS-data in greater detail.

Figure 3.3: Plots of the (a) real and (b) imaginary part of a Morlet wavelet defined by (3.4) with $\xi_0 = 5$.

Remark 3.3
(1) Alignment of mass spectra: The above approach is essentially based on a smoothing by local integration. Therefore, the resulting coefficients actually contain information about a whole range of features. This can indeed help us to solve the problem of peak shifts, because in this way, we are able to compare features of different spectra which are not exactly at the same position. On the other hand, one could follow an alternative approach which tries to fix the deviation on the mass axis directly: The basic idea here is to translate every spectrum (locally) so that all corresponding peak centers are finally at the same position. This ansatz is usually known as the alignment of peaks in the literature. There are specific scenarios in practice where such a strategy could indeed outperform the scattering transform (or similar averaging methods). For instance, the spectra could be affected only by constant (but unknown) translations; in such a case, it would be relatively easy to find these global shifts and to realign the data appropriately. For some general approaches, one may consider [21, 70, 73]. However, this kind of preprocessing strongly relies on the fact that the given data consist of mass spectra; usually, this even involves an explicit peak detection. Hence, a rigorous analysis seems to be difficult and the corresponding algorithms will mostly fail to work for more general types of data, such as audio or image signals. But we could of course make

a compromise and combine both peak alignment and the scattering transform, depending on how much is known about the origin of the noise.

(2) Computational issues: For the implementation of SPA-Scattering, the MATLAB toolbox ScatNet was used. The underlying code makes use of various computational shortcuts, and the transforms of SPA-Scattering are actually performed by a circular convolution using the fast Fourier transform (FFT). Furthermore, the definitions of the required parameters might slightly differ from the ones above; this especially concerns the Morlet wavelet filter of (3.4), which is taken from [23]. But for theoretical questions, this obviously makes no difference.

Figure 3.4: Example of (a) raw data (intensity (cts) against mass (m/z)) and (b) its corresponding scattering coefficients with $j = J = 5$. Note that the axes of the right plot do not have any labels, since there is no physical meaning.

Footnote: Version 0.2; see [4] for a documentation.
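The following Python sketch mirrors the structure of Algorithm 3.3 for a single spectrum; it is an illustration only, not the ScatNet-based implementation used in the thesis, and the filter length, the value of beta_0, and the FFT-based circular convolution are illustrative choices:

    import numpy as np

    def morlet(t, xi0=5.0):
        """Morlet wavelet, cf. (3.4)."""
        return np.pi ** (-0.25) * (np.exp(1j * xi0 * t) - np.exp(-xi0 ** 2 / 2)) * np.exp(-t ** 2 / 2)

    def gauss(t, beta0=1.0):
        """Gaussian low-pass window, cf. (3.5)."""
        return np.exp(-t ** 2 / (2 * beta0 ** 2)) / np.sqrt(2 * np.pi * beta0 ** 2)

    def scattering_1st_order(x, j, J, filter_len=1024):
        """First-order scattering coefficients |x * psi_j| * phi_J, subsampled at rate 2**J.
        Circular convolutions are computed via the FFT."""
        d = len(x)
        L = min(filter_len, d)
        t = np.arange(-L // 2, L // 2)
        # Dilated filters psi_j(t) = 2**(-j) psi(2**(-j) t) and phi_J(t) = 2**(-J) phi(2**(-J) t)
        psi_j = 2.0 ** (-j) * morlet(2.0 ** (-j) * t)
        phi_J = 2.0 ** (-J) * gauss(2.0 ** (-J) * t)

        def circ_conv(signal, kernel):
            k = np.zeros(d, dtype=complex)
            k[:len(kernel)] = kernel
            k = np.roll(k, -(len(kernel) // 2))   # center the filter at index 0 (zero phase)
            return np.fft.ifft(np.fft.fft(signal) * np.fft.fft(k))

        wav = circ_conv(x, psi_j)                 # step 1: complex wavelet coefficients
        smooth = circ_conv(np.abs(wav), phi_J)    # step 2: modulus + windowed averaging
        return np.real(smooth)[:: 2 ** J]         # subsample at rate 2**J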

3.2.3 SPA-Standardize: Standardization by Weighted Interpolation

Basically, we are now prepared to perform a feature selection from $X^{\mathrm{scat}}$. However, as already pointed out in (C3), there exists a certain danger that some important biomarkers could be missed due to their small amplitude. The following algorithm proposes a weighted interpolation between $X^{\mathrm{scat}}$ and its standardized version:

Algorithm 3.4: SPA-Standardize
Input: Transformed data $X^{\mathrm{scat}} \in \mathbb{R}^{n,d'}$ and label vector $y \in \{-1, +1\}^n$; interpolation parameter $\alpha \in [0, 1]$; correlation parameter $c_{\exp} \geq 0$; scaling parameter $\lambda > 0$
Output: Standardized data $X^{\mathrm{std}} = [x_1^{\mathrm{std}} \ \dots \ x_n^{\mathrm{std}}]^\top \in \mathbb{R}^{n,d'}$
Compute:
  1 Compute the average spectrum $\bar{x} := \frac{1}{n} \sum_{k=1}^n x_k^{\mathrm{scat}} \in \mathbb{R}^{d'}$ and the mean label $\bar{y} := \frac{1}{n} \sum_{k=1}^n y_k \in \mathbb{R}$.
  2 Center the data (w.r.t. every feature): $\tilde{x}_k := x_k^{\mathrm{scat}} - \bar{x} \in \mathbb{R}^{d'}$, $k \in [n]$.
  3 Compute the standard deviation of each feature,

      $\sigma_l := \Big( \frac{1}{n} \sum_{k=1}^n \tilde{x}_{k,l}^2 \Big)^{1/2}, \quad l \in [d'],$

  4 and its empirical correlation with $(y_1, \dots, y_n)$,

      $\rho_l := \frac{\frac{1}{n} \sum_{k=1}^n \tilde{x}_{k,l} (y_k - \bar{y})}{\sigma_l \big( \frac{1}{n} \sum_{k=1}^n (y_k - \bar{y})^2 \big)^{1/2}}, \quad l \in [d'].$

  5 Standardize the data by a weighted interpolation:

      $x_{k,l}^{\mathrm{std}} := \alpha_l \, \lambda \, \tilde{x}_{k,l} + (1 - \alpha_l) \, \frac{\tilde{x}_{k,l}}{\sigma_l}, \quad k \in [n], \ l \in [d'],$

    where $\alpha_l := \alpha^{\, c_{\exp} (1 - \rho_l)}$.

All computations of SPA-Standardize are performed feature-wise, i.e., the algorithm operates on each column of $X^{\mathrm{scat}}$ separately. The first two lines are responsible for centering the data such that every feature variable has an empirical mean value of 0.

Footnote 1: The following computations are sometimes already considered as a part of the feature selection step. But in order to keep SPA-Select as generic as possible, SPA-Standardize is still a part of the preprocessing.
Footnote 2: This algorithm involves several (empirical) quantities from probability theory and statistics, which are summarized in Definition A.8.
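A compact Python rendering of Algorithm 3.4 might look as follows; this is a sketch under the assumption that the correlation is computed feature-wise as the empirical Pearson correlation, and the parameter defaults as well as the small eps guard are arbitrary choices:

    import numpy as np

    def spa_standardize(X_scat, y, alpha=0.5, c_exp=1.0, lam=1.0, eps=1e-12):
        """Weighted interpolation between the centered data and its standardization.
        X_scat: (n, d') array of scattering coefficients, y: labels in {-1, +1}."""
        X_c = X_scat - X_scat.mean(axis=0)                 # steps 1-2: center each feature
        sigma = np.sqrt(np.mean(X_c ** 2, axis=0)) + eps   # step 3: feature-wise std deviation
        y_c = y - y.mean()
        rho = np.mean(X_c * y_c[:, None], axis=0) / (sigma * np.sqrt(np.mean(y_c ** 2)) + eps)
        alpha_l = alpha ** (c_exp * (1.0 - rho))           # step 5: interpolation weights
        return alpha_l * lam * X_c + (1.0 - alpha_l) * X_c / sigma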

34 28 3 Toward Sparse Proteomics Analysis (SPA) already mentioned in Remark 3.1(1) that this step is necessary to avoid a non-trivial intercept. Afterwards, in Line 5, the actual weighting of the features is performed. At first, let us assume that α = 1, implying that α l = 1. Then we have x std k,l = λ x k,l, which means that SPA-Standardize simply centers the data. If α = 0, we obtain x std k,l = x k,l σ l. In this case, the data is completely standardized, i.e., every feature has an empirical standard deviation of 1. 1 However, both limiting cases for α might cause drawbacks in some situations: When solely restricting to a centering of the data, we might overrate high-amplitude features, whereas a full standardization does not respect the characteristic structure of the data. In fact, a standardization could make the feature selection unstable, since there are usually only a few samples available, which are additionally affected by strong noise; and moreover, even the most discriminative features are often not completely correlated with the disease label. Therefore, it is quite natural to presume that the truth is somewhere in between, or stated in different words, that there is a certain trade-off between the analysis of separability (α = 1) and correlation (α = 0). This perspective directly leads to the idea of interpolating between both limiting cases. The realization in Line 5 computes a convex combination for each feature separately, which involves several additional parameters except for α. Most importantly, it depends on the empirical correlation with (y 1,..., y n ). Each coefficient ρ l [ 1, 1] measures how strong the l-th feature is correlated with the disease. In fact, this can be already seen as a pre-estimate of the later feature ranking in SPA-Select. The main idea behind the definition of α l = α cexp(1 ρl ) is to control the interpolation speed between λ x k,l and x k,l σ l : Let us suppose that we start with α = 1 and then continuously approach α = 0; see also Figure 3.5. If ρ l 0, the exponent of α l becomes relatively large, and consequently, we quickly arrive at α l = 0. On the other hand, when ρ l is larger, the curve α α l becomes increasingly concave and we stay relatively long close to α l = 1. To summarize, stronger correlated features decay to their standardization more slowly than less correlated ones, as α 0. In particular, the amplitude of large, but unimportant peaks falls down very quickly, whereas small, relevant peaks are not that much shrunken. In this way, we are actually able to control the trade-off between separability and correlation. Let us now consider the tuning parameters λ > 0 and c exp 0. The exponent c exp 0 mainly controls the curvature of α α l : When c exp is large, the decay of α l becomes strong, as α 0, and we quickly approach the limiting case of α l = 0. Thus, increasing c exp might help to amplify strongly correlated features. The role of λ can become crucial in some situations. This is due to the fact that x k,l σ l is scaling-invariant 2 while x k,l is not. Thus, if X scat is badly scaled, a weighted interpolation could not bring the desired benefit. It is therefore reasonable to incorporate an additional scaling factor λ. For the specific case of MS-data, most of the features correspond to the 1 Or equivalently, every column of X std has euclidean unit norm. Note that if σ l = 0, we would obviously have x k,l = 0 for all k [n], making a standardization unnecessary anyway. 
2 By scaling we mean that the entire matrix X scat is multiplied by a scalar factor.

35 3.2 SPA at a Glance 29 Figure 3.5: Plot of α l as a function of α with c exp = 2 and several choices of ρ l. baseline, being relatively small in magnitude compared to the few peak-related features. Since we are mainly interested in the latter ones, it is evident to choose λ in a such way that the baseline features are only slightly affected by SPA-Standardize. Then, we particularly have that λ x k,l x k,l σ l if l [d ] corresponds to a peak. One possibility to achieve this but this is by far not the only one is to use the (sample) median of the whole data matrix X scat, which is most likely attained at some feature on the baseline. More precisely, we choose (see also Definition A.8) λ = 1 m( X scat ), where the absolute values are taken element-wise and the matrix X scat is considered as a vector. However, it should be clear that, in general, the choice of λ highly depends on the actual structure of the input data. Finally, it should be emphasized that there are numerous alternative ways to define the weighting maps α α l. Our approach was essentially motivated by the correlationweighted speed of interpolation between the limiting cases. However, it is quite unclear what the optimal values of α, c exp, and λ are, and how to actually find them. Thus, SPA- Standardize indeed requires some effort of adaptive tuning. Some ideas and extensions to automate this process are suggested in Section 6.3.
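To make the above description more concrete, a minimal NumPy sketch of Algorithm 3.4 could look as follows. It is not part of the SPA implementation used for the experiments in this thesis; in particular, the use of the absolute correlation |rho_l| in the exponent of alpha_l and the median-based default for lambda follow the discussion above and should be understood as illustrative choices.

import numpy as np

def spa_standardize(X_scat, y, alpha=0.5, c_exp=2.0, lam=None, eps=1e-12):
    """Sketch of SPA-Standardize (Algorithm 3.4): feature-wise weighted interpolation
    between the centered data and its fully standardized version."""
    y = y.astype(float)
    if lam is None:
        lam = 1.0 / np.median(np.abs(X_scat))          # scaling factor lambda = 1 / median(|X_scat|)
    X_c = X_scat - X_scat.mean(axis=0)                 # center every feature (Lines 1-2)
    sigma = np.sqrt((X_c**2).mean(axis=0)) + eps       # empirical standard deviations (Line 3)
    y_c = y - y.mean()
    rho = (X_c * y_c[:, None]).mean(axis=0) / (sigma * np.sqrt((y_c**2).mean()) + eps)  # correlations (Line 4)
    a_l = alpha ** (c_exp * (1.0 - np.abs(rho)))       # feature-wise interpolation weights alpha_l
    return a_l * lam * X_c + (1.0 - a_l) * X_c / sigma # weighted interpolation (Line 5)

# toy usage with random data standing in for the scattering coefficients
rng = np.random.default_rng(0)
X = rng.random((40, 128))
y = rng.choice([-1, 1], size=40)
X_std = spa_standardize(X, y, alpha=0.5, c_exp=2.0)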

36 30 3 Toward Sparse Proteomics Analysis (SPA) SPA-Select: Sparse Feature Selection So far, we have merely preprocessed the data in order to prepare them for a fingerprint extraction. Before describing the actual process of feature selection, let us recall the linear decision model from Section 3.1, that is, y = sign( x std, ω ), (LDM) where (x std, y) R d { 1, +1} is some sample pair and ω R d a linear classifier (cf. (3.1)). Following our main objective from Section 2.3, we intend to learn a sparse feature vector ω R d such that many of the input pairs (x std 1, y 1),..., (x std n, y n ) satisfy (LDM) (good classification performance for training samples). But simultaneously, ω should also contain biologically relevant information that allows us to use (LDM) for a reliable prediction of the health status of an unknown sample x std R d (good generalization performance). To meet this challenge, SPA offers three different approaches for sparse feature selection, which are presented now. But a detailed discussion of why these methods are indeed successfully working, and which one should be preferred, is postponed again to Chapter 4 (especially to the Sections 4.3 and 4.4). Robust 1-Bit Compressed Sensing Besides machine learning, one could alternatively view (LDM) in the context of 1-bit compressed sensing (CS), which means that we assume that y was acquired as a 1-bit measurement of ω, using the measurement vector x std. The goal is then to recover ω from a set of measurement pairs (x std 1, y 1),..., (x std n, y n ). For a brief introduction to the field of (1-bit) compressed sensing, see Appendix A.5. The following convex program was proposed in [65] as a method for the robust recovery from 1-bit measurements: 1,2 Algorithm 3.5: SPA-Select via Robust 1-Bit Compressed Sensing Input: Standardized data {(x std k, y k)} k [n] (R d { 1, +1}) n ; Sparsity parameter s > 0 Output: Feature vector ω R d Compute ω := argmax ω R d n k=1 y k x std k, ω s.t. ω 2 1 and ω 1 s. (Rob 1-Bit) 1 Here and in the following algorithms, the solution of the optimization problem could be non-unique. Hence, argmin or argmax means that we just pick one solution. The issue of uniqueness will be addressed again in Remark For a brief introduction to (convex) optimization, see Appendix A.4.
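Since (Rob 1-Bit) is a convex program, it can be handed to any generic convex solver. The following sketch uses the Python package cvxpy and purely synthetic standardized data; it is meant as an illustration of the constrained formulation only and is not the implementation used for the experiments in this thesis.

import numpy as np
import cvxpy as cp

# synthetic standardized data (placeholder for the output of SPA-Standardize)
rng = np.random.default_rng(0)
n, d = 60, 200
X_std = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
s = 5.0                                                      # sparsity parameter

w = cp.Variable(d)
objective = cp.Maximize(cp.sum(cp.multiply(y, X_std @ w)))   # sum_k y_k <x_k^std, w>
constraints = [cp.norm(w, 2) <= 1, cp.norm(w, 1) <= s]       # ||w||_2 <= 1 and ||w||_1 <= s
cp.Problem(objective, constraints).solve()
omega = w.value
print(np.sum(np.abs(omega) > 1e-6), "features with non-negligible weight")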

37 3.2 SPA at a Glance 31 To get a basic understanding for (Rob 1-Bit), let us first observe that y k x std k, ω > 0 y k = sign( x std k, ω ). Hence, the sum in (Rob 1-Bit) will be particularly large when many samples are correctly classified by ω. However, full consistency with the measurements, i.e., y k = sign( x std k, ω ) for all k [n], is not strictly enforced, which makes the algorithm (to some extent) robust against bitflips. 1 Such a corruption could arise in practice when a physician has wrongly classified a patient from the very beginning. The constraint of (Rob 1-Bit), on the other hand, promotes the effective sparsity of the maximizer ω, depending on the choice of the sparsity parameter s. These observations encourage our intuition that SPA-Select via robust 1-bit compressed sensing indeed provides a sparse feature vector that allows for an appropriate classification of the samples. The idea of applying (Rob 1-Bit) to MS-data was originally suggested in [21]. But due to its simplicity, there are several inevitable drawbacks coming along with this approach (cf. Section 4.3), and therefore, we consider two further methods for feature selection here. The l 1 -Norm Support Vector Machine The concept of support vector machines (SVMs) is based on the principle of optimally separating a set of (labeled) data points by a hyperplane (see also Figure 3.1). This purpose can be easily combined with sparse feature selection, leading to the so-called l 1 -norm support vector machine (l 1 -SVM). For SPA, we use the version of l 1 -SVM which was suggested in [90]: 2 Algorithm 3.6: SPA-Select via l 1 -Norm Support Vector Machine Input: Standardized data {(x std k, y k)} k [n] (R d { 1, +1}) n ; Sparsity parameter s > 0 Output: Feature vector ω R d Compute ω := argmin ω R d n k=1 [ 1 y k x std k, ω ] + s.t. ω 1 s. (l 1 -SVM) The structure of (l 1 -SVM) looks actually very similar to the one of (Rob 1-Bit). Again, the l 1 -constraint is responsible for encouraging a sparse solution, while minimizing the 1 A bit-flip means that the sign of some entry of the ( true ) observation vector y = (y 1,..., y n) is flipped. 2 Recall that [t] + := max{0, t} for t R.

38 32 3 Toward Sparse Proteomics Analysis (SPA) hinge loss, (y k, x std k, ω) [ 1 y k x std k, ω ]+, guarantees a good separation of the two classes. An important difference to (Rob 1-Bit) is the non-linearity of the optimized functional. Indeed, if y k x std k, ω > 1 for some k [n], the corresponding summand does not further contribute to the minimization anymore. Geometrically, this means that x k is correctly classified and lies far behind the margin of the separating hyperplane H(ω, 0). This perspective of margin maximization will be discussed in detail in the course of Subsection The LASSO The least absolute shrinkage and selection operator (LASSO), introduced by R. Tibshirani in [76], is a very popular and widely-used method in statistical learning. Thus, it is particularly interesting to see how SPA performs with the LASSO as feature selection algorithm. Similar to (Rob 1-Bit) and (l 1 -SVM), we present the LASSO as a constrained optimization problem: Algorithm 3.7: SPA-Select via the LASSO Input: Standardized data {(x std k, y k)} k [n] (R d { 1, +1}) n ; Sparsity parameter s > 0 Output: Feature vector ω R d Compute ω := argmin ω R d n (y k x std k, ω )2 s.t. ω 1 s. (LASSO) k=1 The LASSO was originally designed for the regularization of the ordinary least squares fit n (y k x std k, ω )2. min ω R d k=1 The actual purpose here is rather to solve a linear regression problem, that is, to learn a classifier ω R d for a linear model of the form y = x std, ω, y R, x std R d. Although our model in (LDM) is quite different and the output variable y is binary, we will see that, somewhat surprisingly, the LASSO is capable of providing promising fingerprints as well.
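Both (l1-SVM) and (LASSO) can be prototyped with a generic convex solver in essentially the same way as (Rob 1-Bit) above. The cvxpy sketch below uses synthetic data and the constrained formulations of Algorithms 3.6 and 3.7; note that most off-the-shelf implementations (e.g., in scikit-learn) solve the penalized (Lagrangian) counterparts instead, which correspond to the constrained programs only up to a reparametrization of s.

import numpy as np
import cvxpy as cp

# synthetic standardized data (placeholder for X_std from SPA-Standardize)
rng = np.random.default_rng(0)
n, d = 60, 200
X_std = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
s = 5.0                                            # sparsity parameter

w = cp.Variable(d)

# (l1-SVM): minimize the hinge loss subject to an l1-constraint
svm = cp.Problem(cp.Minimize(cp.sum(cp.pos(1 - cp.multiply(y, X_std @ w)))),
                 [cp.norm(w, 1) <= s])
svm.solve()
omega_svm = w.value.copy()

# (LASSO): least squares fit subject to the same l1-constraint
lasso = cp.Problem(cp.Minimize(cp.sum_squares(y - X_std @ w)),
                   [cp.norm(w, 1) <= s])
lasso.solve()
omega_lasso = w.value.copy()

for name, omega in [("l1-SVM", omega_svm), ("LASSO", omega_lasso)]:
    print(name, ":", np.sum(np.abs(omega) > 1e-6), "features with non-negligible weight")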

39 3.2 SPA at a Glance SPA-Sparsify: Detection of the Connected Components of ω The step of SPA-Select was presented in a very generic fashion, which makes it applicable to more general types of data. However, the fact that the algorithms do not rely on any assumption on the input X std can lead to several difficulties, especially for the 1-bit compressed sensing approach. In fact, one can easily show (cf. Subsection 4.3.2) that the feature selection via (Rob 1-Bit) is nothing else but a rescaled soft-thresholding of the average spectrum x := y X std R d. 1 In the case of MS-data, x usually consists of wide-spread (Gaussian-shaped) peaks; see Figure 2.4 for example. Thus, when selecting features by thresholding, we will not only capture the maximum of a peak but also those features which are very close to it. As Figure 3.6 shows, this leads to consecutive sequences of indices which are selected for the support of ω. In order to avoid such an unnecessary redundancy, we introduce the step of SPA-Sparsify, which detects the maximally connected components 2 of ω and produces a sparsified version of it: Algorithm 3.8: SPA-Sparsify Input: Feature vector ω = (ω 1,..., ω d ) R d ; Threshold ε 0 Output: Sparsified feature vector ω sp = (ω sp 1,..., ωsp d ) R d Compute 1 Perform a hard-thresholding to set extremely small values to 0: ω l := { ω l, ω l > ε, 0, otherwise, l [d ]. 2 Detect the maximally connected components C 1,..., C D supp( ω) [d ]. 3 Define ω sp R d by ω l, if l C m for some m [D] ω sp l := and l = argmax l C m ω l, 0, otherwise, l [d ]. The necessity of the hard-thresholding in Line 1 is due to possible computational inaccuracies: It could happen that some numerical solvers deliver solutions containing very small entries which would be actually equal to 0 in theory. Concerning the sparsification in Line 3, we should first note that the definition of ω sp 1 Since X std is centered, we have x = 2 k G + x std k = 2 k G x std k, which is proportional to the average taken over one of the groups (either G + or G ). 2 A connected component C [d ] of ω is a subset of supp(ω) which only consists of consecutive indices. Moreover, C is maximally connected if for any further connected component C C of ω we actually have C = C.

40 34 3 Toward Sparse Proteomics Analysis (SPA) Figure 3.6: Illustration of feature selection by soft-thresholding of a spectrum. The support of ω is marked by the red bullets on the bottom line, which are also projected onto the data vector. is well-defined because the components C 1,..., C D are mutually disjoint. The actual computation is very simple: For each connected component C m, we keep the maximal entry of ω, restricted to C m, and set all others to 0. In the example of Figure 3.6, the output vector would then just contain two features, namely those which correspond to the maxima of the peaks. For this reason, it is sometimes more appropriate to refer to the number of connected components of ω when speaking of a sparse feature vector. Unfortunately, the benefit of SPA-Sparsify heavily depends on the quality of the average spectrum and in several situations, e.g., when the peak centers strongly deviate, it might become unstable. While this step is inevitable for (Rob 1-Bit), we will see that (l 1 -SVM) and (LASSO) are able to avoid redundant selections on their own (see Subsection 4.4.3). Hence, we will usually omit Line 3 of SPA-Sparsify for these two approaches.
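A possible NumPy realization of Algorithm 3.8 is sketched below; the toy vector at the end mimics two wide-spread peaks whose redundant neighboring features are removed. The component detection via runs of consecutive indices follows the definition of maximally connected components given above; the threshold and the example values are arbitrary.

import numpy as np

def spa_sparsify(omega, eps=0.0):
    """Sketch of SPA-Sparsify (Algorithm 3.8): hard-threshold, detect the maximally
    connected components of the support, and keep only the largest entry (in
    magnitude) of each component."""
    w = np.where(np.abs(omega) > eps, omega, 0.0)      # Line 1: hard-thresholding
    omega_sp = np.zeros_like(w)
    support = np.flatnonzero(w)
    if support.size == 0:
        return omega_sp
    # Line 2: split the support into runs of consecutive indices
    components = np.split(support, np.where(np.diff(support) > 1)[0] + 1)
    # Line 3: keep one representative (the largest magnitude) per component
    for comp in components:
        l_max = comp[np.argmax(np.abs(w[comp]))]
        omega_sp[l_max] = w[l_max]
    return omega_sp

# example: two "peaks" of consecutive non-zero entries
omega = np.array([0, 0.2, 0.9, 0.4, 0, 0, 0.1, 0.5, 0.3, 0])
print(spa_sparsify(omega, eps=0.05))   # keeps only the entries 0.9 and 0.5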

41 3.2 SPA at a Glance SPA-Recover and SPA-Reduce: Feature Recovery from Scattering Coefficients and Dimension Reduction The following two optional steps of SPA deal with the question of how to proceed with a feature vector. Perhaps, the most important challenge is to extract potential biomarkers from ω sp. For this purpose, it could be particularly useful to map the vector ω sp back to the d-dimensional raw-data-space, where the feature variables can be interpreted more easily. Due to the windowed averaging, the transformation of SPA-Scattering is clearly not one-to-one, and therefore, we cannot expect a unique result here. A very elementary approach for such a feature recovery looks as follows: Algorithm 3.9: SPA-Recover Input: Feature vector ω sp = (ω sp 1,..., ωsp d ) R d ; Window scale J N 0 (must be the same as in SPA-Scattering) Output: Recovered feature vector ω = (ω 1,..., ω d ) R d Compute Define ω R d by { ω sp l, if l = 2 J (l 1) + 1 for some l [d ], ω l := 0, otherwise, l [d]. This computation simply expands ω sp with additional zero-entries in order to compensate the subsampling rate of 2 J in Line 2 of SPA-Scattering. The support of the resulting vector ω R d might sometimes not exactly correspond to peak-related features, but a physician should be usually able to identify the underlying biomarkers manually. On the other hand, one could also improve the precision of SPA-Recover by a finer subsampling in SPA-Scattering (oversampling), but this would obviously cause higher computational costs. A second aspect concerns the prediction of the health status of unknown samples by using the selected features. From the perspective of clinical research, this issue is perhaps a bit less important because MS-data is often difficult to reproduce (cf. (C1) in Section 2.3), making the classification results also depend on external factors. The prediction accuracy is however a useful (and sometimes the only available) measure to evaluate the quality and relevance of a feature vector. In real-world scenarios, we usually do not know what the true biomarkers are; a cross-validation 1 could therefore indicate whether the selected features are really correlated with the disease, or whether there is rather a spurious correlation that is just a coincidence of the noise. The latter point is closely related to the problem of overfitting, which we already discussed in the course of Section 2.3; recall, that we do not primarily aim for a perfect classification of a (small) training set but rather for an accurate prediction of unknown samples. 1 See Section 5.2 for details on the technique of cross-validation.

42 36 3 Toward Sparse Proteomics Analysis (SPA) The step of SPA-Select tries to satisfy this demand by encouraging sparse solutions, which hopefully contain only a few (biologically) relevant features. In this way, (almost) all unimportant feature variables are rejected and we can finally perform a dimension reduction by projecting the data onto supp(ω sp ): Algorithm 3.10: SPA-Reduce Input: Feature vector ω sp = (ω sp 1,..., ωsp d ) R d ; Raw mass spectrum x = (x 1,..., x d ) R d Output: Projected spectrum x red R D, where D = ω sp 0 = # supp(ω sp ) Compute 1 Apply SPA-Normalize and SPA-Scattering to x to obtain x scat R d. 2 Let supp(ω sp ) = {l 1,..., l D } [d ]. Then, we put x red := (x scat l 1,..., x scat l D ) R D. Note that one could optionally omit the application of SPA-Scattering in Line 1 and project the raw data directly (using the recovered vector ω from SPA-Recover); but working in the feature space seems to be more appropriate because this does not rely on any back transformation, such as SPA-Recover. By the above projection step, the dimension of the ambient space is dramatically reduced to D d. Eventually, the problem of classifying and clustering 1 becomes much easier, especially because almost all classical tools from machine learning are available in such a low-dimensional setting. In the experiments of Section 5.2, for instance, we do not directly apply the feature vector ω sp as a classifier, 2 but instead, the test data set is first projected onto a low-dimensional space by SPA-Reduce and then classified by an ordinary SVM. The question of how to proceed with the output of SPA is actually an interesting challenge on its own and heavily depends on the specific application. However, this issue is somewhat too general for our purposes, and therefore, it will not be discussed in this thesis anymore. 1 A typical example of clustering is when we do not only care for a binary disease label but also for several (unknown) subtypes of a disease. 2 This means that the label of an input data vector x scat is predicted by (LDM) with ω sp.
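The two optional steps can be sketched in a few lines of NumPy (0-based indexing is used below, so the index map of SPA-Recover reads l = 2^J * l' instead of l = 2^J (l' - 1) + 1); the dimensions and the toy feature vector are of course arbitrary and only serve as an illustration.

import numpy as np

def spa_recover(omega_sp, J, d):
    """Sketch of SPA-Recover (Algorithm 3.9): map a feature vector from the
    d'-dimensional scattering space back to the d-dimensional raw-data space
    by inserting zeros to undo the subsampling rate 2^J."""
    omega = np.zeros(d)
    idx = 2**J * np.arange(len(omega_sp))   # 0-based counterpart of l = 2^J (l' - 1) + 1
    omega[idx] = omega_sp
    return omega

def spa_reduce(x_scat, omega_sp):
    """Sketch of SPA-Reduce (Algorithm 3.10): project a (preprocessed) spectrum
    onto the support of the sparsified feature vector."""
    return x_scat[np.flatnonzero(omega_sp)]

# toy usage: d = 4096 raw features, J = 5, d' = 128 scattering features
J, d = 5, 4096
omega_sp = np.zeros(d // 2**J)
omega_sp[[10, 70]] = [0.8, -0.5]
print(np.flatnonzero(spa_recover(omega_sp, J, d)))   # raw-data indices 320 and 2240
x_scat = np.random.randn(d // 2**J)
print(spa_reduce(x_scat, omega_sp).shape)            # (2,)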

43 4 Theory for SPA This chapter is devoted to investigating the theoretical foundations of SPA. At first, in Section 4.1, a statistical model for proteomics MS-data is introduced and discussed. Based on this theoretical model, we will then verify the benefit of the individual SPAcomponents, where a special focus is on the steps of SPA-Scattering (Section 4.2), SPA-Standardize, and SPA-Select (Sections 4.3 and 4.4). However, the reader should not expect an explicit theorem here, stating the success of the fingerprint detection. Such a result would require a much deeper analysis of the underlying model and may even ask for developing a new theory. This challenge would unfortunately go beyond the scope of this thesis, but could be a part of future work. 4.1 Statistical Models for Proteomics Data Before we can actually start analyzing the theoretical aspects of SPA, we clearly need to define an appropriate model for (proteomics) mass spectra, which hopefully reflects many characteristic properties of real-world data. Of course, the development of such a model should involve a certain degree of randomness because we are never in the situation of having complete knowledge about a human individual. Moreover, many theoretical conclusions will be affected by an additional uncertainty because the number of available samples, which were drawn from our model distribution, is always finite A Correlation-Based Forward Model As the title of this subsection already indicates, we would like to design a forward model 1 which explicitly specifies how our proteomics data is generated. For this purpose, it is useful to recall the content of Section 2.2 (especially the Figures 2.2 and 2.4) which sketches the technique of mass spectrometry as well as the characteristic structure of mass spectra. This description already suggests a very informal definition of a proteomics forward model: Spectrum = Sum of (Gaussian) peaks + Baseline noise. (4.1) 1 This also known as a generative model, which is actually more common in literature. But this section is substantially inspired by [41], where the authors prefer forward model. Thus, we adopt this convention here. 37

44 38 4 Theory for SPA Each single peak corresponds to a certain type of molecule. Thus, we can assume that the entire set of peaks, which may occur in the data, is represented by a finite collection of feature atoms a 1,..., a M R d. In order to keep our theoretical analysis technically simple, we restrict only to the case where a 1,..., a M are discrete samples of Gaussianshaped functions, i.e., 1 with a m = (a m,1,..., a m,d ) := (G cm,β m (1),..., G cm,β m (d)) R d, m [M], (4.2) G cm,βm (t) := exp ( (t c m) 2 ), t R, c m R, β m > 0. (4.3) β 2 m Note that the feature atoms a 1,..., a M are sometimes also called activation patterns in literature (cf. [41]). The Gaussian function in (4.3) does not include a scalar factor that determines the peak height. In fact, while the definition of the feature atoms is deterministic, their amplitude shall be expressed by random variables, which are somehow correlated with the health status of the corresponding sample. For this purpose, let (Ω, A, P) be the underlying probability space. 2,3 Then, the peak amplitudes are denoted by real-valued random variables S 1,..., S M L 2 (P), which are usually called latent factors (cf. [41]). As a representation of the baseline noise, we additionally introduce a random vector N = (N 1,..., N d ): Ω R d with N 1,..., N d L 2 (P). Now, we can make the vague model of (4.1) precise and define a (random) mass spectrum by X(ω) = (X 1 (ω),..., X d (ω)) := M S m (ω)a m + N(ω) R d, ω Ω. (4.4) m=1 The corresponding disease label, on the other hand, is indicated by another binary random variable Y : Ω { 1, +1}. Remark 4.1 (Notation) We will usually omit the argument ω (which should not be mixed up with the entries of a feature vector ω). Moreover, random variables are mostly indicated by capital letters, whereas lowercase is rather used for deterministic quantities. Up to now, we did not make any specific restrictions on S 1,..., S M, N, or Y. In particular, the disease label Y does not explicitly appear in (4.4). However, there are some assumptions that seem to be reasonable for our general proteomics model. At first, we shall assume that the disease indicator Y is symmetric, i.e., P(Y = 1) = 1/2 = P(Y = +1), or equivalently, E(Y ) = 0 (see also Example A.11). Similarly, the baseline noise should be centered, meaning that E(N) = 0, and each entry N l is assumed to be independent of the disease label Y as well as of the latent factors S 1,..., S M. At a first sight, supposing that 1 Again, we do not care about the physical meaning of the indices here, since this has no consequences for the actual mathematical theory. 2 As usual in probability theory, this space is not further specified and we always assume that it is rich enough to model all required random variables. 3 For a brief introduction to basic notions from probability theory, see Appendix A.3.

E(N) = 0 (and also the fact that the latent factors may take negative values) appears somewhat inappropriate, but since the input data are centered by SPA-Standardize anyway, this simplification does not cause any loss of generality. The actual connection between a sample spectrum X and its label Y is therefore completely encoded by the disease correlation coefficients
$$\rho_{S_m} := \rho(Y, S_m) = \frac{\mathbb{E}\big(Y\,(S_m - \mathbb{E}(S_m))\big)}{\sqrt{\mathbb{V}(S_m)}} \in [-1, 1], \quad m \in [M]. \tag{4.5}$$
This correlation-based relationship is quite natural because a pure analysis of MS-data does not (actually never) allow for causal conclusions on the underlying hidden mechanisms of a specific disease. Hence, the best we can hope for is to determine the correlation coefficients, which could at least indicate the relevance of certain features.
So far, we have strictly followed the linear approach of [41, 62], where the activation patterns a_1, ..., a_M are fixed in advance. According to the challenge of (C2) from Section 2.3, such a model could actually be too simple for an appropriate representation of real-world data. Indeed, to incorporate (small) peak shifts, there needs to be some randomness involved in the feature atoms as well. This can be easily realized by replacing (4.2) with
$$a_m(\omega) = (G_{C_m(\omega),\beta_m}(1), \dots, G_{C_m(\omega),\beta_m}(d)) \in \mathbb{R}^d, \quad \omega \in \Omega, \ m \in [M],$$
where $C_m \in L^2(\mathbb{P})$ is a randomized peak center with $c_m = \mathbb{E}(C_m)$ for $m \in [M]$. Adapting (4.4) to this extension, we finally obtain the proteomics forward model 1
$$X(\omega) = \sum_{m=1}^{M} S_m(\omega)\, a_m(\omega) + N(\omega) \in \mathbb{R}^d, \quad \omega \in \Omega, \tag{FwM}$$
which is perhaps the most important definition of this chapter. Table 4.1 contains a compact overview of all notations that were introduced in this subsection.

4.1.2 Discussion and Coherence Between Features

In order to get some intuition for our abstract proteomics model, let us first consider an easy example. This realization of (FwM) particularly describes how the MS-data for the experiments in Section 5.1 is generated.

Example 4.2 (A Realization of the Proteomics Model) In an idealized world, we could think of the healthy group (Y = +1) as a set of individuals with a perfect health status, which particularly means that all probands have (almost) the same proteome. For a diseased patient (Y = -1), on the other hand, we may assume that each single
1 Although a_1, ..., a_M are now random variables, we deny using capital letters here, since the influence of randomness is relatively low (compared to the latent factors) and, on the other hand, we would like to emphasize the connection to the linear model in [41].

Table 4.1: A summary of the notations that we need for the proteomics forward model. Note that all quantities (except for the Gaussian peak functions) are random variables.

  Model Component                                                Mathematical Notation
  Mass spectrum                                                  $X = (X_1, \dots, X_d) \in \mathbb{R}^d$
  Disease label                                                  $Y \in \{-1, +1\}$
  Feature atoms, activation patterns (indexed by $m \in [M]$)    $a_m = (a_{m,1}, \dots, a_{m,d}) \in \mathbb{R}^d$
  Latent factors                                                 $S_1, \dots, S_M \in \mathbb{R}$
  Baseline noise                                                 $N = (N_1, \dots, N_d) \in \mathbb{R}^d$
  Random peak centers                                            $C_1, \dots, C_M \in \mathbb{R}$
  Gaussian peak functions                                        $t \mapsto G_{c_m,\beta_m}(t) = \exp(-(t - c_m)^2/\beta_m^2)$

(disease) biomarker has a certain probability of activation, and the concentration of the corresponding protein changes by a multiplicative factor. These assumptions are of course only an approximation of the truth, simply because of the fact that a proteome is partially encoded by the DNA, which is unique for every human individual. However, we know that many real-world data sets contain only a very small set of significant biomarkers (cf. Section 2.3). This particularly implies that the remaining feature variables are rather weakly correlated with the disease, so that they can be considered as a certain type of (almost) independent noise. For this reason, such a simple model is at least suited as a numerical benchmark for the performance of SPA (see Section 5.1).
In order to formalize this model, we introduce latent factors of the following form:
$$S_m = I_m \, \mathbb{1}_{\{Y=+1\}} + I_m (1 + s_m T_m) \, \mathbb{1}_{\{Y=-1\}}, \quad m \in [M]. \tag{4.6}$$
Here, $I_1, \dots, I_M \in L^2(\mathbb{P})$ are random variables which determine the intensities of the healthy samples. The activation probability of the m-th feature atom is modeled by a Bernoulli variable $T_m \sim \mathrm{Ber}(p_m)$ with $p_m \in [0,1]$; and the (fixed) scalar factor $s_m \in \mathbb{R}$ specifies how much the peak amplitude deviates when the corresponding biomarker is activated. Figure 4.1 illustrates the different events that may occur in (4.6). Moreover, we shall assume that $I_1, \dots, I_M, T_1, \dots, T_M$ and Y are all independent. However, this does not imply that the latent factors $S_1, \dots, S_M$ are independent of Y. In fact, one can show that there is a one-to-one (and monotone) correspondence between the activation probability $p_m$ and the disease correlation coefficient $\rho_{S_m} = \rho(Y, S_m)$. Thus, the relevance of the feature atoms can be equivalently ordered in terms of the set $\{p_1, \dots, p_M\}$.
The entries of the baseline noise vector $N = (N_1, \dots, N_d)$ are assumed to be i.i.d. with $N_l \sim \mathcal{N}(0, \sigma^2_{\mathrm{noise}})$ for some $\sigma^2_{\mathrm{noise}} > 0$. Similarly, the peak shifts shall be normally distributed as well, i.e., $C_m \sim \mathcal{N}(c_m, \sigma^2_{\mathrm{shift}})$, $m \in [M]$, but here we do not require the independence of the variables.
Let us now return to the abstract viewpoint of Subsection 4.1.1. Our ultimate goal is to understand how MS-data is generated and in which way it is correlated with a certain disease; thus, we would like to learn the unknown (joint) distribution of (X, Y). This

47 4.1 Statistical Models for Proteomics Data 41 E(I m ) (1+ 0.5) E(I m ) Healthy Diseased (not activated) Diseased (activated) Figure 4.1: Illustration of Example 4.2 (for the m-th feature atom). The amplitude of the peak in the healthy case (left) has an expectation of E(I m ). This is also the case when the sample is in the diseased group but the corresponding feature is inactive (middle). On the other hand, if the feature is activated (right), the amplitude changes to I m (1 + s m T m ); in this example, we have s m = 0.5. Note that in all three cases, the peaks are actually located around the same center c m = E(C m ). challenge can be made more precise by our forward model in (FwM): First of all, we are interested in the latent factors S 1,..., S M and their respective disease correlation coefficients ρ S1,..., ρ SM. But one the other hand, the feature atoms a 1,..., a M are important as well because they contain the spatial information that allows for the actual identification of biomarkers. Thus, our task is to estimate both ρ Sm and a m. However, this problem is sometimes impossible to solve, which is especially due to the fact that we do not claim the independence of S 1,..., S M. In fact, it could even happen that S m = S m for m m, making the representation of X by (FwM) non-unique. Such a difficulty may occur in practice, for instance, when many diseased patients are treated by a specific drug. This drug could then have a side effect, increasing the abundance of certain proteins in the blood and eventually leading to feature atoms which are strongly correlated with Y. But this would be rather a misleading correlation because the side effect is not directly related to the disease. This simple example illustrates that a certain redundancy in the feature selection step is sometimes necessary to avoid missing important features. On the other hand, there is a different type of redundancy which should be eliminated: The Gaussian shape of the peaks (cf. (4.3)) implies that each feature atom has a certain spatial extent. Hence, assuming that a m corresponds to a relevant biomarker, every feature in the (essential) support of a m could equally well contribute to the separation of the samples and is therefore a potential candidate for a fingerprint. But the actual goal should be to choose only one of these features preferably the most significant one in order to satisfy the demand for a sparse and interpretable selection process. We will see in the Sections 4.3 and 4.4 that (l 1 -SVM) and (LASSO) are indeed able to select features in a non-redundant way, whereas

48 42 4 Theory for SPA (Rob 1-Bit) requires additional postprocessing by SPA-Sparsify to achieve this. The previous two paragraphs have focused on cross-correlations between features, that is, ρ(x l, X l ) for l, l [d]. In this case, we speak of coherence between features, rather than of correlation 1 this is particularly to avoid confusion with the correlation of single features with the disease label Y, i.e., ρ(x l, Y ). Furthermore, we will distinguish between two types of coherence: On the one hand, there is a coherence of Type I which refers to the correlation between different feature atoms, expressed by ρ(s m, S m ) for m m. A typical example for this would be spurious correlations that may arise from a drug treatment (see above). On the other hand, the cross-correlation between features which belong to the same atom a m is called coherence of Type II. Especially the coherence of Type I makes it extremely difficult to find a decomposition of the input data which appropriately approximates the forward model of (FwM). In general, there are various strategies to meet this challenge. One approach would be to adapt the method specifically to MS-data, for instance, by a peak detection. However, such an adaptive tuning is not our primal goal because it would hinder a general applicability as well as a mathematically sound theory. Instead of that, we are rather aiming for generic components, such as SPA-Scattering and SPA-Standardize, whose benefit will be verified in the subsequent sections. Another important aspect concerns the actual structure of MS-data: As already mentioned in Example 4.2, real-world mass spectra are usually containing only a very small set of discriminative features, which is even much smaller than the total number of peaks. In the context of our forward model, this means that the vector of disease correlations ρ = (ρ S1,..., ρ SM ) [ 1, 1] M contains just a few entries which are close to +1 or 1, and the rest is relatively close to 0; thus, ρ is actually approximately sparse. This observation is particularly useful in practice because it shows that a promising feature selection is possible, even when not all feature atoms are detected by the algorithm A Simplified Backward Model In the previous subsections, we have tried to model mass spectra as detailed as possible, namely in terms of the joint probability distribution of (X, Y ). But in the setting of supervised learning, we are just given a sample set {(x k, y k )} k [n] drawn from (X, Y ), and in particular, there is no information available on the underlying feature atoms or latent factors. Although the ultimate challenge is still to estimate the unknown pairs {(a m, S m )} m [M], the actual feature selection method of SPA shall not explicitly strive for this goal; indeed, this would always rely on very strong assumptions on the input data and particularly contradicts the wish for a generic framework. The algorithms of SPA-Select are therefore based on the much simpler classification model y = sign( x, ω ), y { 1, +1}, x, ω R d, (BwM) 1 The notion of (mutual) coherence goes back to D. Donoho and X. Huo in [24] and became especially popular in the theory of sparse representations and recovery; see [27] for instance.

49 4.2 Robustification via the Scattering Transform 43 which we have already discussed in Chapter 3 (see (3.1) and (LDM)). Compared to the forward model of (FwM), there are two remarkable differences: On the one hand, we observe that (BwM) is a much more general approach which does not encode any prior knowledge about the input pair (x, y), such as activation patterns or latent factors. On the other hand, (BwM) is a typical example of a backward model (or discriminative model): In contrast to forward models, one is not primarily interested in the generation of a sample (x, y) here, but rather in extracting the label y from a given datum x. 1 The vector ω plays the role of an extraction filter in (BwM), meaning that it tries to filter out those features which are relevant for explaining the observation y (cf. [41]). The most difficult challenge when using (BwM) concerns the interpretability of the feature vector ω. Compared to the pairs of {(a m, S m )} m [M], which have a clear biological meaning, one has to be careful when extracting information from ω. In fact, there are several examples in machine learning where interpreting the most significant entries of ω can lead to wrong conclusions on the actual signal-of-interest. The issue of drawing relevant conclusions from a feature vector ω will therefore ask for special attention when analyzing SPA-Select in the Sections 4.3 and 4.4. A more general and well-illustrated discussion of the interpretability of features can be found in [41]. 4.2 Robustification via the Scattering Transform The previous section was presented in a rather general setting and the SPA-framework of Chapter 3 was only marginally involved. With the proteomics model of (FwM), we are now able to perform a much more rigorous analysis of SPA. At first, we return to the step of SPA-Scattering, in order to verify that the scattering transform is indeed able to make MS-data robust against local peak shifts (see Subsection 4.2.2). Before, it might be beneficial to consider the scattering transform in its general form The Scattering Transform Originally introduced by S. Mallat in [57], the scattering transform (ST) was primarily applied to problems from audio- and image-processing, most notably, handwritten digit recognition and texture classification (cf. [5, 10, 12]). For an illustration, let us consider the handwritten digit example of Figure 4.2: Every single image is clearly affected by a certain translation as well as local deformations (displacements). Thus, it is not obvious how to compare two images with each other in a reasonable way. The idea of the scattering transform is now to compute a representation of the data that satisfies the following three properties: (ST1) Translation invariance. (ST2) Stability against local deformations. 1 From the perspective of statistics, a forward model tries to describe the entire distribution of (X, Y ), whereas a backward model merely corresponds to the conditional expectation E(Y σ(x)).

50 44 4 Theory for SPA Figure 4.2: An example of handwritten digits taken from the MNIST database [52]. The main goal here is to predict an unknown digit, i.e., the underlying image shall be mapped to its corresponding integer. (ST3) Preservation of high-frequency content, in order to retain important information on the data. Following [57], we will only consider the continuous setting here, meaning that all functions are defined on the real line, and not on a discrete index set. 1 Thus, let f L 1 (R) L 2 (R) be some real-valued signal and ψ L 1 (R) L 2 (R) be a (complex) wavelet. Now, we consider the continuous wavelet transform 2 W[j]f = f ψ j L 1 (R) L 2 (R) (4.7) at some scale j Z. Since wavelets are regular and localized functions, the mapping f W[j]f is stable under the action of local diffeomorphisms (cf. [57]), which particularly implies the property (ST2). Furthermore, by varying the wavelet scale j, we are able to capture the high-frequency content of f, as desired in (ST3). But the wavelet transform obviously commutes with translations, and therefore, we are actually far away from satisfying (ST1). A very naive approach to make (4.7) translation invariant would be to take the integral mean: [W[j](T c f)](t) dt = f(t u c)ψ j (u) du dt R R R = f(t u)ψ j (u) du dt = [W[j]f](t) dt, (4.8) R R where T c f(t) := f(t c) defines the translation operator for c R. However, every appropriate wavelet filter has at least one vanishing moment, that is, ψ j dt = 0. Thus, we particularly have W[j]f dt = 0 and (4.8) becomes trivial. R 1 Since MS-data is one-dimensional, we will not deal with the more general (and technical) case of multivariate functions, which is also covered by [12, 57]. 2 For an introduction to the required wavelet theory and its notational conventions, see Appendix A.2.
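This vanishing-moment argument can be checked numerically in a few lines; the snippet below uses a discretized Morlet filter as in (3.4) on an arbitrary grid and also previews the effect of the modulus non-linearity that is introduced next.

import numpy as np

# discretized Morlet filter (cf. (3.4)) and a Gaussian test signal on a fine grid
t = np.linspace(-40.0, 40.0, 8001)
dt = t[1] - t[0]
xi0 = 5.0
psi = np.pi**(-0.25) * (np.exp(1j * xi0 * t) - np.exp(-xi0**2 / 2)) * np.exp(-t**2 / 2)
f = np.exp(-t**2 / 4.0)

Wf = dt * np.convolve(f, psi, mode="same")   # W[0]f = f * psi at unit scale
print(abs(np.sum(Wf) * dt))                  # approximately 0: the plain integral is uninformative
print(np.sum(np.abs(Wf)) * dt)               # clearly positive: applying the modulus first is non-trivial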

51 4.2 Robustification via the Scattering Transform 45 In order to obtain a non-trivial translation invariant representation, it seems to be indispensable to combine the wavelet transform with some non-linearity (cf. [12]). For this, let us consider another operator M: L 1 (R) L 2 (R) L 1 (R) L 2 (R) which commutes with translations as well. Then, M(W[j]f) dt is clearly translation invariant. Interestingly, a necessary condition for this integral to be non-zero is that M must be a non-linear operator. Indeed, this can be easily seen by the following theorem from [36, Theorem ]: 1 Theorem 4.3 Let M be a continuous and linear operator on L 2 (R). Then M commutes with translations if and only if there exists a bounded function m: R R such that (Mg) = m ĝ for all g L 2 (R). 2 Using this result, and assuming that M is bounded and linear on (a dense subspace of) L 2 (R), we obtain M(W[j]f) dt = (M(W[j]f)) (0) = m(0) (W[j]f) (0) = m(0) W[j]f dt = 0. R Consequently, it makes only sense to look for a non-linear transformation M. To guarantee that M(W[j]f) dt satisfies (ST2), M should even commute with actions of displacement fields: For any τ : R R and g L 2 (R), one defines the generalized translation operator by [T τ g](x) := g(x τ(x)), x R. Then, we assume that M(T τ g) = T τ (Mg) for all those functions τ C 2 (R), for which Id τ is a diffeomorphism on R. For stability issues, we shall additionally impose that M is non-expansive, i.e., Mg Mh 2 g h 2 for g, h L 2 (R), and that the L 2 -norm is preserved, Mg 2 = g 2. With these assumptions, one can prove (see [11, Theorem 2.4.2] and [57]) that M must be actually the complex modulus, that is, [M(g)](t) = g(t) = Re(g(t)) 2 + Im(g(t)) 2, g L 2 (R), t R. To illustrate the impact of the complex modulus, let us suppose that ψ(t) = e iξ 0t θ(t), where ξ 0 > 0 and θ is a real-valued function which is concentrated around the origin. 3 Then, we can write the wavelet transform as ] [W[j]f](t) = e 2 j iξ 0t [(e 2 j iξ 0 [ ] f) θ j (t), R 1 From now on, we will frequently work with the (continuous) Fourier transform. For a brief introduction to the Fourier transform and its most important properties, see Appendix A.1. 2 In this setting, m is usually called a Fourier multiplier. 3 Note that this is only approximately true for the case of the Morlet wavelet ψ Mor, defined in (3.4).

52 46 4 Theory for SPA and, by taking the modulus, the modulation term is finally removed: W[j]f = (e 2 j iξ 0 [ ] f) θ j. (4.9) This effect is called a lower-frequency envelope in [57]; indeed, we observe that ( (e 2 j iξ 0 [ ] f) θ j ) (ξ) = f(ξ + 2 j ξ 0 ) θ(2 j ξ), ξ R, which means that the frequency content of f at scale j is propagated to the origin of the Fourier domain and then captured by the low-pass filter θ j. Note that it is sufficient to consider only positive values for ξ 0 in this situation, since we have f( ξ) = f(ξ) for all ξ R. So far, we have constructed (continuous) coefficients W[j]f dt that meet the challenges of (ST1) and (ST2). But the integration W[j]f dt = ( W[j]f ) (0) actually ignores the high-frequency values of W[j]f, which contradicts (ST3). 1 A key step of the scattering transform is therefore to recover the lost frequency content by computing the wavelet coefficients of W[j]f, i.e., W[j ]( W[j]f ) = f ψ j ψ j for some scale j Z. Then, in order to satisfy (ST1) and (ST2), we take the complex modulus and compute the integral again: f ψ j ψ j dt. R Iterating this approach leads to a non-linear cascade of wavelet transforms, which forms the basis of the definition of the scattering transform: Definition 4.4 ([57]) Let f L 1 (R) L 2 (R). For a path (or node) p = (j 1,..., j r ) Z r, r 1, we define the scattering propagator by U[p]f := f ψ j1 ψ j2 ψ jr, and U[ ]f := f. An integration then yields the associated scattering coefficients S[p]f := [U[p]f](t) dt. (4.10) R The scattering transform of f is finally obtained by collecting the coefficients of all paths as a sequence Sf := (S[p]f) p P, where P := {p = (j 1,..., j r ) r 0, j 1,..., j r Z} is the set of finite paths. 1 One might guess that some information of W[j]f is already lost when applying the modulus in (4.9). But surprisingly, this is not the case because the signal f can be essentially recovered from ( f ψ j ) j Z, as shown in [82].
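For illustration, Definition 4.4 translates into a short recursive NumPy sketch; the Morlet filter bank, the circular (FFT-based) convolutions, the Riemann-sum integration, and all parameters below are simplifications and do not reproduce the exact conventions of ScatNet.

import numpy as np

def morlet_bank(d, scales, xi0=5.0):
    """Dilated Morlet filters psi_j(t) = 2^-j psi(2^-j t) on a length-d grid."""
    t = np.arange(d) - d // 2
    bank = {}
    for j in scales:
        u = 2.0**(-j) * t
        bank[j] = 2.0**(-j) * np.pi**(-0.25) * (np.exp(1j * xi0 * u) - np.exp(-xi0**2 / 2)) * np.exp(-u**2 / 2)
    return bank

def circ_conv(f, g):
    # circular convolution via FFT
    return np.fft.ifft(np.fft.fft(f) * np.fft.fft(np.fft.ifftshift(g)))

def scattering_propagator(f, path, bank):
    """U[p]f = | ... ||f * psi_j1| * psi_j2| ... * psi_jr | for a path p = (j1, ..., jr)."""
    u = np.asarray(f, dtype=complex)
    for j in path:
        u = np.abs(circ_conv(u, bank[j]))
    return np.real(u)

def scattering_coefficient(f, path, bank):
    """S[p]f = integral of U[p]f (here: a simple Riemann sum on the unit grid)."""
    return np.sum(scattering_propagator(f, path, bank))

# toy usage: a first- and a second-order path applied to a Gaussian bump
d = 2**10
f = np.exp(-(np.arange(d) - 400.0)**2 / 30.0**2)
bank = morlet_bank(d, scales=range(0, 7))
print(scattering_coefficient(f, (4,), bank), scattering_coefficient(f, (3, 5), bank))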

53 4.2 Robustification via the Scattering Transform 47 One can now show that the scattering representation of f does indeed satisfy all three desired properties from above. But there are at least two difficulties when working with Sf: Most importantly, the integral mean in (4.10) removes all spatial information that is contained in the wavelet coefficients. This step was obviously necessary to fulfill (ST1), but in the case of MS-data, a certain spatial localization is essential, while a full translation invariance is not required to fix small shifts of the peak centers. Another drawback of Definition 4.4 is that it only works for functions in L 1 (R) L 2 (R), but we are actually aiming for an L 2 (R)-theory. For this purpose, [57] introduces a windowed version of the scattering transform. The underlying idea here is to relax the integration of (4.10) by a windowed averaging, using a low-pass filter φ L 2 (R) which is associated with the wavelet filter ψ: Definition 4.5 Let f L 2 (R). For a fixed window scale J Z, we consider the bounded path set P J := {p = (j 1,..., j r ) r 0, j 1,..., j r J} P. Then, we define the windowed scattering transform S J [p]f := U[p]f φ J = f ψ j1 ψ j2 ψ jr φ J, (4.11) for p = (j 1,..., j r ) P J. The windowed scattering transform at scale J is then given by the sequence S J [P J ]f := (S J [p]f) p PJ. Since ψ L 1 (R) L 2 (R), the windowed scattering transforms in (4.11) are well-defined bounded functions (see also Theorem A.6). In particular, S J [p]f is, compared to S[p]f, not a single number anymore. In the following, we will see some key results of [57] showing that (ST1) (ST3) are asymptotically satisfied for S J [P J ]f. For this purpose, we need to impose some technical assumptions on the wavelet and low-pass filter at first (cf. [57, Theorem 2.6]): Definition 4.6 Let ψ L 1 (R) L 2 (R) be a wavelet filter and φ L 2 (R) the associated low-pass filter. Then, ψ is called admissible if it is unitary, i.e., f 2 2 = f φ J W[j]f 2 2, f L 2 (R), j<j and there exists ξ 0 R as well as a non-negative function ρ, with ρ(ξ) φ(2ξ) and ρ(0) = 1, such that the function Ψ(ξ) := ρ(ξ ξ 0 ) 2 k N k (1 ρ(2 k (ξ ξ 0 )) 2 ) satisfies inf 1 ξ 2 j Z Ψ(2 j ξ) ψ(2 j ξ) 2 > 0.

54 48 4 Theory for SPA Before stating the actual result, it is useful to introduce another notation from [57]: For P P J, we denote S J [P]f := (S J [p]f) p P and U[P]f := (U[p]f) p P, and we define the norms S J [P]f 2 2 := S J [p]f 2 2 and U[P]f 2 2 := U[p]f 2 2. p P p P Moreover, arithmetic operations on the scattering transform are always understood to be component-wise. The following theorem summarizes the main (asymptotic) properties of the windowed scattering transform. Detailed proofs of all statements can be found in [57]. Theorem 4.7 Let ψ L 1 (R) L 2 (R) be an admissible wavelet with an associated low-pass filter φ L 2 (R). Then, the following statements hold for any f L 2 (R): (1) Translation invariance: For all c R, we have S J [P J ]f S J [P J ](T c f) 2 0, J. (4.12) (2) Energy preservation and decay: Let P r J := {p P J p = (j 1,..., j r ) Z r } be the set of all paths with length r 0. Then, we have and S J [P J ] 2 = f 2. U[P r J]f 2 0, r, (3) Stability against deformations: There exists a constant C > 0 (independent of f), such that for any τ C 2 (R) with τ 1 2, we have where S J [P J ]f S J [P J ](T τ f) 2 C U[P J ]f 1 K(τ), (4.13) K(τ) := 2 J τ + τ max{log( τ τ ), 1} + τ with τ := max t,u R τ(t) τ(u), and U[P J ]f 1 = U[PJ]f r 2. r=0 The first statement of Theorem 4.7 is actually a special case of the third one by choosing τ c. Hence, the decay of (4.12) can be (asymptotically) bounded by 2 J τ = 2 J c from above, implying that translations by c 2 J are basically negligible. This particularly shows that the choice of the window size (or equivalently J) has a great influence on the translation invariance of the scattering transform. The property of Theorem 4.7(3) is usually called the Lipschitz-continuity to actions of diffeomorphisms (cf. [12, 57]); when J is large, the estimate of (4.13) is dominated by ( τ + τ ), which can be seen as a measure for the strength of the displacement field τ.

55 4.2 Robustification via the Scattering Transform 49 Figure 4.3: This figure is taken from [57]. It illustrates that the scattering propagation can be visualized by a tree structure. Each node indicates the scattering coefficients which were achieved by following the respective path. The collection of all nodes at the same depth then corresponds to a layer of the scattering transform. Finally, the second part of Theorem 4.7 is closely related to (ST3). Indeed, S J [P J ] preserves the norm of f, but the energy 1 of the scatting propagator simultaneously decreases as path length r grows. Thus, the (frequency) content that is captured by very long cascades of wavelet transforms can be neglected at some point. It is therefore sufficient to consider just a small subset of P J ; in fact, the author of [57] shows that the energy of the scattering transform is mostly concentrating along frequency-decreasing paths, i.e., p = (j 1,..., j r ) PJ r with j 1 < j 2 < < j r. This is also one reason why the STimplementation of ScatNet restricts to decreasing paths, where the maximal length r is actually limited to 3. Remark 4.8 (1) The scattering transform is an example of a convolutional neural network, which were originally considered in [53]. A convolutional network, in its general form, is a cascade of convolutions with appropriate kernel functions (which are wavelets in our case) and a pooling non-linearity (this is the complex modulus for the scattering transform). Such an architecture can be comfortably visualized by a tree (see Figure 4.3). Recently, this approach has been successfully applied to various recognition tasks ([53]) and visual perception ([8, 68]). (2) The statements of Theorem 4.7 are only of asymptotic nature. Therefore, we might ask whether the windowed scattering transform also converges in some sense to the original scattering coefficients from Definition 4.4. This is indeed the case when we embedded the sets P J, J Z, into the uncountable set of infinite paths P := Z N. Then, one can define a measure and a metric on P and prove that S J [P J ]f converges to Sf L 2 (P ) for every f L 2 (R). For details, the interested reader is referred to [57, Section 3]. (3) The very recent work of [83] adapts the idea of the scattering transform to semidiscrete frames, including other important representation systems, such as Gabor frames, curvelets, shearlets, and ridgelets. This particularly allows us to capture more complicated geometric structures than point singularities, which can be resolved by wavelets. 1 The energy of a function is referred to its integral content, usually measured by the L 2 -norm.
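The decay of (4.12) can also be observed empirically in a small self-contained experiment (again with an ad-hoc discretized Morlet filter and arbitrary parameters): the relative deviation between the first-order windowed scattering coefficients of a Gaussian peak and of a shifted copy shrinks as the window scale J grows.

import numpy as np

def scat1(x, j, J, xi0=5.0, beta0=1.0):
    # first-order windowed scattering coefficients S_J[(j)]x = |x * psi_j| * phi_J
    d = len(x)
    t = np.arange(d) - d // 2
    u = 2.0**(-j) * t
    psi = 2.0**(-j) * np.pi**(-0.25) * (np.exp(1j * xi0 * u) - np.exp(-xi0**2 / 2)) * np.exp(-u**2 / 2)
    phi = np.exp(-t**2 / (2 * (2.0**J * beta0)**2))
    phi /= phi.sum()
    conv = lambda a, b: np.fft.ifft(np.fft.fft(a) * np.fft.fft(np.fft.ifftshift(b)))
    return np.real(conv(np.abs(conv(x, psi)), phi))

d = 2**12
x = np.exp(-(np.arange(d) - 2000.0)**2 / 30.0**2)   # a single Gaussian peak
x_shift = np.roll(x, 16)                             # circularly shifted copy T_c x with c = 16
for J in (2, 4, 6, 8):
    S, S_shift = scat1(x, 4, J), scat1(x_shift, 4, J)
    print(f"J = {J}: relative deviation = {np.linalg.norm(S - S_shift) / np.linalg.norm(S):.3f}")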

56 50 4 Theory for SPA Moreover, by using continuous frame theory, the authors of [83] could even show that the invariance results of Theorem 4.7 can be generalized to this abstract setting Application to Proteomics Data With this general framework in mind, we now return to the analysis of SPA. Recalling the computation of (3.2), we easily observe that SPA-Scattering is a special case of the windowed scattering transform from Definition 4.5. Here, p = (j) is simply a path of length one, and thus, we are considering only one single layer of scattering coefficients. Furthermore, we have proposed to use the Morlet wavelet ψ Mor (t) = π 1/4 (e iξ 0t e ξ2 0 /2 ) exp( t 2 /2), t R, (4.14) for SPA-Scattering (cf. (3.4)), which does not satisfy the admissibility condition of Definition 4.6 and is not even unitary. 1 However, due to its simplicity, this wavelet filter is widely-used in practice and also performs empirically well. Let us now investigate how the scattering propagator affects MS-data. Since wavelet filters are well-localized, it is sufficient to consider a single Gaussian peak as signal, i.e., we assume that 2 f(t) = I exp( t 2 /β 2 ). For the following computations, we will take advantage of the fact that the common choice of ξ 0 = 5 (see Subsection 3.2.2) makes the second summand of (4.14) almost negligible, implying that ψ Mor (t) π 1/4 e iξ 0t exp( t 2 /2). Then, we can proceed as in (4.9): U[j]f = W[j]f = f ψ Mor j π 1/4 (e 2 j iξ 0 [ ] f) ( 2 j exp ( ( )2 2 2j+1 )). To see the actual impact of this convolution, it is again useful to work with the Fourier representation: ((e 2 j iξ 0 [ ] f) ( 2 j exp ( )) ) ( )2 2 (ξ) = 2π f(ξ + 2 j ξ 2j+1 0 ) exp ( (2j ξ) 2 ) 2 ) ) = 2 π I β exp ( (β(ξ+2 j ξ 0 )) 2 4 }{{} =:G f (ξ) exp ( (2j ξ) 2 2 }{{} =:G ψ (ξ). (4.15) Figure 4.4 visualizes this product: As already stated in the general context of Subsection 4.2.1, the frequency content of f at scale j is pushed to the origin and then trapped by a sharp Gaussian filter G ψ. This interpretation particularly justifies the name scattering propagator because the energy of f is propagated to a lower-frequency part here. 1 But there actually exist sophisticated constructions of admissible wavelets, for instance, an analytic cubic spline Battle-Lemarié wavelet (cf. [57]). 2 Since U[p] commutes with translations, we can assume w.l.o.g. that the peak center c R of f is equal to 0 here.

57 4.2 Robustification via the Scattering Transform 51 G f Gψ 2 j ξ 0 0 ξ Figure 4.4: Illustration of G f and G ψ. Intuitively, G ψ plays the role of a window function which cuts out the content of G f around the origin. An important aspect of U[j]f concerns the optimal choice of the wavelet scale j. The factorization of (4.15) indicates that this issue involves a certain trade-off: When j is large (ψj Mor corresponding to low frequencies), G f is almost centered at the point ξ = 0, where G ψ also attains its maximum; however, G ψ is extremely narrow in this case, such that the energy of U[j]f becomes relatively small. For a smaller (possibly negative) scale, G ψ is widely-spread, but G f is then concentrated far away from the origin. Thus, a good choice of j, which maximizes the scattering energy, is somewhere in between these cases, and heavily depends on the peak width parameter β. This is particularly consistent with the fact that the scale j should be adaptively chosen such that the characteristic structure of a spectrum is captured by the wavelet transform. As a useful side-effect, undesired details, such as the baseline (low frequencies) and the associated low-amplitude noise (high frequencies), are filtered out in this way. The second important parameter of SPA-Scattering is the window scale J. The associated low-pass filter of the Morlet wavelet is a Gaussian function (cf. (3.5)) φ(t) Gauss = 1 2πβ 2 0 exp( t 2 /(2β 2 0)), t R, with some fixed parameter β 0 > 0. Thus, computing the scattering coefficients, ( S J [j]f = U[j]f φ Gauss 1 J = U[j]f 2 J β 0 2π exp ( ) ) t 2 2 (2 J β 0, ) 2 actually corresponds to smoothing by a Gaussian kernel with standard deviation of 2 J β 0. Such a simple smoothing step was already part of the original SPA-framework in [21] and will be briefly discussed below in Remark 4.9. We have already pointed out in the course of Theorem 4.7 that the degree of translation invariance is essentially controlled by J: A larger value of J widens the window function φ Gauss and makes the scattering coefficients increasingly robust against deformations, but at the same time, more and more details of the data get lost. The major challenge consists therefore in a trade-off between preserving relevant information and a certain robustness. But again, the right choice of J heavily depends on the quality of the data as well as on the strength of the peak center displacements.
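The scale trade-off described above can be observed directly in a small experiment: for a discretized Gaussian peak, the energy of U[j]f first grows and then decays as j increases, and the maximizing scale shifts with the peak width beta. The grid, the choice xi_0 = 5, and the two widths below are purely illustrative, and the scale j = 0 is omitted to avoid aliasing of the discretized Morlet filter.

import numpy as np

d = 2**12
t = np.arange(d) - d // 2
xi0 = 5.0

def morlet_j(j):
    # dilated Morlet filter psi_j(t) = 2^-j psi(2^-j t), cf. (3.4)
    u = 2.0**(-j) * t
    return 2.0**(-j) * np.pi**(-0.25) * (np.exp(1j * xi0 * u) - np.exp(-xi0**2 / 2)) * np.exp(-u**2 / 2)

def U_j(f, j):
    # scattering propagator U[j]f = |f * psi_j| via circular FFT convolution
    return np.abs(np.fft.ifft(np.fft.fft(f) * np.fft.fft(np.fft.ifftshift(morlet_j(j)))))

for beta in (10.0, 40.0):                       # two different peak widths
    f = np.exp(-t**2 / beta**2)                 # Gaussian peak f(t) = exp(-t^2 / beta^2)
    energies = {j: np.linalg.norm(U_j(f, j)) for j in range(1, 9)}
    j_best = max(energies, key=energies.get)
    print(f"beta = {beta}: the energy of U[j]f is maximized at scale j = {j_best}")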

58 52 4 Theory for SPA A very interesting point concerns the question of why we are only using a minor part of the scattering transform S J [P J ]f, namely a single node of the first layer. In fact, there are several reasons for doing so: (1) In practice, the widths of the peaks usually have the same order of magnitude. Hence, the most significant content (energy) of a mass spectrum is probably contained in a single wavelet scale j. (2) Numerical simulations have shown that the scattering energy substantially decreases in the second layer. Thus, one would not benefit from considering longer paths. This observation is also theoretically well-founded by Theorem 4.7(2). (3) The scattering transform is obviously a homogeneous mapping. This particularly implies that the relative peak heights are essentially preserved in the first layer (see also Figure 3.4). But for deeper layers, this is usually not true anymore, which poses the risk of a wrong representation of the protein concentrations. (4) The statements of Theorem 4.7 indicate that the shift invariance is independent of the actual path length. This should be also intuitively clear because the purpose of iterating the wavelet transform is merely to retain lost information. These aspects are clearly motivated by the special structure of mass spectra. However, for more general types of data, it could become necessary to take into account multiple elements of S J [P J ]f, for instance, when there are peaks of different shapes contained in a single data set. Such an extension of SPA would cause several technical difficulties, especially for feature selection. A brief discussion of this challenge can be found in Section 6.1. Finally, it is worth mentioning that there is also a higher-dimensional (and much more technical) version of the scattering transform (see [12, 57] for details). This generalization could become interesting, for example, when we deal with two-dimensional mass spectra, where the time is an additional degree of freedom (see Figure 6.1). Remark 4.9 (Smoothing by a Gaussian Kernel) The step of SPA-Scattering is a real extension of the SPA-approach in [21]. Indeed, the original method just performs a local averaging by a Gaussian window function: ( ) f φ Gauss 1 = f exp( t 2 /(2β 2 2πβ 2 0)), (4.16) 0 where β 0 > 0 has to be adaptively chosen. By our model from Section 4.1, we can assume that f is basically a sum of (almost) Gaussian-shaped atoms. Since the convolution of two Gaussians is again a Gaussian, (4.16) indeed preserves the structure of MS-data. On the other hand, (4.16) operates as a low-pass filter, making the resulting function robust against low-amplitude noise and small peak shifts. These two observations explain why smoothing by Gaussian kernels works empirically well in practice (cf. [21]). Comparing (4.16) with SPA-Scattering, one realizes that the corresponding results look very similar. In fact, the above discussion (especially the argument of (1)) shows that the essential structure of mass spectra can be captured by a single-scale wavelet filtering. The additional scattering propagation therefore affects the data not too much. However, the framework of the scattering transform is designed in a much more generic fashion and

59 4.3 Feature Selection via Robust 1-Bit Compressed Sensing 53 offers a high flexibility. Thus, the step of SPA-Scattering could be easily adapted to more general situations, for instance, when the underlying data contains patterns which correspond to different resolution levels. For the remainder of this chapter, we will assume that the input data set is still generated by the forward model of (FwM) although SPA-Scattering was actually applied to it before. This simplification is justified by the argumentation of Remark 4.9, which showed us that the essential structure of MS-data is not destroyed in this step. The application of SPA-Scattering has however a substantial impact on the underlying probability distributions of the forward model (FwM): Due to the wavelet filtering, the variance of the baseline noise vector N is significantly reduced. And even more crucial in fact, this was the main reason for considering the scattering transform at all this is also true for the peak center variables C 1,..., C M, as we have shown above. In practice, one often observes that the horizontal deviations of the peaks are usually much smaller than their widths, i.e., β m σ(c m ), m [M]. Consequently, we may even assume that the peak shifts are completely eliminated after preprocessing, so that the feature atoms a 1,..., a M R d are actually deterministic vectors. The behavior of the latent factors S 1,..., S M, on the other hand, is only marginally changed by SPA-Scattering because this is a homogeneous operation. 4.3 Feature Selection via Robust 1-Bit Compressed Sensing We now return to the central steps of SPA-Standardize and SPA-Select. In this section, we will analyze the robust 1-bit compressed sensing approach of (Rob 1-Bit), while the last section is devoted to a detailed discussion of (l 1 -SVM) and (LASSO) Why Using (1-Bit) Compressed Sensing? At first, let us recall the general linear decision model (LDM) on which SPA-Select is based: 1 y = sign( x, ω ), y { 1, +1}, x, ω R d. (LDM ) As already stated in Subsection 3.2.4, there are two interpretations of this simple model. 2 From the perspective of machine learning, we would like to learn a vector ω that explains the observed pair (x, y) in terms of (LDM ). Here, the data vector x is actually given by nature and serves as an input quantity for learning algorithms. On the other hand, in (1- bit) compressed sensing, x is rather considered as a measurement vector which is used to acquire a measurement y of ω. In this way, the (input) vector ω is compressed by (1-bit) measurements and the ultimate task is then to recover it from these measurements. 1 To keep the argumentation clear and general, we restate several equations of Chapter 3 here without additional notations (sub- and superscripts, etc.) that were used in the algorithmic part of SPA. 2 For brief introductions to compressed sensing and machine learning, see Appendix A.5 and A.6, respectively.
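As a toy illustration of the compressed sensing reading of (LDM′), the following MATLAB snippet acquires 1-bit measurements of a sparse vector with i.i.d. Gaussian measurement vectors -- precisely the idealized setting mentioned above, which MS-data does not satisfy -- and recovers the support from the correlation vector Σ_k y_k x_k. All parameter values are arbitrary and chosen for illustration only.

    % 1-bit measurements of a sparse vector w0 with i.i.d. Gaussian measurement vectors
    d = 1000; n = 500; s = 5;
    w0 = zeros(d, 1); w0(randperm(d, s)) = 1; w0 = w0 / norm(w0);
    X = randn(n, d);                   % rows x_k: measurement vectors
    y = sign(X * w0);                  % 1-bit observations y_k = sign(<x_k, w0>)
    xbar = X' * y;                     % correlation vector sum_k y_k x_k
    [~, idx] = sort(abs(xbar), 'descend');
    recovered = isequal(sort(idx(1:s)), find(w0))   % support recovery succeeds w.h.p.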

60 54 4 Theory for SPA The viewpoint of machine learning is of course much more natural for the framework of SPA, where x represents a mass spectrum. In contrast, most theoretical results of compressed sensing rely on the fact that x contains i.i.d. random variables, which have no physical meaning at all. Such an assumption is obviously never satisfied for MS-data, since many entries of x are extremely coherent. In particular, it makes no sense to hope for a unique sparse recovery of ω from (LDM ); and even more crucial, (LDM ) is just a backward model (cf. (BwM) in Subsection 4.1.3), implying that the feature vector ω might not appropriately describe how the datum (x, y) was generated. Therefore, it is sometimes misleading to speak of a recovery in this context. However, there are conspicuous similarities between machine learning and compressed sensing. In fact, the applied algorithms of both fields are resembling (LASSO, basis pursuit, etc.) and their theory relies on the same mathematical fundamentals (concentration of measure, convex geometry, etc.). It is therefore a bit astonishing that both communities have only rarely interacted with each other up to now. Thus, compressed-sensing-based algorithms are not very common for solving classification tasks, although they actually provide promising results. This is also a particular reason why the authors of [21] have put their main focus on the 1-bit compressed sensing approach of (Rob 1-Bit). As a general hope, a further investigation of the relationship between machine learning and compressed sensing could allow for deeper theoretical insights, finally leading to an improvement of the classical (l 1 -based) tools Analysis of (Rob 1-Bit) We are now going to analyze the robust 1-bit compressed sensing approach of SPA- Select in detail. At first, let us restate the optimization problem (Rob 1-Bit): 1 max ω R d k=1 n y k x k, ω s.t. ω 2 1 and ω 1 s, (Rob 1-Bit ) where x 1,..., x n R d are input vectors (mass spectra), y 1,..., y n { 1, +1} binary observations (health status), and s > 0 the sparsity parameter. Since the objective functional of (Rob 1-Bit ) is linear and its constraints are convex, we indeed have a convex program, which always attains a (possibly non-unique) maximum. 2 Moreover, the linearity allows for a further simplification: Putting x := n k=1 y kx k R d, it turns out that (Rob 1-Bit ) is equivalent to max ω R d x, ω s.t. ω and ω 1 s. (4.17) Note that a square was added to the l 2 -constraint here, in order to make the following computations a bit more convenient. This observation is remarkable because it implies 1 For the sake of clarity, we again simplify the notation that was used in Subsection An introduction to convex optimization is given in Appendix A.4.

Figure 4.5: An illustration of the geometry of K_{d,s} in two dimensions (d = 2). Intuitively, an intersection of sB_1^d with the euclidean unit ball leads to a smoothing of the vertices of the l_1-diamond.

that the solution of (Rob 1-Bit′) solely depends on the covariance vector x̄.¹ When dealing with centered MS-data, we particularly obtain

    \bar{x} = \sum_{k=1}^n y_k x_k = \sum_{k \in G_+} x_k - \sum_{k \in G_-} x_k = 2 \sum_{k \in G_+} x_k,

where the last step uses that the data are centered, so that −Σ_{k∈G_−} x_k = Σ_{k∈G_+} x_k. Thus, x̄ is proportional to the average spectrum of each of the two groups, and solving (Rob 1-Bit′) does not take into account any specific information that is manifested within the groups. In fact, we will see in Proposition 4.10 that (Rob 1-Bit′) is nothing else but a (rescaled) soft-thresholding of x̄.

Before that, let us consider the sparsity-promoting constraint of (Rob 1-Bit′), which reads ω ∈ K_{d,s} := B_2^d ∩ sB_1^d in set notation; see Figure 4.5 for a visualization. To understand how K_{d,s} is related to the classical notion of sparsity, we consider the set

    S_{d,s} := \{ z \in \mathbb{R}^d \mid \|z\|_2 \le 1,\ \|z\|_0 \le s \}

and observe that

    \operatorname{conv}(S_{d,s}) \subset K_{d,s} \subset 2\,\operatorname{conv}(S_{d,s}).

For a proof of these inclusions, see [64, Lemma 3.1]. Thus, K_{d,s} is essentially the convex relaxation of all s-sparse vectors lying in the unit ball. This yields a robust and real-valued version of sparsity, which is referred to as the notion of effectively sparse vectors in [64].

The following proposition reveals the explicit structure of the solutions of (Rob 1-Bit′) and gives a sufficient condition for uniqueness:

¹ When the samples (y_1, ..., y_n) and (x_1, ..., x_n) are (empirically) centered, their (component-wise) covariance is indeed given by (1/n) Σ_{k=1}^n y_k x_k, which is equal to (1/n) x̄.

Proposition 4.10 Let x̄ := Σ_{k=1}^n y_k x_k and assume that there exists an l̂ ∈ [d] such that |x̄_{l̂}| > |x̄_l| for all l ≠ l̂ (i.e., the maximal entry of |x̄| is unique). Then, the maximizer ω^# ∈ ℝ^d of (Rob 1-Bit′), with input {(x_k, y_k)}_{k∈[n]}, is unique and can be expressed by a rescaled soft-thresholding of x̄. More precisely, there exist a scaling factor λ_1 > 0 and a threshold λ_2 ≥ 0, both depending on s and x̄, such that

    \omega^\#_l = \begin{cases} \dfrac{\bar{x}_l - \lambda_2}{\lambda_1}, & \bar{x}_l \ge \lambda_2, \\[4pt] \dfrac{\bar{x}_l + \lambda_2}{\lambda_1}, & \bar{x}_l \le -\lambda_2, \\[4pt] 0, & \text{otherwise}, \end{cases} \qquad l \in [d].   (4.18)

Proof. As already stated above, we may solve (4.17) instead of (Rob 1-Bit′). Thus, let ω^# be any maximizer of (4.17). Since there obviously exists a vector ω ∈ ℝ^d with ‖ω‖_2² < 1 and ‖ω‖_1 < s, Slater's condition of Theorem A.17 is satisfied and strong duality holds for our problem. Hence, by Proposition A.18, there exist ν_1, ν_2 ≥ 0 such that ω^# is also a minimizer of¹

    L(\omega) := L(\omega, \nu_1, \nu_2) := -\langle \bar{x}, \omega \rangle + \nu_1 \|\omega\|_2^2 + \nu_2 \|\omega\|_1 = \sum_{l=1}^d \big( -\bar{x}_l \omega_l + \nu_1 \omega_l^2 + \nu_2 |\omega_l| \big) =: \sum_{l=1}^d L_l(\omega_l), \quad \omega = (\omega_1, \dots, \omega_d) \in \mathbb{R}^d.

It is obviously sufficient to minimize each summand L_l(ω_l) separately. So, let us consider some fixed index l ∈ [d] now. Since ω_l² and |ω_l| do not depend on the sign of ω_l, a necessary condition for minimizing ω_l ↦ L_l(ω_l) is that sign(x̄_l) = sign(ω^#_l). We shall distinguish two cases for the value of ν_1:

ν_1 = 0: Let us first assume that x̄_l ≥ 0. In this case, some ω^#_l ≥ 0 minimizes ω_l ↦ L_l(ω_l) = ν_2|ω_l| − ω_l x̄_l. If ν_2 > x̄_l, we have

    L_l(\omega_l) = \nu_2 |\omega_l| - \omega_l \bar{x}_l \ge |\omega_l| (\nu_2 - \bar{x}_l) \ge 0,

i.e., ω^#_l = 0 is the unique minimizer. On the other hand, if ν_2 < x̄_l, we obtain

    L_l(\omega_l) = \nu_2 |\omega_l| - \omega_l \bar{x}_l = |\omega_l| \big( \nu_2 - \operatorname{sign}(\omega_l)\, \bar{x}_l \big),

which becomes arbitrarily small as ω_l → ∞. So we end up with a contradiction here. Since this argument immediately applies to the case of x̄_l ≤ 0, we conclude that ν_1 = 0 can only hold if ‖x̄‖_∞ ≤ ν_2. Therefore, it remains to consider the case of |x̄_l| = ν_2. Here, the minimizer of L_l is not unique anymore, but by the assumption of the proposition, we may assume that |x̄_l| is the

¹ Note that we can easily transform (4.17) into a minimization problem by adding a minus sign to the scalar product.

unique maximum of |x̄|. Then, |x̄_{l′}| < ν_2 for all l′ ≠ l, which implies that ω^#_{l′} = 0 for these indices. Hence, ω^# has only one single non-zero entry, and recalling (4.17), it is easy to see that this entry must be unique. In particular, ω^# can be written as a soft-thresholding of x̄ with appropriate choices of λ_1 and λ_2.

ν_1 > 0: Let us again assume that x̄_l ≥ 0. At first, we observe that, for ω_l < 0, we have

    L_l(\omega_l) = -\bar{x}_l \omega_l + \nu_1 \omega_l^2 + \nu_2 |\omega_l| > 0 = L_l(0),

since −x̄_l ω_l ≥ 0, ν_1 ω_l² > 0, and ν_2|ω_l| ≥ 0. Hence, the minimum of L_l must be attained for some ω_l ≥ 0. Since L_l is smooth on ℝ \ {0}, we obtain for ω_l > 0 that

    L_l'(\omega_l) = -\bar{x}_l + 2\nu_1 \omega_l + \nu_2 = 0 \quad \Longleftrightarrow \quad \omega_l = \frac{\bar{x}_l - \nu_2}{2\nu_1}.

Obviously, this statement only makes sense when x̄_l > ν_2. In this case, we have

    L_l\Big( \frac{\bar{x}_l - \nu_2}{2\nu_1} \Big) = -\frac{(\bar{x}_l - \nu_2)^2}{4\nu_1} < 0 = L_l(0).

The minimizer ω^#_l is again unique because L_l is strictly convex. Thus, we obtain ω^#_l = (x̄_l − ν_2)/(2ν_1) for x̄_l > ν_2, and if x̄_l ≤ ν_2, we simply have ω^#_l = 0. But this can be clearly identified with a rescaled soft-thresholding, where λ_1 = 2ν_1 and λ_2 = ν_2. The argumentation for x̄_l ≤ 0 works exactly the same and finally leads to the claim of (4.18).

Remark 4.11 Note that the mild assumption of a unique maximal entry of |x̄| in Proposition 4.10 was sufficient to guarantee a unique solution of (Rob 1-Bit′). However, when this is omitted, one can still show that there exists a maximizer ω^# which can be written as a soft-thresholding of x̄. The parameters λ_1 and λ_2 both depend on x̄ and s, but since the equivalence of Proposition 4.10 is very implicit, it seems to be difficult to derive an explicit formula for them. However, we shall not worry about this missing relationship because our major interest concerns qualitative results on feature selection anyway. Moreover, the numerical experiments of Chapter 5 give strong evidence that the threshold λ_2 decays monotonically and continuously as s is increased.

Proposition 4.10 shows us that SPA-Select via (Rob 1-Bit) is a relatively simple approach. In fact, it can be assigned to the class of mass-univariate methods, meaning that each feature of the data set is analyzed and evaluated separately. This is contrary to multivariate methods, such as (l_1-SVM) and (LASSO), which build a classifier by combining information of different feature variables. Therefore, it is not very surprising that the application domain of our 1-bit compressed sensing approach is quite limited. But on the other hand, its simplicity brings the advantage that the selection process is much easier to understand, which especially allows for a rigorous analysis.
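In MATLAB, the resulting selection rule can be sketched in a few lines. Here, X denotes the n×d matrix of (preprocessed) spectra, y the ±1 label vector, and λ_1, λ_2 are treated as given constants, since their dependence on the sparsity parameter s is only implicit (cf. Remark 4.11). This is only an illustration of (4.18), not the implementation described in Appendix B.

    % Rescaled soft-thresholding of the covariance vector xbar, cf. (4.18)
    xbar  = X' * y;                                          % sum_k y_k x_k
    wSoft = sign(xbar) .* max(abs(xbar) - lambda2, 0) / lambda1;
    selected = find(wSoft ~= 0);                             % indices of selected features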

Statistical Analysis of the Proteomics Model

We are now going to investigate the actual feature selection step of SPA. More precisely, we would like to analyze how SPA-Standardize and SPA-Select via (Rob 1-Bit) perform for our proteomics forward model from Section 4.1. In this subsection, it is therefore assumed that the input spectra x_1, ..., x_n ∈ ℝ^d are drawn from the probability distribution of X in (FwM). Furthermore, by the last paragraph of Subsection 4.2.2, we may suppose that SPA-Scattering was already applied to x_1, ..., x_n, and in particular, that the feature atoms a_1, ..., a_M are deterministic with C_1 ≡ c_1, ..., C_M ≡ c_M.¹

At first, we shall study the impact of SPA-Standardize. Recalling the individual lines of Algorithm 3.4, it turns out that the actual computations are just empirical estimates of statistical quantities (expectation, correlation, etc.; see Definition A.8). Hence, there also exists a continuous analog of SPA-Standardize which can be applied to the random vector X = (X_1, ..., X_d). Since all operations are performed component-wise, we fix some l ∈ [d] now and consider

    X_l = \sum_{m=1}^M S_m a_{m,l} + N_l.

The (continuous versions of) Lines 1 and 2 of SPA-Standardize basically perform a centering of X_l, that is,

    E(X_l) = \sum_m E(S_m) a_{m,l} + E(N_l) = \sum_m E(S_m) a_{m,l},
    \tilde{X}_l := X_l - E(X_l) = \sum_m (S_m - E(S_m))\, a_{m,l} + N_l =: \sum_m \tilde{S}_m a_{m,l} + N_l,

using that E(N_l) = 0. Note that centering Y is not necessary, due to Y ∼ Rad. Since N_l is independent of Y and S_1, ..., S_M, the variance and correlation of Lines 3 and 4 are given by

    \sigma_{X_l}^2 := V(X_l) = V(\tilde{X}_l) = V\Big( \sum_m \tilde{S}_m a_{m,l} \Big) + V(N_l) =: \sigma_{A_l}^2 + \sigma_{N_l}^2,

    \rho_{X_l} := \rho(Y, X_l) = \frac{E(Y \tilde{X}_l)}{\sqrt{V(Y)}\sqrt{V(X_l)}} = \frac{\sum_m a_{m,l}\, E(Y \tilde{S}_m)}{\sigma_{X_l}} \overset{(4.5)}{=} \sum_m \frac{a_{m,l}\, \sigma_{S_m}}{\sqrt{\sigma_{A_l}^2 + \sigma_{N_l}^2}}\, \rho_{S_m}.   (4.19)

The expression for σ²_{X_l} cannot be simplified further at the moment because we did not assume that S_1, ..., S_M are independent. The final interpolation of Line 5 can be adopted

¹ The output of SPA-Scattering is actually d′-dimensional in Chapter 3. But in order to avoid an overload of notation, we will simply use the letter d for the dimension of the ambient space here.

65 4.3 Feature Selection via Robust 1-Bit Compressed Sensing 59 literally: X std l,α := α l λ X l + (1 α l ) = ( α l λ + 1 α l σa 2 l + σn 2 l X l σ Xl ) ( ) S m a m,l + N l, where α l = α cexp(1 ρ X ) l with α [0, 1], c exp 0, and λ > 0. Similar to the empirical algorithm of SPA-Standardize, one can regard these computations as a weighted interpolation between X l and its standardization X l σ Xl. Next, we shall adapt SPA-Select via (Rob 1-Bit) to the continuous proteomics model as well: The previous subsection showed us that (Rob 1-Bit) actually performs a softthresholding of x := n k=1 y kx std k. Up to a constant factor of 1 n, the vector x is equal to the empirical covariance between (x std 1,..., xstd n ) and (y 1,..., y n ). The continuous analog of (Rob 1-Bit) is therefore obviously a soft-thresholding of the covariance vector Cov(Y, Xα std ) = E(Y Xα std ). This is again a component-wise operation, and for a fixed l [d], we can expand the covariance in terms of the model parameters: ( ) ( ) Cov(Y, Xl,α std ) = E(Y Xstd l,α ) = α l λ + 1 α l a m,l E(Y S m ) = ( α l λ + 1 α l σa 2 +σ 2 l N l ) m σ 2 A l +σ 2 N l m ( ) a m,l σ Sm ρ Sm. (4.20) m This expression allows for a much better understanding of the feature selection process via (Rob 1-Bit). At first, let us assume that α = 1. Then, (4.20) can be simplified to ( ) Cov(Y, Xl,1 std ) = λ a m,l σ Sm ρ Sm. (4.21) m Thus, the order in which the features are selected essentially depends on the values of a m,l σ Sm ρ Sm, m [M]. In particular, we can make the argumentation of Subsection precise now: If the m-th feature atom is large around the index l, meaning that l is close to the peak center c m, and has a noisy amplitude (σ Sm large), then the product a m,l σ Sm ρ Sm will be also large, even when ρ Sm is relatively small. Such a situation is quite undesirable because our goal is rather to select features in terms of their disease correlation coefficients ρ S1,..., ρ SM. In contrast, if α = 0, (Rob 1-Bit) degenerates to a pure correlation analysis of X and Y ; in fact, we obtain Cov(Y, X std l,0 ) = m a m,l σ Sm σ 2 Al + σ2 Nl ρ Sm (4.19) = ρ Xl. (4.22) At this point, the discrepancy between forward and backward models, which we have

66 60 4 Theory for SPA discussed in Subsection 4.1.3, comes into play: Indeed, (4.21) and (4.22) indicate that the output of (Rob 1-Bit) is in general not sufficient to recover the desired correlation coefficients ρ S1,..., ρ SM. This problem is actually due to the fact that 1-bit compressed sensing relies on the simplified backward model of (BwM), which does not incorporate any prior knowledge about the latent factors or feature atoms. As a consequence, to be still able to extract relevant information from solutions of (Rob 1-Bit), we need to impose further assumptions on the proteomics model (FwM). A careful study of real-world data sets (see Figure 2.2 for example) shows that the pairwise distances between the peak centers c 1,..., c M are relatively large compared to the corresponding peak width parameters β 1,..., β M, i.e., c m c m max{β m, β m } for all m, m [M] with m m. Since Gaussians possess a rapid spatial decay, we may therefore assume that each individual peak is well isolated from the remaining ones. This particularly implies that, for a fixed l [d], there exists at most one significant feature atom a m. Thus, let us assume that l is close to the center of the m -th feature atom. Then, we have a m,l a m,l for m m and the above computations can be (approximately) simplified: ( σa 2 l = V Cov(Y, X std l,0 ) = m m ) S m a m,l V ( Sm a m,l) = a 2 m,l σ2 S, m a m,l σ Sm σ 2 Al + σ2 Nl ρ Sm a m,l σ Sm a 2 m,l σ2 Sm + σ2 Nl ρ Sm, Cov(Y, X std l,1 ) = λ ( m a m,l σ Sm ρ Sm ) λ a m,l σ Sm ρ Sm. Supposed that σ 2 S m σ 2 N l, the signal-to-noise ratio 1 σ 2 A l σ 2 N l a2 m,l σ2 S m σ 2 N l becomes large, so that we finally obtain the approximation Cov(Y, X std l,0 ) ρ S m for α = 0. In this case, (Rob 1-Bit) has indeed a very desirable behavior, which allows us to interpret the covariance vector directly. Finally, we shall consider the case where the index l [d] corresponds to a feature variable which is sufficiently far away from any of the peak centers. The corresponding values a 1,l,..., a M,l are then almost negligible compared to the amplitude N l of the baseline noise, i.e., the signal-to-noise ratio σ 2 A l /σ 2 N l becomes rather small. Therefore, we can 1 In general, the signal-to-noise ratio is defined as the ratio between the power of the signal (here, peaks) and the power of the noise (here, baseline noise). In probability theory, the term power is referred to the variance of random variables. Thus, the ratio σa 2 l /σn 2 l precisely corresponds to this definition.

67 4.3 Feature Selection via Robust 1-Bit Compressed Sensing 61 estimate the covariance of (4.20): 1 Cov(Y, X std l,α ) = α l λ + a m,l σ Sm ρ Sm 1. }{{} m }{{} 1 1 α l σa 2 +σ 2 l N l 1 α l σ Nl This particularly implies that the l-th feature will not be selected by a soft-thresholding, provided the sparsity parameter s > 0 has been appropriately chosen (see next subsection). Stated in different words, we can conclude that SPA-Select via (Rob 1-Bit) does not tend to overrate low-amplitude features, which especially avoids false-positive selections. 2 The findings of the previous paragraphs can be roughly summarized as follows: The covariance Cov(Y, Xl,α std ) is (almost) proportional to the disease correlation coefficient ρ Sm if the index l [d] is close to the peak center of the m-th feature atom. On the other hand, if l is far away from any of the peak centers, Cov(Y, Xl,α std ) is negligibly small. This statement gives a strong theoretical evidence that the approach of (Rob 1-Bit), although very simple, is indeed able to provide promising disease fingerprints. The above argumentation might suggest that a small value of α should be usually preferred because we have an explicit interpretation for Cov(Y, Xl,α std ) in this case. However, choosing α close to 0 is not always the best option when dealing with (real-world) MS-data. This is primarily due to the fact that SPA-Standardize and SPA-Select are actually only empirical estimates of the above statistical quantities: The number of given samples n is usually much smaller than the data dimension d, implying that the actual behavior of several feature variables could significantly differ from their theoretical prediction. This particularly concerns the baseline noise vector N whose empirical correlation with Y is almost never exactly equal to 0. A pure soft-thresholding of the empirical correlation between (x 1,..., x n ) and (y 1,..., y n ) might therefore cause certain instabilities, especially when the sample set is not large enough. Alternatively, we could exploit the specific (peak-)structure of MS-data, by choosing α closer to 1. This would make the selection process more robust against noise and a geometric separation of the sample groups might become easier. But on the other hand, we have figured out (cf. (4.21)) that α 1 could have a negative influence on the feature ranking, posing the risk that some relevant features are missed. This observation is also consistent with the heuristic argumentation of Subsection 3.2.3, where we stated that the interpolation parameter α controls the trade-off between separability and correlation (with the disease label). 1 This estimation is of course not very rigorous. Actually, it shall just indicate that the magnitude of the covariance is significantly smaller than in the case where l is close to a peak maximum. 2 In our context, a feature is called false-positive if it is biologically irrelevant but has been however selected by the applied algorithm.
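For concreteness, the following MATLAB sketch mirrors the above computations on a finite sample. It is a simplified stand-in for Algorithm 3.4, not the actual implementation: in particular, the exact form of the exponent defining α_l is an assumption here, and the labels are assumed to be (approximately) balanced ±1 values. X is the n×d matrix of scattered spectra and y the n×1 label vector; alpha ∈ [0,1], cexp ≥ 0, and lambda > 0 are given tuning parameters.

    % Sketch: weighted standardization followed by the covariance vector that
    % (Rob 1-Bit) soft-thresholds (empirical counterpart of the analysis above).
    n      = size(X, 1);
    Xc     = X - repmat(mean(X, 1), n, 1);       % center each feature (Lines 1-2)
    sigma  = std(Xc, 1, 1);                      % empirical standard deviations
    rho    = (y' * Xc) ./ (n * max(sigma, eps)); % empirical correlations with y (Lines 3-4)
    alphal = alpha .^ (cexp * (1 - abs(rho)));   % feature-wise weights (assumed form)
    w      = alphal / lambda + (1 - alphal) ./ max(sigma, eps);
    Xstd   = Xc .* repmat(w, n, 1);              % weighted interpolation (Line 5)
    covvec = (y' * Xstd)' / n;                   % empirical covariance vector with y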

68 62 4 Theory for SPA Remark 4.12 (The Role of the Baseline Noise) The computation of (4.21) shows that the baseline noise N = (N 1,..., N d ) plays almost no role in the case of α = 1. The situation for α 0 is very different because the variance of N l appears explicitly in (4.22). At a first sight, this circumstance seems to be somewhat undesirable, but a complete absence of (low-amplitude) noise could surprisingly lead to serious problems as well. Indeed, for a fixed l [d], let us assume that N l 0 and that there is an m [M] such that a m,l a m,l for m m. This merely implies that l is much closer to c m than to the other peak centers; but in contrast to the above setting, the actual value of a m,l does not necessarily has to be close to the maximum of a m. Since σa 2 l σn 2 l = 0, we can approximate (4.22) again by Cov(Y, Xl,0 std) ρ S m. Thus, the relevance of the l-th feature variable becomes basically independent of its spatial location. In other words, (Rob 1-Bit) has lost the nice property of implicit peak detection. This drawback is not just a theoretical phenomenon but can be also observed for the simulated data in Section 5.1 (see Remark 5.1(2)). Since (x 1, y 1 ),..., (x n, y n ) are drawn i.i.d. from (X, Y ), the strong law of large numbers (Theorem A.13) states that the empirical covariance vector 1 n n k=1 y kx std k (almost surely) converges to Cov(Y, Xα std ) as n. But unfortunately, this does not say anything about the actual asymptotic behavior (in terms of n). In general, the speed of convergence highly depends on the distributions of the latent variables and the noise, which we did not have specified at all. For the realization of Example 4.2, one could use concentration inequalities to prove an asymptotic statement on the number of required samples; but this would probably bring no deeper insights because the underlying distributions are too specific for an approximation of real-world data. However, the number of required observations is a central issue in statistical learning theory and could be an interesting direction of future research. In this work, we are actually content with the fact that SPA-Select via (Rob 1-Bit) provides promising and stable results, supposed that the number of samples is sufficiently large Solution Paths and Sparsity In this final part, we return to the original challenge of extracting sparse disease fingerprints. We have already figured out in Subsection that the (effective) sparsity of the maximizer of (Rob 1-Bit) can be controlled by the parameter s > 0. In such a situation, it is very common (especially in machine learning) to consider the solution path of an algorithm, i.e., a mapping of the (regularization) parameters to the corresponding output vector, where the input data set stays fixed. For the particular case of (Rob 1-Bit), we have R >0 R d, s ω 1-Bit (s) := argmax x, ω, (4.23) ω K d,s where K d,s = B2 d sb1 d and x = n k=1 y kx std k. Assuming that x has a unique maximal entry, Proposition 4.10 shows that this map is indeed well-defined and that ω 1-Bit (s) can

Figure 4.6: Visualization of a solution path λ_2 ↦ ω^Soft(λ_1, λ_2) with λ_1 = 1. Every colored line corresponds to another feature variable and the vertical dashed lines indicate the events where |x̄_l| = λ_2 for some l ∈ [d].

always be written as a rescaled soft-thresholding:

    \omega_l^{\text{1-Bit}}(s) := \omega_l^{\text{Soft}}(\lambda_1, \lambda_2) := \begin{cases} \dfrac{\bar{x}_l - \lambda_2}{\lambda_1}, & \bar{x}_l \ge \lambda_2, \\[4pt] \dfrac{\bar{x}_l + \lambda_2}{\lambda_1}, & \bar{x}_l \le -\lambda_2, \\[4pt] 0, & \text{otherwise}, \end{cases} \qquad l \in [d],

where λ_1 = λ_1(s) > 0 is the scaling factor and λ_2 = λ_2(s) ≥ 0 the threshold. The relationship s ↦ λ_2(s) is unfortunately not explicitly known (cf. Remark 4.11). But since we are rather interested in the qualitative behavior of the feature selection, it is more convenient to switch to the perspective of soft-thresholding and to consider the mapping

    \mathbb{R}_{>0} \times \mathbb{R}_{\ge 0} \to \mathbb{R}^d, \quad (\lambda_1, \lambda_2) \mapsto \omega^{\text{Soft}}(\lambda_1, \lambda_2) := \big( \omega_1^{\text{Soft}}(\lambda_1, \lambda_2), \dots, \omega_d^{\text{Soft}}(\lambda_1, \lambda_2) \big),   (4.24)

instead of (4.23). Thus, the following arguments will be stated in terms of the parameters λ_1 and λ_2. Moreover, we may assume w.l.o.g. that λ_1 = 1 because this is just a scalar factor which does not affect the support of ω^Soft.

To get a better intuition for (4.24), let us first suppose that λ_2 > ‖x̄‖_∞. Then, the solution is obviously trivial, i.e., ω^Soft(λ_1, λ_2) = 0. When λ_2 is continuously decreased, we eventually reach |x̄_l| = λ_2 for some l ∈ [d]. This is exactly the point where the l-th feature is "selected"; the corresponding entry of ω^Soft(λ_1, λ_2) now grows linearly as λ_2 decreases further. Decreasing the threshold λ_2 even further, more and more features get selected in the same way and the support of ω^Soft(λ_1, λ_2) enlarges. If λ_2 = 0, we finally end up with ω^Soft(λ_1, λ_2) = x̄. The solution path λ_2 ↦ ω^Soft(λ_1, λ_2) is therefore a piecewise-linear function, as illustrated in Figure 4.6. In particular, we observe that the sparsity of ω^Soft, i.e.,

    \|\omega^{\text{Soft}}(\lambda_1, \lambda_2)\|_0 = \#\operatorname{supp}(\omega^{\text{Soft}}(\lambda_1, \lambda_2)) = \#\{\, l \in [d] \mid |\bar{x}_l| > \lambda_2 \,\},

is a monotonically decreasing function of the threshold parameter λ_2.
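Since the whole path is determined by x̄, it can be traced numerically by a simple sweep over the threshold. The following MATLAB sketch (with λ_1 = 1 and x̄ computed as in the previous sketches, i.e., from the standardized spectra) records the support size along the way; it is meant purely as an illustration of the monotonicity statement above.

    % Tracing the soft-thresholding solution path: sweep lambda2 from ||xbar||_inf to 0
    lams = linspace(max(abs(xbar)), 0, 200);
    supp = zeros(size(lams));
    for i = 1:numel(lams)
        w       = sign(xbar) .* max(abs(xbar) - lams(i), 0);
        supp(i) = nnz(w);              % grows monotonically as lambda2 decreases
    end
    plot(lams, supp); xlabel('lambda_2'); ylabel('# selected features');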

70 64 4 Theory for SPA For this reason, soft-thresholding is indeed able to provide very sparse feature vectors when λ 2 is appropriately chosen. However, this property does not necessarily imply that these (few) selected features are also biologically relevant. We have already pointed out in Subsection that the peak structure of MS-data leads to an undesirable behavior of (Rob 1-Bit). This is due to the fact that the average spectrum x = n k=1 y kx std k still consists of (approximate) Gaussians, so that a thresholding usually selects multiple features of a single peak; see Figure 3.6 for an illustration of this phenomenon. The solution path ω Soft will therefore always contain redundant features, which shall not be part of a minimal disease fingerprint. In order to tackle this drawback, we have incorporated the additional postprocessing step of SPA-Sparsify, which reduces consecutive sequences of features to their most significant entries. This type of redundancy is directly related to coherence of Type II, which we have introduced in Subsection 4.1.2: Indeed, let us suppose that l, l [d] are both closely located around the peak center of the m-th feature atom. The corresponding feature variables and Xstd l,α are then both dominated by the amplitude of the latent factor S m, which produces a strong cross-correlation, i.e., ρ(xl,α std, Xstd l,α) 1. From this perspective, we can consider SPA-Sparsify as an extension of (Rob 1-Bit) that eliminates redundancies of Type II. However, due to its spatial localization, this approach is not able to detect the global coherence between the feature atoms, that is, coherence of Type I. X std l,α Finally, let us summarize the key results of this section. In Subsection 4.3.2, we have proven that SPA-Select via (Rob 1-Bit) is actually equivalent to a (rescaled) softthresholding of the empirical covariance vector 1 n n k=1 y kx std k. But the careful statistical analysis of (FwM) in Subsection has shown that a combination of (Rob 1-Bit) and SPA-Standardize still allows for a robust detection of biomarkers from MS-data (provided sufficiently many samples are available). The current subsection has finally introduced the important concept of solution paths, which is extremely useful for analyzing (sparse) feature selection algorithms. In this context, it has particularly turned out that redundant selections (of Type II) are inevitable when working with (Rob 1-Bit). In the following section, we will see that the approaches of (l 1 -SVM) and (LASSO) do not suffer from this drawback, which makes them much more flexible and applicable to numerous types of data. 4.4 Feature Selection via l 1 -SVM and the LASSO This section deals with the two alternative algorithms for SPA-Select, namely (l 1 -SVM) and (LASSO). Compared to the univariate approach of (Rob 1-Bit), these are multivariate methods, which are particularly able to overcome the problem of redundant feature selection. The non-linearity of the optimized functionals, however, makes a theoretical analysis much more challenging. This part (especially Subsection 4.4.3) will be therefore less rigorous than the previous one and is rather based on heuristic arguments.

71 4.4 Feature Selection via l 1 -SVM and the LASSO Geometric Intuition Behind the Squared and Hinge Loss Similar to 1-bit compressed sensing, our major goal is still to learn an appropriate classifier ω R d for the linear decision model (LDM ), that is, y = sign( x, ω ), y { 1, +1}, x R d, where we are given a finite set of sample pairs {(x k, y k )} k [n] R d { 1, +1}. A large class of algorithms for such a classification task, including (l 1 -SVM) and (LASSO), tries to find ω by solving a constrained optimization problem of the form 1 min ω R d k=1 n L(y k, x k, ω) subject to J(ω) s, (4.25) with some regularization parameter s 0. Here, L: R R d R d R 0 is the loss functional, which serves as an accuracy measure for the classification of the training set, and the constraint functional J : R d R 0 plays the role of a penalty, enforcing certain properties of a minimizer of (4.25), e.g., sparsity. Let us now consider (l 1 -SVM) and (LASSO) in their general form (without the additional notation from Subsection 3.2.4): min ω R d k=1 min ω R d k=1 Thus, (l 1 -SVM ) is based on the hinge loss n [1 y k x k, ω ] + subject to ω 1 s, (l 1 -SVM ) n (y k x k, ω ) 2 subject to ω 1 s. (LASSO ) L Hng (y, x, ω) := [1 y x, ω ] + = max{0, 1 y x, ω }, whereas (LASSO ) uses the classical squared loss L Sq (y, x, ω) := (y x, ω ) 2. In the following, we will see the geometric intuition behind these two functionals, and in particular, the way in which they penalize outliers. A detailed discussion of the impact of the sparsity-promoting constraint J(ω) = ω 1, which is the same for both approaches, is then postponed to the next subsections. For the remainder of this subsection, let us assume that the input data set {(x k, y k )} k [n] R d { 1, +1} has already passed through the steps of SPA-Scattering and SPA- Standardize. In order to understand the above loss functionals better, it is a good idea to recall the geometric perspective of Section 3.1, especially of Figure 3.1. More precisely, 1 Note that in many cases, for instance, when L and J are convex, this can be equivalently formulated as a regularized optimization problem (see also Proposition A.19 in Appendix A.4).

72 66 4 Theory for SPA Figure 4.7: Plots of the L Hng and L Sq in terms of y x, ω. Note that we actually have L Sq (y, x, ω) = (y x, ω ) 2 = (1 y x, ω ) 2. we shall view x 1,..., x n as points in R d which are labeled by y 1,..., y n, respectively, and ω is the normal vector of the separating hyperplane H(ω, 0). 1 A first important observation is that both L Hng (y, x, ω) and L Sq (y, x, ω) depend only on the scalar product of x and ω; Figure 4.7 shows the graphs of the hinge and squared loss in terms of y x, ω. Geometrically seen, x, ω is proportional to the signed distance of x to H(ω, 0): In fact, ω sign( x, ω ) specifies on which side of the hyperplane plane the point x lies, and x, ω 2 measures the (shortest) euclidean distance between x and H(ω, 0). Now, we fix the normal vector ω R d \ {0} as well as some (unknown) sample pair (x, y) R d { 1, +1}. If (x, y) is correctly predicted by (LDM ), then x lies on the correct side of H(ω, 0), i.e., y x, ω > 0 (see also Figure 3.1). Such a situation is obviously desirable when minimizing the hinge loss. Indeed, the further x is away from the decision hyperplane the smaller becomes L Hng (y, x, ω) = [1 y x, ω ] +. But at some point, we eventually have L Hng (y, x, ω) = 0, and the distance of x to H(ω, 0) does not bring any additional contribution to the minimization. In this case, we usually say that x lies behind the margin that was generated by ω; see Figure 4.8 for an illustration. On the other hand, if (x, y) is misclassified by (LDM ), the point x lies on the wrong side of the hyperplane and the hinge loss imposes a linear penalty when the distance x, ω grows. Intuitively spoken, minimizing the hinge loss functional in (l 1 -SVM ) can be identified with the exercise of finding a hyperplane H(ω, 0) whose margin optimally separates the training set. To make precise what is meant by an optimal margin, it is useful to consider the classical notion of a support vector machine before: Remark 4.13 (Classical l 2 -SVMs) Let us assume that the training points are perfectly separable, i.e., there exists at least one ω R d such that y k = sign( x k, ω ) for all k [n]. The main goal is now to find the biggest possible margin between the two groups. This challenge can be met by solving (see also Figure 4.8) max M>0, ω R d ω 2 1 M subject to y k x k, ω M for all k [n], (4.26) 1 Note that we may assume that the intercept ω 0 is equal to 0 here, since the data set has been already centered.

73 4.4 Feature Selection via l 1 -SVM and the LASSO 67 Figure 4.8: The red region illustrates the margin (of width M = 1/ ω 2 ) that is generated by ω. Formally, it is given by the set {x R d x, ω 1}. A vector x R d lies behind this margin if y x, ω 1, meaning that the hinge loss vanishes. or equivalently, min ω R d ω 2 subject to y k x k, ω 1 for all k [n]. A (possibly non-unique) solution ω # of these optimization problems precisely corresponds to a hyperplane which maximizes the euclidean distance to the (labeled) sample set (cf. [40]). Moreover, it turns out that ω # can be solely written as a linear combination of those data points that lie on the boundary of its margin, i.e., x k, ω # = 1. These points are therefore also called support vectors of ω #, which particularly explains the name support vector machine. However, (4.26) becomes infeasible if the data is not perfectly separable, such as in Figure 3.1. This drawback has led to a constrained version of the l 2 -SVM, min ω R d k=1 n [1 y k x k, ω ] + subject to ω 2 s, (l 2 -SVM) which can also deal with outliers, lying within or on the wrong side of the margin. The optimization problem of (l 2 -SVM) is quite similar to (l 1 -SVM ), only differing in the penalty constraint. Support vector machines and their extensions form a keystone of machine learning and are studied in almost every textbook on this field. For a good introduction to SVMs, see [40], and for a more comprehensive work, the interested reader may consider [74]. The notion of optimal margins from the previous remark can be also generalized to the concept of l 1 -SVMs by changing the distance measure between a point and a hyperplane:

74 68 4 Theory for SPA Replacing the l 2 -norm in (4.26) with the l 1 -norm, we obtain max M>0, ω R d ω 1 1 M subject to y k x k, ω M for all k [n], (4.27) and it can be shown (cf. [49, 58]) that the resulting hyperplane maximizes the margin with respect to the dual norm of l 1, which is the l -norm. The constrained version of (4.27) then precisely corresponds to (l 1 -SVM ). Note that this observation can be even generalized to an arbitrary norm (and its dual norm ); see [49, 58] for details. While the concept of margin maximization is quite natural for the task of separating (labeled) data points, the LASSO was originally designed for regression problems. We have already pointed out subsequently to Algorithm 3.7 that minimizing the squared loss (ordinary least squares) is a classical approach to fit a linear regressions model y = x, ω, y R, x, ω R d. Compared to (LDM ), the output variable y is usually assumed to be real-valued in this setting. Furthermore, in contrast the hinge loss L Hng, the squared loss L Sq penalizes also those points which lie on the correct side of the margin: Indeed, we have L Sq (y, x, ω) = 0 y = x, ω, meaning that x has to lie exactly on the boundary of the margin (cf. Figure 4.8). The LASSO therefore seems to be a bit inappropriate for separating tasks at a first sight. However, it was recently shown that the LASSO works also surprisingly well for nonlinear models, such as (LDM ); see [66] for example. Hence, it is not very astonishing that the experiments for SPA-Select via (l 1 -SVM) and (LASSO) in Chapter 5 deliver comparable results. Another interesting observation can be made by a reformulation of (LASSO ): min ω R d k=1 min ω R d k=1 min ω R d max ω R d k=1 n (y k x k, ω ) 2 s.t. ω 1 s n yk 2 2y k x k, ω + x k, ω 2 s.t. ω 1 s n y k x k, ω + 1 n x k, ω 2 2 s.t. ω 1 s k=1 k=1 n y k x k, ω n s.t. x k, ω 2 ν and ω 1 s, k=1 where means that the sets of optimal solutions are equal, and for the last step, one may argue as in the proof of Proposition A.19. The last line looks remarkably similar to (Rob 1-Bit ). The actual difference is that the l 2 -constraint ω 2 1 is replaced by a data-weighted l 2 -penalty here. This shows that the LASSO although it is a real multi-

75 4.4 Feature Selection via l 1 -SVM and the LASSO 69 variate method is not so different from the robust 1-bit compressed sensing approach as one might have expected at the beginning Piecewise-Linear Solution Paths So far, we have analyzed the loss functionals of (l 1 -SVM ) and (LASSO ), independently of the additional l 1 -constraint. In particular, the feature vector ω R d has been fixed in order to understand the general geometry of the linear decision model (LDM ). Now, we return to the actual minimization problems and would like to investigate the qualitative structure of their solutions. Similar to our analysis of (Rob 1-Bit ) in Subsection 4.3.4, it is very useful to consider the solution paths of the l 1 -SVM and the LASSO: s ω SVM (s) := argmin ω R d s ω LASSO (s) := argmin ω R d n [1 y k x k, ω ] + s.t. ω 1 s, k=1 n (y k x k, ω ) 2 s.t. ω 1 s. k=1 Remark 4.14 (Uniqueness of the Solution Paths) At the moment, it is not even clear whether these path-mappings are really well-defined. Indeed, in the setting of n < d, this is not guaranteed in general. The question of unique solutions for both (l 1 -SVM ) and (LASSO ) is unfortunately highly non-trivial and would exceed the scope of this thesis. However, there exists already some recent work on this problem, for instance, in [79], where sufficient and necessary conditions for the uniqueness of the LASSO are presented. For our model of MS-data, which is based on random distributions, one might even show that the results of [79] apply almost surely. Therefore, it is quite reasonable to assume for the following argumentation that ω SVM and ω LASSO are both well-defined for all s 0. In practice, on the other hand, the issue of uniqueness might become much more challenging, since the actual solution (path) does not only depend on the analytic optimization problem but also on its implementation. Although the non-linearity of the hinge and squared loss makes an analysis more complicated, the solutions paths of (l 1 -SVM ) and (LASSO ) are actually well-understood. The following result reveals the qualitative behavior of ω SVM and ω LASSO : Theorem 4.15 (Piecewise-Linear Paths; [26, 89]) Let us assume that the solution of (l 1 -SVM ) is unique for input data (x 1, y 1 ),..., (x n, y n ) R d { 1, +1} and every s 0. Then s ω SVM (s) is a piecewise-linear and continuous map in each component. In this case, we also speak of a piecewise-linear path. The same (literally) holds for the solution path of (LASSO ). Proof. For a full proof of the l 1 -SVM-case see [89, Appendix] and for the LASSO, see [69, Theorem 2].
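Exact path-following algorithms for (l_1-SVM′) and (LASSO′) are given in [26, 89]. For a quick qualitative picture, the following MATLAB sketch instead solves the penalized form of (LASSO′) with a plain proximal-gradient (ISTA) iteration on a grid of penalty parameters, which traces an approximation of the solution path; the two loss functionals are also evaluated explicitly. None of this reproduces the implementations referenced in this thesis, and the grid, step sizes, and iteration counts are chosen ad hoc.

    % Approximate LASSO path via ISTA on  min_w ||y - X*w||^2 + mu*||w||_1,
    % evaluated on a grid of penalties mu (X is n-by-d, y is n-by-1 with +-1 entries).
    [n, d] = size(X);
    L      = 2 * norm(X)^2;                       % Lipschitz constant of the gradient
    mus    = linspace(2 * max(abs(X' * y)), 0, 50);
    path   = zeros(d, numel(mus));
    w      = zeros(d, 1);
    for i = 1:numel(mus)
        for it = 1:500                            % warm-started ISTA iterations
            grad = 2 * X' * (X * w - y);
            z    = w - grad / L;
            w    = sign(z) .* max(abs(z) - mus(i) / L, 0);   % soft-thresholding step
        end
        path(:, i) = w;
    end
    hinge   = @(w) sum(max(0, 1 - y .* (X * w)));  % hinge loss of (l1-SVM')
    squared = @(w) sum((y - X * w).^2);            % squared loss of (LASSO')
    fprintf('hinge: %.2f, squared: %.2f\n', hinge(w), squared(w));
    plot(mus, path');                              % stepwise feature activation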

Figure 4.9: Visualization of a piecewise-linear solution path s ↦ ω^SVM(s). Every colored line corresponds to an activated feature variable ω_l. The vertical dashed lines indicate the events (joints) when a residual [1 − y_k⟨x_k, ω⟩]_+ vanishes, or when a non-zero variable ω_l becomes zero.

Although the proof of Theorem 4.15 is omitted here, it is insightful to see at least some intuitive verification of the statement. We start with the case of the l_1-SVM: At first, let us assume that we are at the beginning of the path, i.e., s = 0. Then, the solution is obviously trivial, ω^SVM(0) = 0. Recalling (l_1-SVM′), we observe that both the hinge loss and the l_1-penalty are continuous, piecewise-linear functions in each component. When s is now slightly increased, there is always (exactly) one variable ω_l that contributes the most to minimizing the hinge loss

    L^{\text{Hng}}(\omega) := L^{\text{Hng}}(\omega_1, \dots, \omega_d) := \sum_{k=1}^n [1 - y_k \langle x_k, \omega \rangle]_+ = \sum_{k=1}^n \Big[ 1 - y_k \sum_{l=1}^d x_{k,l}\, \omega_l \Big]_+.

Thus, the optimal solution path is obtained when ω_l changes linearly as s gets larger. But at some point, we arrive at a kink of L^Hng, i.e., one of the hinge loss residuals [1 − y_k⟨x_k, ω⟩]_+ becomes zero. This particularly implies that the corresponding summand of L^Hng vanishes and the contribution of ω_l is smaller. Hence, a second feature variable ω_{l′} may get activated (selected), so that the optimal path ω^SVM now evolves linearly in ω_l and ω_{l′}. As s grows even further, we eventually arrive at another kink of L^Hng and a third feature is activated. Continuing this procedure, more and more feature variables become non-zero, and we finally end up with the full piecewise-linear solution path ω^SVM; see Figure 4.9 for an illustration. This intuitive argument is made precise in [89, 90], where the authors provide a path-following algorithm that computes ω^SVM explicitly. In particular, they determine the slopes of ω^SVM as well as the joints where the residual terms [1 − y_k⟨x_k, ω⟩]_+ turn from non-zero to zero (and vice versa).

The original idea of path-following algorithms goes back to [26], where the concept of least angle regression (LARS) was introduced. The LARS algorithm computes a piecewise-linear path s ↦ ω^LARS(s) in a very similar way as described above. Here, the active

77 4.4 Feature Selection via l 1 -SVM and the LASSO 71 feature variables again evolve linearly (in s), and a new feature joins the active set when it has the strongest correlation with the residual vector y X ω LARS (s). 1 In fact, it can be shown that a slightly modified version of LARS is equivalent to the LASSO ([26, Theorem 1]), which particularly allows us to compute the LASSO-solutions (actually, the entire path) much more efficiently than by solving the above optimization problem explicitly. A detailed discussion of LARS is omitted here because this involves various technical difficulties and would bring no further insights. Remark 4.16 The concept of path-following algorithms can extended to a whole class of optimization problems. In [69], for example, sufficient and necessary conditions on L and J are derived such that the solution path of min ω R d k=1 n L(y k, x k, ω) subject to J(ω) s, is always piecewise-linear (cf. (4.25)). This especially includes the case where L is convex and piecewise-quadratic, and J is convex and piecewise-linear. The statement of Theorem 4.15 and the related idea of path-following algorithms immediately imply that the feature activation of (l 1 -SVM ) and (LASSO ) happens stepwise. However, a solution path might become very complex (many active variables) for large s, so that we eventually loose control over the feature selection process, especially when the input data set is noisy. Therefore, we are particularly interested in the root of a solution path, meaning that we consider small values of s. Then, ω SVM (s) and ω LASSO (s) are still relatively sparse, and the crucial question is actually whether these feature vectors are really containing interpretable information. In the following subsection, we will see that this is indeed the case for input data which is drawn from the proteomics forward model of (FwM). Remark 4.17 (Scalability of the Algorithms) The parameter s essentially controls the sparsity of the l 1 -SVM- and LASSO-solutions, and usually, ω SVM (s) 0 and ω LASSO (s) 0 are monotonically growing with s. However, the explicit relationship highly depends on the input data x 1,..., x n. For instance, let us assume that the sample vectors are rescaled by x k λx k for a fixed λ > 0. In order to obtain the same output as for the original data, it is then necessary to modify the solution paths as well, namely by s ω SVM (s/λ) and s ω LASSO (s/λ). Finding the optimal value of s is therefore a challenging task in practice, which always relies on the given input data. From a theoretical perspective, this issue is less relevant because we are rather interested in qualitative results here Sparsity and Coherent Features The argumentation of the previous two subsections was rather abstract and also holds for general input data. Now, we return to our specific proteomics model (FwM) and analyze how the solution paths of SPA-Select via (l 1 -SVM) or (LASSO) are behaving in this 1 Recall that X = [ x 1... x n ] R n,d.

situation. For this purpose, let us recall the notation and the results of Subsection 4.3.3. Similar to (Rob 1-Bit), we observe that the minimized loss functionals of (l_1-SVM) and (LASSO) are both proportional to the empirical expectation of a random variable which depends on X and Y. Thus, there exist again continuous analogs of the respective optimization problems:

    \min_{\omega \in \mathbb{R}^d} E\big( [1 - Y \langle X_\alpha^{\text{std}}, \omega \rangle]_+ \big) \quad \text{subject to} \quad \|\omega\|_1 \le s,   (l_1-SVM-cont)

    \min_{\omega \in \mathbb{R}^d} E\big( (Y - \langle X_\alpha^{\text{std}}, \omega \rangle)^2 \big) \quad \text{subject to} \quad \|\omega\|_1 \le s.   (LASSO-cont)

The squared loss is, due to its smoothness, a bit easier to study than the hinge loss, and therefore, we restrict our statistical analysis to the LASSO here. At first, let us expand the expectation from (LASSO-cont):

    E\big( (Y - \langle X_\alpha^{\text{std}}, \omega \rangle)^2 \big) = E(Y^2) - 2 E(Y \langle X_\alpha^{\text{std}}, \omega \rangle) + E(\langle X_\alpha^{\text{std}}, \omega \rangle^2)
    = 1 - 2 \langle \operatorname{Cov}(Y, X_\alpha^{\text{std}}), \omega \rangle + V\Big( \sum_{l=1}^d \omega_l X_{l,\alpha}^{\text{std}} \Big)
    = 1 - 2 \sum_{l=1}^d \omega_l \operatorname{Cov}(Y, X_{l,\alpha}^{\text{std}}) + \sum_{l=1}^d \omega_l^2\, V(X_{l,\alpha}^{\text{std}}) + \sum_{l \ne l'} \omega_l \omega_{l'} \operatorname{Cov}(X_{l,\alpha}^{\text{std}}, X_{l',\alpha}^{\text{std}})
    = 1 + \sum_{l=1}^d \omega_l \big( \omega_l V(X_{l,\alpha}^{\text{std}}) - 2 \operatorname{Cov}(Y, X_{l,\alpha}^{\text{std}}) \big) + \sum_{l \ne l'} \omega_l \omega_{l'} \operatorname{Cov}(X_{l,\alpha}^{\text{std}}, X_{l',\alpha}^{\text{std}}),   (4.28)

where we abbreviate Q_l(ω_l) := ω_l (ω_l V(X^std_{l,α}) − 2 Cov(Y, X^std_{l,α})). Our goal is still to minimize this expression, which is a multivariate polynomial in ω = (ω_1, ..., ω_d). In a first step, let us assume that the feature variables are mutually uncorrelated, implying that the third term in (4.28) vanishes. Then, it suffices to minimize the sum Σ_l Q_l(ω_l). All functions Q_l are parabolas which attain their minimum at ω^#_l := Cov(Y, X^std_{l,α}) / V(X^std_{l,α}) with

    Q_l(\omega_l^\#) = -\frac{[\operatorname{Cov}(Y, X_{l,\alpha}^{\text{std}})]^2}{V(X_{l,\alpha}^{\text{std}})} \overset{(4.19)}{=} -\rho_{X_l}^2,

respectively. Note that this term is independent of α. This particularly implies that the minimum of Q_l continuously depends on the correlation coefficient ρ_{X_l} of the l-th feature. Combining this observation with the results of the previous subsection on solution paths (especially Theorem 4.15), we may conclude that (LASSO-cont) prefers selecting those features which are strongly correlated with Y.¹

Such a behavior is of course very desirable for the objective of SPA, but the assumption of mutually uncorrelated features is obviously not satisfied for (FwM). Thus, we shall fix

¹ But one should be aware of the fact that Theorem 4.15 actually applies to the empirical version of (LASSO-cont). Consequently, the above statements will be only approximately true for the finite-sample setting.

79 4.4 Feature Selection via l 1 -SVM and the LASSO 73 two features l 1, l 2 [d] whose covariance Cov(Xl std 1,α, Xstd l 2,α) is non-zero. Then, the above expectation cannot be written as a sum of univariate polynomials (Q 1,..., Q d ) anymore, so that the minimization finally turns into a multivariate problem; the actual coherence between ω l1 and ω l2 is controlled by the magnitude of Cov(Xl std 1,α, Xstd l 2,α), which appears in the third term of (4.28). At first, let us deal with the case where l 1 and l 2 are both close a certain peak center c m. Following the argumentation of Subsection 4.3.3, we obtain an approximation ) ( ) Xl std i,α (α = 1 α li li λ + S m a σa 2 li + σn 2 m,li + N li li m ( ) 1 α li α li λ + ( ) Sm a m σa 2 li + σn 2,l i + N li, li for i = 1, 2. Since σ 2 A li σ 2 N li, we may even assume that there exists θ R with θ 1 such that Xl std 2,α θ Xstd l 1,α. By this estimation, the corresponding terms in (4.28) can be simplified: ω l1 (ω l1 V(Xl std 1,α) 2 Cov(Y, Xstd l 1,α )) + ω l 2 (ω l2 V(Xl std 2,α) 2 Cov(Y, Xstd l 2,α )) + 2ω l1 ω l2 Cov(Xl std 1,α, Xstd l 2,α ) ω l1 (ω l1 V(Xl std 1,α) 2 Cov(Y, Xstd l 1,α )) + ω l 2 (θ 2 ω l2 V(Xl std 1,α) 2θ Cov(Y, Xstd l 1,α )) + 2θω l1 ω l2 V(Xl std 1,α ) = (ω l1 + θω l2 ) 2 V(Xl std 1,α ) 2(ω l 1 + θω l2 ) Cov(Y, Xl std 1,α ( )) ) = (ω l1 + θω l2 ) (ω l1 + θω l2 )V(Xl std 1,α) 2 Cov(Y, Xstd l 1,α ). (4.29) This expression actually only depends on ω := ω l1 + θω l2. Thus, similar to the above case of uncorrelated features, the minimum of (4.29) is attained at ω # := (ω l1 + θω l2 ) # = Cov(Y, Xstd l 1,α ) V(Xl std 1,α ), and the minimal value is equal to ρ 2 X l1 ( ρ 2 X l2 ). Now, we can apply the principle of piecewise-linear solution paths: Supposed that θ < 1 (i.e., Xl std 1,α is larger in magnitude than Xl std 2,α ), the reasoning of the previous subsection shows that the feature variable ω l 2 is not selected by (LASSO-cont) because the minimizer ω # = (ω l1 + θω l2 ) # of (4.29) is faster achieved by enlarging ω l1. For the general situation of (4.28), where all feature variables are involved, we may therefore conclude that (LASSO-cont) tends to activate only the maxima of the Gaussian atoms a 1,..., a M. This gives a strong theoretical evidence that the LASSO is indeed able to avoid redundant feature selections, which are due to coherence of Type II (cf. Subsection 4.1.2). In particular, SPA-Select via (LASSO) does not require an additional postprocessing, such as the component detection of SPA-Sparsify.
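The following small MATLAB experiment illustrates this elimination of Type II redundancy on synthetic data: two exactly proportional feature columns and one pure-noise column are fed to the penalized LASSO (again via a plain ISTA iteration), and typically only the larger of the two coherent features becomes active. The construction is purely illustrative and not part of the experiments in Chapter 5; all constants are chosen ad hoc.

    % Type-II coherence check: x2 = 0.8 * x1 is never preferred over x1
    n  = 200;
    z  = randn(n, 1);                      % latent peak amplitude
    X  = [z, 0.8 * z, randn(n, 1)];        % coherent pair plus a pure-noise feature
    y  = sign(z + 0.25 * randn(n, 1));     % labels driven by the latent factor
    mu = 0.5 * max(abs(2 * X' * y));       % moderate penalty level
    L  = 2 * norm(X)^2;
    w  = zeros(3, 1);
    for it = 1:2000                        % plain ISTA on the penalized LASSO
        z2 = w - 2 * X' * (X * w - y) / L;
        w  = sign(z2) .* max(abs(z2) - mu / L, 0);
    end
    active = find(w ~= 0)                  % typically just the first feature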

80 74 4 Theory for SPA On the other hand, coherence of Type I comes into play when l 1 and l 2 are close to different peak centers, say c m1 and c m2, respectively. If X l1 and X l2 are strongly correlated, i.e., ρ(x l1, X l2 ) 1, the argumentation of the previous paragraph can be adopted, implying that only one of both features is (early) selected by (LASSO-cont). Thus, redundancies of Type I are (partly) eliminated as well. Unfortunately, such a behavior could lead to certain difficulties in practice because sometimes important (low-amplitude) features might be missed that way. This issue will be addressed again in Section 6.6, where we will consider an approach to recover the feature atoms of (FwM) from a fingerprint vector of SPA. Finally, we shall assume that l 1 is located far away from any of the peak centers, meaning that X l1 is dominated by the noise term N l1. Since N l1 is neither correlated with Y nor with X l for l l 1, the covariances Cov(Y, Xl std 1,α) and Cov(Xstd l,α, Xstd l 1,α) are negligibly small in this case, so that we can estimate the corresponding terms of (4.28): ω l1 (ω l1 V(Xl std 1,α) 2 Cov(Y, Xstd l 1,α }{{} ) ) + ω l ω l1 Cov(Xl std,α, Xstd l 1,α }{{} ) l 0 l 1 0 ω 2 l 1 V(X std l 1,α ). Since the right hand side attains its minimum at ω # l 1 = 0, it would not bring any contribution for the minimization of the (expected) squared loss if ω l1 is chosen to be non-zero. This observation indicates that (LASSO-cont) will only rarely select false positive features on the baseline, and consequently, the danger of overfitting is significantly reduced. Remark 4.18 (The Impact of SPA-Standardize) So far, we did not comment on the role of SPA-Standardize. One important fact is that the minimum of (4.29) is (almost) independent of α, whereas the corresponding minimizer ω l # = (ω l1 + θω l2 ) # = Cov(Y, Xstd l 1,α ) V(Xl std 1,α ) = ρ Xl1 V(X std l 1,α ) is not. The actual selection order is therefore significantly influenced by the choice of α, especially when the sparsity parameter s is relatively small. 1 Similar to (Rob 1-Bit), there may arise several problems if α is close to 0 or to 1: A full standardization by α = 0 would imply that θ 1. Thus, none of the two features would be preferred, so that the detection of redundancy (of Type II) might fail in this situation. In contrast, the covariances between the feature variables, Cov(Xl std 1,α, Xstd l 2,α ), could become extremely large if α 1, even when l 1 and l 2 are not closely located around peak centers. This could finally lead to an overrating of irrelevant (high-amplitude) peaks. Consequently, α is still a fundamental tuning parameter that allows us to control the feature ranking (at the root) of the solution path. Let us briefly summarize the findings of this section: One the one hand, we have figured out that the order of selection is essentially determined by the (disease) correlation coeffi- 1 For s, the LASSO degenerates to an ordinary least squares fit. The support of the feature vector would then be independent of α, although its entries are significantly influenced by a standardization.

81 4.4 Feature Selection via l 1 -SVM and the LASSO 75 cients ρ X1,..., ρ Xd, so that strongly correlated features usually appear at an earlier position of the solution path of (LASSO). This observation is particularly consistent with the important principle of LARS (cf. Subsection 4.4.2), stating that precisely those features shall be activated which are mostly correlated with the residual vector y X std ω LARS (s). On the other hand, it has turned out that the LASSO is also able to eliminate redundancies which are caused by coherent features (of Type II). The sparse solution vectors of ω LASSO (s) form therefore promising candidates for a (minimal) disease fingerprint, as they contain only a few non-redundant entries corresponding to the maxima of strongly correlated peak atoms. In conclusion, SPA-Select via (LASSO) indeed meets one of the major challenges of this thesis, that is, extracting biologically relevant information from the simple backward model of (LDM ). Finally, it should be clearly emphasized that the many statements of this chapter are based on heuristic and empirical arguments. In order to make this analysis absolutely rigorous, it seems to be inevitable to develop a more advanced mathematical toolset. These novel methods can be probably designed in a quite abstract fashion because the explicit definition of the feature atoms has played no essential role for the above reasoning. Remark 4.19 (SPA-Select via (l 1 -SVM)) As we have already mentioned above, a similar analysis of (l 1 -SVM-cont) would be much more challenging. This is due to the fact that the hinge loss is a non-smooth function, such that a case distinction is necessary to compute its expectation. But this step would actually involve the underlying probability distribution of X, which has not been further specified. At least, one could use Jensen s inequality (Theorem A.12) to determine a lower bound for the expected hinge loss: E ( [ ] ) 1 Y Xα std, ω + [ 1 E( Y X std α [ ], ω ) ]+ = 1 Cov(Y, Xα std ), ω. + Such an estimate could be indeed useful when deriving necessary conditions for feature selection. Empirical results, such as in Chapter 5, indicate that SPA-Select via (l 1 -SVM) has very similar properties as (LASSO); in particular, one observes that the l 1 -SVM is able to detect coherent features as well.


5 Numerical Experiments

This chapter presents the results of some numerical experiments for SPA. In Section 5.1, we will consider a simulated data set which is based on the proteomics forward model of Example 4.2. Since all parameters are known in this situation, we are able to evaluate the different solution paths of SPA-Select in a qualitative way. For the real-world data set of Section 5.2, on the other hand, there is no information about the true features available, so that we have to rely on the classification performance of the selected disease fingerprints.

One major goal of these experiments is to verify the theoretical statements of the previous chapter. However, we have seen that using SPA involves the choice of various tuning parameters, which make an extensive numerical analysis very costly. Since this thesis is rather theoretically oriented, we will only provide a proof of concept and focus on the most important aspects here. This especially concerns the impact of SPA-Standardize as well as the choice of the actual selection algorithm for SPA-Select. All experiments were performed with MATLAB (Version R2014b); for details on the implementation, see Appendix B.

5.1 Feature Selection for Simulated Data

5.1.1 Data Generation

The purpose of this section is to test the performance of SPA for our statistical model from Section 4.1. In order to realize the underlying distribution of (X, Y), we are going to apply the benchmark model of Example 4.2. The following table contains the complete parameter configuration that was used for the experiments:

Model Component            Realization and Parameters
Number of samples          n = 200 with #G_+ = 100 and #G_- = 100
Spectrum dimension         d = 2048
Number of peak atoms       M = 10
Peak centers, cf. (4.3)    (c_1, ..., c_M) = (d/(M+1), 2d/(M+1), ..., Md/(M+1))
Peak width                 β_m^2 = 70 for m ∈ [M]
Peak intensities           I_m ~ N((M+1−m)/M, 0.01) for m ∈ [M]
Scaling factors            s_m = 1 for m ∈ [M]
Activation variables       T_m ~ Ber(p_m) with p_m = (m−1)/M for m ∈ [M]

Table 5.1: Specification of the parameters and probability distributions for the simulated data set, based on Example 4.2.

Model Component            Realization and Parameters
Random peak shifts         C_m ~ N(0, σ_shift^2) with σ_shift = 3 for m ∈ [M]
Background noise           N_l ~ N(0, σ_noise^2) with σ_noise = 0.01 for l ∈ [d]

Table 5.1 (continued)

To obtain equal group sizes, the disease labels are fixed in advance: y_1, ..., y_{n/2} = +1 and y_{n/2+1}, ..., y_n = −1. The corresponding spectra x_1, ..., x_n ∈ R^d are then generated i.i.d. according to Example 4.2. In particular, the latent factors for the k-th sample are drawn from (cf. (4.6))

$$S_m = I_m \cdot \mathbb{1}_{\{y_k = +1\}} + I_m (1 + s_m T_m) \cdot \mathbb{1}_{\{y_k = -1\}}, \qquad m \in [M].^1$$

By the parameter choices of Table 5.1, every spectrum consists of exactly ten Gaussian atoms (of fixed width) whose centers are uniformly arranged. The (mean) amplitudes of the peaks are linearly decreasing in m. The actual intensity of a peak is normally distributed, where the corresponding deviation is proportional to its mean value; in this way, higher peaks also have a larger variance. The activation probabilities, on the other hand, are increasingly ordered. Consequently, the relevance of a feature atom (its correlation with the disease label) is inversely proportional to its amplitude (mean intensity); thus, the largest peak (m = 1) is completely uncorrelated, whereas the smallest one (m = M) is extremely relevant with 90% activation probability. Finally, the peak centers as well as the baseline are affected by (independent) normally distributed noise. Figure 5.1 shows typical samples from the healthy and the diseased group.

¹ Note that we have assumed that all random variables of Table 5.1 are independent of Y.

5.1.2 Performed Experiments

The main motivation behind the inverse proportionality between correlation and intensity is to illustrate that there are situations where SPA without SPA-Standardize (that is, α = 1) might fail. The following numerical simulations show how SPA performs for different choices of the interpolation parameter α as well as for different selection algorithms. The table below presents all tuning parameters which have been fixed for the experiments:

SPA-Step           Parameter                      Value
SPA-Normalize      Scaling factors                λ_1, ..., λ_n = 1
SPA-Scattering     Wavelet and low-pass filter    morlet_1d filterbank of ScatNet
                   Wavelet scale                  j = 4
                   Window scale                   J = 4
SPA-Standardize    Correlation parameter          c_exp = 5
                   Scaling parameter              λ = 1/m(X^scat)
SPA-Sparsify       Threshold                      ε = 10^{-3}

Note that a scaling by SPA-Normalize is actually not necessary (trivial choice of the scaling factors) because all spectra are drawn from the same distribution.
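For concreteness, the following minimal sketch indicates how samples could be drawn according to Table 5.1 (Python for illustration only; the actual experiments rely on the MATLAB implementation of Appendix B, and the exact parametrization of the Gaussian atoms in Example 4.2, as well as the fixed intensity variance, are assumptions made here).

```python
# Minimal simulation sketch, roughly following Table 5.1.  The Gaussian atom
# parametrization and the fixed intensity variance are simplifying assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, M = 2048, 10
centers = np.arange(1, M + 1) * d / (M + 1)          # c_m = m*d/(M+1)
beta_sq = 70.0                                        # peak width beta_m^2
p = (np.arange(1, M + 1) - 1) / M                     # activation probabilities p_m
s = np.ones(M)                                        # scaling factors s_m

def draw_spectrum(y, sigma_shift=3.0, sigma_noise=0.01):
    """Draw one spectrum x in R^d with label y in {-1, +1}."""
    grid = np.arange(d)
    intensities = rng.normal((M + 1 - np.arange(1, M + 1)) / M, 0.1)  # std = sqrt(0.01)
    shifts = rng.normal(0.0, sigma_shift, size=M)     # random peak shifts C_m
    active = rng.random(M) < p                        # T_m ~ Ber(p_m)
    # latent amplitudes S_m: activation only affects the diseased group (y = -1)
    amplitude = np.where(y == +1, intensities, intensities * (1 + s * active))
    x = np.zeros(d)
    for m in range(M):
        x += amplitude[m] * np.exp(-(grid - centers[m] - shifts[m]) ** 2 / beta_sq)
    return x + rng.normal(0.0, sigma_noise, size=d)   # baseline noise N_l

X = np.array([draw_spectrum(+1) for _ in range(100)] +
             [draw_spectrum(-1) for _ in range(100)])
y = np.array([+1] * 100 + [-1] * 100)
```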

Figure 5.1: Two samples from the simulated data set. (a) Sample from G_+ (healthy): the peak amplitudes have a relatively small deviation. (b) Sample from G_- (diseased): several features are activated (indicated by the red arrows), meaning that their amplitudes are twice as high as in the case of a healthy sample.

The morlet_1d filterbank corresponds to the implementation of the Morlet wavelet (cf. (3.4)) which is contained in the ScatNet package (see also Remark 3.3(2)). The remaining parameters (j, J, c_exp, λ, ε) are chosen in an adaptive manner; in fact, this configuration works empirically well, but there is no reason to assume that these are the optimal values for our data set.

The framework of SPA (Algorithm 3.1) is now applied for all three approaches of SPA-Select, i.e., (Rob 1-Bit), (l_1-SVM), and (LASSO), as well as for several interpolation parameters, α ∈ {1.0, 0.5, 0.1}. For each combination, the feature selection is performed with numerous sparsity parameters s such that the resulting feature vectors cover a large part of the solution paths; the actual range of considered s is again adaptively chosen, particularly depending on the scaling of the data (cf. Remark 4.17). Furthermore, the component detection of SPA-Sparsify (Lines 2 and 3) has been omitted for the cases of (l_1-SVM) and (LASSO), in order to show that these approaches are both able to perform non-redundant feature selections.

The results of the experiments are visualized by path plots in Figure 5.3 and Figure 5.4. These plots were generated as follows: Every horizontal section (s is fixed) represents one of the feature vectors ω(s) ∈ {ω^{1-Bit}(s), ω^{SVM}(s), ω^{LASSO}(s)}, respectively. Moreover, ω(s) is normalized by its l_2-norm, and therefore, the gray-scale values correspond to the absolute values of ω(s)/‖ω(s)‖_2 (white = 0 and black = 1). In this way, the relative changes of ω(s) for different values of s become more visible.¹

¹ Without such a normalization, we could observe how each individual variable changes in s, but this would not necessarily reflect the relevance of the features.
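The construction of such a path plot can be summarized in a few lines (illustrative Python sketch; `select_features` is a hypothetical stand-in for any of the three SPA-Select variants).

```python
# Sketch of how the path plots are assembled: one row per sparsity level s,
# each feature vector normalized by its l2-norm so that the gray-scale values
# show the *relative* weight of each feature (white = 0, black = 1).
import numpy as np
import matplotlib.pyplot as plt

def path_plot(X_std, y, sparsity_grid, select_features):
    rows = []
    for s in sparsity_grid:
        w = select_features(X_std, y, s)              # feature vector omega(s)
        norm = np.linalg.norm(w)
        rows.append(np.abs(w) / norm if norm > 0 else w)
    P = np.vstack(rows)                               # shape: (#sparsity levels, d)
    plt.imshow(P, cmap="gray_r", aspect="auto")       # reversed gray: 0 -> white, 1 -> black
    plt.xlabel("feature index l")
    plt.ylabel("sparsity parameter s (increasing downwards)")
    plt.show()
```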

Figure 5.2: Plot of the empirical covariance vector x̄ = Σ_{k=1}^n y_k x^{std}_k to which the soft-thresholding of (Rob 1-Bit) is applied, for α = 1.0, α = 0.5, and α = 0.1 (panels (a), (b), (c)). Note that we are already working with the scattering representation of the data here, i.e., d' = 2048/2^4 = 128.

Figure 5.3: Path plots for SPA via (Rob 1-Bit) with α = 1.0, α = 0.5, and α = 0.1 (panels (a), (b), (c)). The red dots indicate the true peak positions.

Figure 5.4: Path plots for SPA via (l_1-SVM) (panels (a), (b), (c)) and (LASSO) (panels (d), (e), (f)) with α = 1.0, 0.5, and 0.1, respectively. The red dots indicate the true peak positions.

5.1.3 Discussion

Let us first consider the results for (Rob 1-Bit) in Figure 5.3, which are relatively easy to explain. Indeed, we have seen in Section 4.3 that SPA-Select via (Rob 1-Bit) is just a soft-thresholding of the vector x̄ := Σ_{k=1}^n y_k x^{std}_k ∈ R^{d'}, which is also illustrated in Figure 5.2. Combining this with the step of SPA-Sparsify, we immediately obtain the corresponding path plots of Figure 5.3. In order to understand the case of α = 1.0, we shall recall the computation of the covariance in (4.21). For our concrete realization of X, one can easily show that Cov(Y, X^{std}_{l,1}) is essentially proportional to E(I_m) · s_m · p_m if l is close to the center of the m-th peak. Hence, the middle peaks of x̄ are the most significant ones (see Figure 5.2(a)). Such a behavior is of course undesirable, since the most relevant features are selected relatively late in the path. But as we have argued in the theoretical part, this drawback can be resolved by decreasing α. The selection order for α = 0.5 has significantly drifted toward the right peaks, and finally, if α = 0.1, the ranking of the features almost coincides with the ranking of the activation probabilities p_1, ..., p_M. However, we can clearly observe in Figure 5.2(c) that the second-to-last peak seems to be less relevant than the third-to-last one. This circumstance, which is due to the noisy amplitudes, emphasizes once more that a standardization of the data should be handled with care, especially when only a few samples are available.

The behavior of the path plots for (l_1-SVM) in Figure 5.4 is obviously more complicated. At the very beginning of the path (s small), the hinge loss functional is actually linear, and similar to (Rob 1-Bit), the first activated feature corresponds to the maximal entry of x̄. But eventually, the non-linearity of the hinge loss comes into play. Then, correct classifications no longer contribute to the optimization (the data point lies beyond the margin), and consequently, more relevant features are selected. This can be clearly observed for the path plots in Figures 5.4(a) and 5.4(b). In that way, we finally obtain a feature ranking which is consistent with the activation probabilities, even without standardizing. However, one has to be careful when choosing s too large, since the optimization problem of the l_1-SVM becomes unconstrained at some point.¹ In this situation, we may lose some important properties, such as the implicit detection of redundancies. The selection process therefore becomes unstable and some uncorrelated features on the baseline are activated. One possible way out is again to decrease the interpolation parameter α, such that the most relevant features are pushed toward the root of the solution path (see Figure 5.4(c)).

The results for the LASSO in Figures 5.4(d) to 5.4(f) are very similar to those for the l_1-SVM; this observation is remarkable, since the respective loss functions of both approaches are quite different. When s is too large, the optimization of (LASSO) becomes unconstrained as well, which means that we end up with an ordinary least squares fit. Finally, it should again be emphasized that all results of Figure 5.4 were achieved without the step of SPA-Sparsify, showing that the l_1-SVM and the LASSO are both able to perform a non-redundant feature selection (at least at the root of the path).

¹ This means that the solution of (l_1-SVM) coincides with the minimizer of the hinge loss functional without any constraint (s = ∞).
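The observation that (Rob 1-Bit) acts as a soft-thresholding of x̄ can be written down in a few lines (illustrative sketch; the precise relation between the threshold and the sparsity parameter s is simplified here).

```python
# Sketch of the observation above: (Rob 1-Bit) acts as a soft-thresholding of
# the empirical covariance vector x_bar = sum_k y_k x_k^std.  How the threshold
# depends on the sparsity parameter s is deliberately left abstract.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def rob_1bit_selection(X_std, y, threshold):
    x_bar = (y[:, None] * X_std).sum(axis=0)          # empirical covariance vector
    return soft_threshold(x_bar, threshold)           # sparse feature vector omega
```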

Figure 5.5: Plots of the covariance vector x̄ without (a) and with (b) using the scattering transform (α = 1.0).

Figure 5.6: Plots of the covariance vector x̄ for noisy data (a) and noiseless data, i.e., σ_shift = 0 and σ_noise = 0 (b), with α = 0.0.

The above experiments verify that the theoretically predicted behavior of SPA can also be observed in practice. In particular, we could illustrate the benefit of SPA-Standardize for the feature selection. But one should always be aware of the fact that the actual success of a fingerprint detection is extremely sensitive to the choice of the tuning parameters and heavily depends on the quality of the raw data.

Remark 5.1
(1) The Impact of SPA-Scattering: The benefit of SPA-Scattering can be visualized by plotting the vector x̄ = Σ_{k=1}^n y_k x^{std}_k for the case where this preprocessing step was completely omitted (see Figure 5.5). By the definition of our simulation model, all peak centers are affected by small random shifts. This explains the significant amplitude of the first feature atom (m = 1) in Figure 5.5(a), although it is actually not correlated with the disease. Thus, even small peak shifts can lead to high-amplitude noise in the raw data, which could cause serious problems, especially when using (Rob 1-Bit). Figure 5.5(b) shows that the influence of this type of noise can be significantly reduced by SPA-Scattering.

(2) The Role of the Baseline Noise: We have already pointed out in Remark 4.12 that the complete absence of (baseline) noise could, surprisingly, have negative consequences, particularly when the interpolation parameter α is too small. Figure 5.6 shows the covariance vectors x̄ for two simulated data sets, one without any noise (σ_shift = 0 and σ_noise = 0) and the other one generated as described above (see Table 5.1). In the noiseless case, the good localization of the peak maxima has been completely lost, and moreover, spiked artifacts appear between the individual atoms. The feature selection would therefore become highly unstable in this situation.

5.2 Classification Performance for Real-World Data

5.2.1 Data Sets

The real-world data set that is used for the following experiments was obtained from the University Hospital Leipzig (UHL). This clinical study contains samples from 20 patients with pancreatic cancer and 19 controls. For each test person, four (MALDI-TOF) mass spectra were generated, such that we have 76 healthy (#G_+ = 76) and 80 diseased samples (#G_- = 80) in total.¹ For more details on the specific sample preparation, see [31]. The only preprocessing step which was applied to the raw data is a baseline subtraction by top-hat filtering (cf. Remark 3.3 and [21]). Since the implementation of SPA-Scattering by ScatNet only allows for input vectors whose dimension is a power of 2, all spectra have been cut at the index 32,768 (the original raw spectra contain 42,390 entries). This particularly implies that only a part of the raw data is used; but this restriction is not crucial because the purpose of the subsequent experiments is not to extract a specific biomarker.

¹ In particular, the theoretical assumption that all samples were acquired independently is obviously not satisfied here, since we have obtained four spectra from each patient. The effective number of (independent) samples is therefore actually smaller than n = 156.

5.2.2 Performed Experiments

In contrast to the simulated data set of the previous section, we neither know where the feature atoms of the input spectra are located nor how relevant they are. Therefore, analyzing the performance of SPA by a path plot is inappropriate in this setting. One possibility to still evaluate the results of a feature selection is to perform a K-fold cross-validation (CV). The basic idea is to (randomly) split the sample set into K parts (folds). Then, we use K − 1 of these folds (the training set) to learn an appropriate feature vector, which is finally applied to classify the remaining fold (the test set). Algorithm 5.1 shows how we specifically proceed for SPA.

Algorithm 5.1: Classification Performance of SPA via Cross-Validation
Input: Raw data (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {−1, +1}; SPA configuration; number of CV-folds K
Output: Classification accuracy Acc ∈ [0, 1]; average sparsity #F (number of active features)
1. Split the sample set {1, ..., n} randomly into K disjoint folds P_1, ..., P_K ⊂ [n] of (almost) equal size. For each fold k ∈ {1, ..., K} compute:
2.   Compute the feature vector ω^k via SPA by using the samples of ⋃_{k' ∈ [K]\{k}} P_{k'}.
3.   Dimension reduction via SPA-Reduce: Project all spectra onto supp(ω^k) and put #F^k := ‖ω^k‖_0.
4.   Classification of P_k: Use the projected samples of ⋃_{k' ∈ [K]\{k}} P_{k'} to predict the labels of the spectra in P_k by an ordinary SVM. Denote the prediction accuracy¹ by Acc^k.
5. Compute the average accuracy Acc := (1/K) Σ_{k=1}^K Acc^k and the average sparsity #F := (1/K) Σ_{k=1}^K #F^k.

It is very important to observe that the classifier ω^k is always trained without using the samples of P_k (cf. Line 2), i.e., no information of the k-th fold is incorporated. The actual prediction of the labels of P_k takes place in Line 4 by using an l_2-SVM.² Here, we could have also applied another standard classification algorithm from machine learning, since the input data set is now relatively low-dimensional. Furthermore, it should be emphasized that Algorithm 5.1 makes use only of the support of ω^k and not of its entries. The accuracy therefore solely depends on the positions of the selected features and not on their respective weightings.

The following table shows those SPA parameters which remain fixed for the experiments in this section:

SPA-Step           Parameter                      Value
SPA-Normalize      Scaling factors                λ_k = 1/‖x_k‖_1 for k ∈ [n]
SPA-Scattering     Wavelet and low-pass filter    morlet_1d filterbank of ScatNet
                   Wavelet scale                  j = 5
                   Window scale                   J = 5
SPA-Standardize    Correlation parameter          c_exp = 5
                   Scaling parameter              λ = 1/m(X^scat)
SPA-Sparsify       Threshold                      ε = 10

¹ This means that we compare the predicted labels to the known sample labels y_1, ..., y_n, respectively.
² See also Remark. Recall that SPA-Reduce particularly performs a preprocessing by SPA-Normalize and SPA-Scattering before the data vectors are finally projected.
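Returning to Algorithm 5.1, its cross-validation loop can be sketched compactly in code (Python/scikit-learn for illustration only; `spa_select` is a hypothetical stand-in for the SPA pipeline of Lines 2 and 3, and a linear SVC plays the role of the ordinary l_2-SVM in Line 4).

```python
# Hedged sketch of Algorithm 5.1.  `spa_select` must return a feature vector
# whose support defines the selected features; only this support is used for
# the subsequent classification, as in the text.  Assumes a non-empty support.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def spa_cross_validation(X, y, spa_select, K=5, seed=0):
    accs, sparsities = [], []
    for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        omega = spa_select(X[train_idx], y[train_idx])    # Line 2: feature vector omega^k
        support = np.flatnonzero(omega)                   # Line 3: supp(omega^k)
        sparsities.append(support.size)                   #         #F^k = ||omega^k||_0
        clf = SVC(kernel="linear").fit(X[train_idx][:, support], y[train_idx])
        accs.append(clf.score(X[test_idx][:, support], y[test_idx]))  # Line 4: Acc^k
    return float(np.mean(accs)), float(np.mean(sparsities))           # Line 5
```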

The choice of the scaling factors in SPA-Normalize is based on the idea of total ion count from Subsection 3.2.1, meaning that the total number of ions should be equal for all spectra. The remaining parameters were again chosen adaptively. Similar to the simulation study of Section 5.1, SPA is then performed for different interpolation parameters, α ∈ {1.0, 0.5, 0.2, 0.0}, and selection algorithms (again with a whole range of sparsity parameters s). More precisely, the cross-validation of Algorithm 5.1 is applied for each combination with K = 5. This step is repeated 12 times, using exactly the same configuration (but different random partitions), in order to make the results more reliable; the final accuracy and average sparsity are then obtained by computing the mean values over all iterations.

Figure 5.7 shows the results of these experiments. Every plot corresponds to a specific configuration, visualizing the relationship between the (average) sparsity #F and the achieved classification accuracy Acc. Note that the component detection of SPA-Sparsify (Lines 2 and 3) was again omitted for (l_1-SVM) and (LASSO), in order to verify their ability to select non-redundant features.

5.2.3 Discussion

The major purpose of the plots in Figure 5.7 is to illustrate how accurately a certain SPA configuration can classify the training set at a fixed level of sparsity. A first important observation is that all curves of Figure 5.7 make a jump at a relatively small value of #F, and afterwards, the accuracy grows more slowly. Such a behavior indicates that strongly correlated features have been selected already at the very beginning of the solution paths. And indeed, this precisely meets our goal of finding a sparse classifier which also achieves a good generalization performance.

Comparing the results of (Rob 1-Bit) for different values of α, it turns out that the accuracy for #F ≤ 10 becomes significantly better as α decreases. This observation is particularly consistent with the theory of Section 4.3, where we have figured out that the feature ranking for α ≈ 0 is essentially ordered by the respective disease correlations (cf. (4.22)). The overall classification performance of (Rob 1-Bit), on the other hand, becomes slightly worse when α is small. In fact, the influence of the baseline noise is greater in this case, so that more false-positive features have been selected during the cross-validation procedure of Algorithm 5.1. The latter phenomenon can be observed for (l_1-SVM) and (LASSO) as well. This coincides with Remark 4.18, which states that choosing α too small could have a negative impact on the stability. For α = 1.0 and α = 0.5, in contrast, we achieve almost 100% accuracy for both algorithms, which clearly outperforms (Rob 1-Bit). In conclusion, multivariate approaches should usually be preferred for practical applications, especially because of their ability to avoid redundant feature selections. Finally, it is worth mentioning that, similar to the experiments for the simulated data, both (l_1-SVM) and (LASSO) show almost the same performance, although they penalize (mis-)classifications in a completely different manner.

Figure 5.7: Classification results of SPA for real-world data. The plots 5.7(a) to 5.7(d) compare the performance of (Rob 1-Bit), (l_1-SVM), and (LASSO) for α = 1.0, 0.5, 0.2, and 0.0, respectively. Figure 5.7(e) (α = 1.0) shows how the behavior of (Rob 1-Bit) changes when the step of SPA-Scattering is omitted.

The remaining plot, Figure 5.7(e), visualizes the impact of SPA-Scattering. The two accuracy curves behave very differently, indicating that the active feature variables substantially differ from each other. This particularly explains why SPA without SPA-Scattering is slightly more accurate for #F ≥ 17. But as one might have expected, the overall classification performance with the additional step of SPA-Scattering is better, especially for small values of the average sparsity (#F between 5 and 15).

To summarize, the above experiments have verified that SPA can provide very promising results for both simulated and real-world data. Furthermore, we could observe many connections to the theoretical statements of Chapter 4, which have particularly helped us to better understand the numerical simulations. However, the major challenge of appropriately choosing the tuning parameters remains, and these parameters showed a significant influence on the quality of the results.

Remark 5.2 The original work on SPA [21] presents similar experiments for the same real-world data set that we have used in the current section. The interested reader might also consider the discussion part of this paper because it includes a medical interpretation of the detected biomarkers, which is not contained here.

The authors of [21] have additionally incorporated some information criteria¹ for their experiments in order to determine the optimal sparsity parameter s for a given SPA configuration. Indeed, this is a very common strategy in learning theory; but since the optimality of s does not necessarily indicate the biological relevance of a classifier, we have confined ourselves to evaluating the results visually, as in Figures 5.3, 5.4, and 5.7.

Moreover, [21] compares the performance of SPA (via (Rob 1-Bit)) to several state-of-the-art methods, such as the LASSO and the elastic net (without any pre- and postprocessing). In this context, SPA clearly outperforms the other approaches. But such a comparison seems to be a bit unfair because the superiority of SPA is essentially due to the preprocessing, which could also be applied to other common feature selection algorithms. And in any case, our major goal is rather to analyze the theoretical aspects of SPA than to provide the best-tuned method for proteomics data processing.

¹ For simulated data, the BIC (Bayes Information Criterion) is used, and for real-world data, the AUC-ROC (Area Under the ROC Curve).

6 Extensions and Further Approaches

The previous chapters already contain several hints at possible extensions of SPA. This chapter presents some of these extensions, which could be interesting for future work. For the sake of brevity, the following sections are written in a much more sketchy and less rigorous language compared to the theory part of Chapter 4.

6.1 Using Multiple Layers of the Scattering Transform

The application of the scattering transform in SPA-Scattering is restricted to a single node (path) in the first layer of the underlying convolutional network. This limitation could lead to serious problems when we would like to apply SPA to more general types of input data. A typical example would be two-dimensional mass spectra (see Figure 6.1). Such a datum is usually represented by a matrix x ∈ R^{d_1 × d_2} instead of a vector. Thus, it would be important to first extend SPA-Scattering by the higher-dimensional version of the scattering transform, which particularly includes the introduction of oriented wavelets (cf. [12, 57]). In this situation, it could easily happen that a single wavelet scale is not sufficient anymore to capture all relevant details. Therefore, as a second extension, one needs to make SPA capable of dealing with multiple nodes of a scattering network.

But such a step would also entail a generalization of the feature selection process. For example, let us suppose that applying a vector x ∈ R^d to SPA-Scattering yields two nodes of the scattering transform, say x^{scat,1}, x^{scat,2} ∈ R^{d'}. This doubles the number of feature variables in SPA-Select; but actually, many entries of x^{scat,1} and x^{scat,2} are strongly coherent, since x^{scat,1}_l and x^{scat,2}_l correspond to the same spatial location of x for every l ∈ [d']. For this reason, it would be useful to incorporate a certain linking between feature variables in SPA-Select. A very popular approach to meet this challenge is the so-called grouped LASSO. The basic idea is to partition the feature index set [d] = {1, ..., d} into D groups G_1, ..., G_D ⊂ [d] and then to modify the penalty constraint of (LASSO) as follows:

$$\min_{\omega \in \mathbb{R}^d} \sum_{k=1}^n \big(y_k - \langle x_k, \omega\rangle\big)^2 \quad \text{subject to} \quad \sum_{m=1}^D \sqrt{g_m}\, \|\omega_{G_m}\|_2 \leq s, \qquad \textup{(Grp-LASSO)}$$

where g_m := #G_m and ‖ω_{G_m}‖_2 = √(Σ_{l ∈ G_m} ω_l^2). Thus, we do not simply compute the l_1-norm of all entries of ω but of the Euclidean norms of the groups. Since ‖ω_{G_m}‖_2 is zero if and only if all components of ω_{G_m} are zero, it turns out that solving (Grp-LASSO) promotes sparsity at both the group and the individual level (cf. [40]). In the above example, we would consider vectors of the form [x^{scat,1} x^{scat,2}] ∈ R^{2d'} and put G_m := {m, m + d'} for m ∈ [d'] (i.e., d = 2d' and D = d'). In this way, the spatially correlated variables of the scattering transform get linked, and SPA-Select via (Grp-LASSO) would tend to either select both features or none of them. For further details on the grouped LASSO, the reader is referred to the works of [13, 42, 87], and for the more general concept of group sparsity (or structured sparsity), one may consider [28, 43].

Figure 6.1: Illustration of a two-dimensional mass spectrum (axes: mass (m/z), time (s), intensity (cts)). In this example, one has acquired (ordinary) MS-data of some (fixed) sample for a certain period of time. Thus, we have two independent components here, namely the mass and the time. Depending on the time resolution, such a data matrix could have hundreds of millions of entries.

6.2 Feature Space Maps and Kernel Learning

As already mentioned in Subsection 3.2.2, the scattering transform can also be seen as a highly non-trivial feature space map. In general, a feature space map is a function Φ: R^d → H from the raw-data space R^d into some (possibly infinite-dimensional) Hilbert space H. The main purpose of such a transform is that the data can be (more easily) separated by a lifted classification model

$$y = \operatorname{sign}\big(\langle \Phi(x), \omega\rangle_{\mathcal{H}}\big), \qquad y \in \{-1, +1\},\ x \in \mathbb{R}^d,\ \omega \in \mathcal{H}.$$

However, the feature map Φ could be very complicated and sometimes impossible to evaluate in practice. Therefore, it was a crucial observation that the classifiers of many algorithms, most prominently SVMs, can be solely expressed in terms of the so-called kernel function k(x, x') = ⟨Φ(x), Φ(x')⟩_H.
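To make the kernel idea concrete, the following toy check (an illustration, not from the thesis) verifies that for the homogeneous degree-2 polynomial kernel k(x, x') = ⟨x, x'⟩², the explicit feature map Φ(x) = vec(x xᵀ) yields exactly the same Gram matrix; an algorithm that only needs inner products therefore never has to evaluate Φ explicitly.

```python
# Illustration: explicit feature map vs. kernel evaluation give the same
# Gram matrix for the homogeneous degree-2 polynomial kernel.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))                       # 5 toy samples in R^3

Phi = np.array([np.outer(x, x).ravel() for x in X])   # explicit feature map, R^9
gram_explicit = Phi @ Phi.T                           # <Phi(x), Phi(x')>
gram_kernel = (X @ X.T) ** 2                          # k(x, x') = <x, x'>^2

assert np.allclose(gram_explicit, gram_kernel)
```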


More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs

Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs Ammon Washburn University of Arizona September 25, 2015 1 / 28 Introduction We will begin with basic Support Vector Machines (SVMs)

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information