Diversity-Promoting and Large-Scale Machine Learning for Healthcare


Diversity-Promoting and Large-Scale Machine Learning for Healthcare

Pengtao Xie
December 2017

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA

Thesis Committee:
Eric P. Xing, Chair
Ruslan Salakhutdinov
Pradeep Ravikumar
Ryan Adams (Princeton)
David Sontag (MIT)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2017 Pengtao Xie


Abstract

In healthcare, a tsunami of medical data has emerged, including electronic health records, images, literature, etc. These data can be heterogeneous and noisy, which renders clinical decision-making time-consuming, error-prone and suboptimal. In this thesis, we develop machine learning (ML) models and systems for distilling high-value patterns from unstructured clinical data and making informed and real-time medical predictions and recommendations, to aid physicians in improving workflow efficiency and the quality of patient care. When developing these models, we encounter several challenges: (1) How to better capture infrequent clinical patterns, such as rare subtypes of diseases? (2) How to improve the timeliness of decision-making without sacrificing its quality? (3) How to make the models generalize well on unseen patients? (4) How to promote the interpretability of the decisions? (5) How to efficiently discover massive clinical patterns from large-scale data? To address challenges (1-4), we systematically study diversity-promoting learning, which encourages the components in ML models (1) to diversely spread out to give infrequent patterns a broader coverage, (2) to be mutually complementary for a more compact representation of information, (3) to be imposed with structured constraints for better generalization performance and (4) to be less redundant for better interpretation. The study is performed in the context of both frequentist statistics and Bayesian statistics. In the former, we develop diversity-promoting regularizers that are empirically effective, theoretically analyzable and computationally efficient. In the latter, we develop Bayesian priors that effectively entail an inductive bias of diversity among a finite or infinite number of components and facilitate the development of efficient posterior inference algorithms. To address challenge (5), we study large-scale learning. Specifically, we design efficient distributed ML systems by exploiting a system-algorithm co-design approach. Inspired by a sufficient factor property of many ML models, we design a peer-to-peer system, Orpheus, that significantly reduces communication and computation costs. We apply the proposed diversity-promoting learning (DPL) techniques and distributed ML systems to address several critical issues in healthcare, including discharge medication prediction, automatic generation of medical-imaging reports, automatic ICD code filling, similar-patient retrieval and large-scale medical-topic discovery. Evaluations on various clinical datasets demonstrate the effectiveness of the DPL methods and the efficiency of the Orpheus system.

Contents

1 Introduction
  1.1 Thesis Introduction and Scope
  1.2 Contributions
    1.2.1 Diversity-Promoting Learning
    1.2.2 Large-Scale Distributed Learning
    1.2.3 ML for Healthcare
    1.2.4 Proposed
  1.3 Timeline

2 Diversity-Promoting Learning I: Regularization
  2.1 Uncorrelation and Evenness: a Diversity-Promoting Regularizer
    2.1.1 Uniform Eigenvalue Regularizer
    2.1.2 Case Study: Distance Metric Learning
    2.1.3 Evaluation
  2.2 Convex Diversity-Promoting Regularizers
    2.2.1 Nonconvex Bregman Matrix Divergence Regularizers
    2.2.2 Convex Bregman Matrix Divergence Regularizers
    2.2.3 A Proximal Gradient Descent Algorithm
    2.2.4 Evaluation
  2.3 Angular Constraints for Improving Generalization Performance
    2.3.1 Angular Constraints
    2.3.2 An ADMM-based Algorithm
    2.3.3 Evaluation

3 Diversity-Promoting Learning II: Bayesian Inference
  3.1 Diversity-Promoting Learning of Bayesian Parametric Models
    3.1.1 Mutual Angular Process
    3.1.2 Case Study: Bayesian Mixture of Experts Model
    3.1.3 A Variational Inference Algorithm
    3.1.4 Evaluation
  3.2 Diversity-Promoting Learning of Bayesian Nonparametric Models
    3.2.1 Infinite Mutual Angular Process
    3.2.2 Case Study: Infinite Latent Feature Model
    3.2.3 An MCMC Sampling Algorithm
    3.2.4 Evaluation

4 Diversity-Promoting Learning III: Analysis
  4.1 Analysis of Better Capturing of Infrequent Patterns
  4.2 Analysis of Generalization Errors
    4.2.1 Generalization Error Analysis for Angular Constraints
    4.2.2 Estimation Error Analysis for Nonconvex Bregman Matrix Divergence Regularizers

5 Large-Scale Learning via System and Algorithm Co-design
  5.1 Sufficient Factor Property
  5.2 Orpheus: a Light-Weight Peer-to-Peer System
    5.2.1 Communication
    5.2.2 Computation
    5.2.3 Convergence Analysis
    5.2.4 Evaluation

6 Applications to Healthcare
  6.1 Discharge Medication Prediction at Admission Time
    6.1.1 Methods
    6.1.2 Evaluation
    6.1.3 Proposed: Diversity-Promoting Learning for More Accurate Prediction of Infrequent Medications
  6.2 Automatic Generation of Text Reports for Medical Images
    6.2.1 Methods
    6.2.2 Evaluation
    6.2.3 Proposed: Diversity-Promoting Learning for More Interpretable Report Generation
  6.3 Automatic ICD Code Filling
    6.3.1 Methods
    6.3.2 Evaluation
    6.3.3 Proposed: Diversity-Promoting Learning for Overfitting Alleviation in ICD Coding
  6.4 Diversity-Promoting Learning for Faster Retrieval of Similar Patients
  6.5 Large-Scale Distributed Learning for Medical Topic Discovery

Bibliography

Chapter 1  Introduction

1.1 Thesis Introduction and Scope

With the widespread adoption of Electronic Health Records (EHR) systems, a tsunami of medical data has emerged, which is becoming an increasingly important source of holistic and detailed information for both healthcare providers and receivers. Collectively analyzing and digesting this rich information generated from multiple sources; uncovering the health implications, risk factors, and mechanisms underlying the heterogeneous and noisy data records at both the individual-patient and whole-population levels; and making clinical decisions including diagnosis, triage, and treatment thereupon are now routine activities expected of medical professionals including physicians, nurses and pharmacists. As the amount and complexity of medical data rapidly grow, these activities are becoming increasingly difficult for human experts. The information overload makes medical analytics and decision-making time-consuming, error-prone, suboptimal and less transparent.

The advancement of machine learning (ML) technology opens up opportunities for next-generation computer-aided medical data analysis and data-driven clinical decision-making, where machine learning algorithms and systems can be developed to automatically and collectively digest massive medical data such as electronic health records, images, behavioral data, and the genome, to make data-driven and intelligent diagnostic predictions. An ML system can automatically analyze multiple sources of information with rich structure, uncover the medically meaningful hidden concepts from low-level records to help medical professionals easily and concisely understand the medical data, and create a compact set of informative diagnostic procedures and treatment courses and make healthcare recommendations thereupon.

We aim at developing machine learning algorithms and systems for automatic, smart, data-driven medical predictions, recommendations and decision-making. Specifically, we focus on the following clinical applications: predicting discharge medications, automatically filling ICD codes, generating reports from medical images, measuring patient similarity, and discovering medical topics from texts. During the development of these algorithms, we identify several fundamental issues.

How to better capture infrequent patterns? At the core of ML-based healthcare is to discover the latent patterns (e.g., topics in clinical notes, disease subtypes, phenotypes) underlying

the observed clinical data. Under many circumstances, the frequency of patterns is highly imbalanced [113]. Some patterns have very high frequency while others occur less frequently. For instance, in intensive care units, diseases (patterns) such as hemorrhage and pneumonia occur frequently while others like chronic rhinitis and congenital hypothyroidism are of low frequency. Existing ML models lack the capability to capture infrequent patterns, which is possibly due to the design of the objective functions used for training [109]. For example, a maximum likelihood estimator would reward itself by modeling the frequent patterns well, as they are the major contributors to the likelihood function. Infrequent patterns, on the other hand, contribute much less to the likelihood; it is therefore not very rewarding to model them well and they tend to be ignored. Infrequent patterns are of crucial importance in clinical settings. For example, many infrequent diseases are life-threatening, and it is critical to capture them.

How to compress model size without sacrificing modeling power? In clinical practice, making a timely decision is crucial for improving patient outcomes. To achieve time efficiency, the size (specifically, the number of weight parameters) of ML models needs to be kept small. However, reducing the model size, which accordingly reduces the capacity and expressivity of the model, typically sacrifices modeling power and performance. It is technically appealing but challenging to compress model size without losing performance.

How to alleviate overfitting? In certain clinical applications, the number of medical records available for training is limited. For example, when training a diagnostic model for an infrequent disease, we typically have no access to a sufficiently large number of patient cases due to the rareness of this disease. Under such circumstances, overfitting easily happens: the trained model works well on the training data but generalizes poorly to unseen patients. To alleviate overfitting, we need to incorporate prior beliefs about the model structure.

How to improve interpretability? Being interpretable and transparent is a must for an ML model to be willingly used by human physicians. Oftentimes, the patterns extracted by existing ML methods have a lot of redundancy and overlap [97], which makes them ambiguous and difficult to interpret. For example, in computational phenotyping from EHRs, it has been observed that the phenotypes learned by standard matrix and tensor factorization algorithms overlap heavily, causing confusion such as two similar treatment plans being learned for the same type of disease [97]. It is necessary to make the learned patterns distinct and interpretable.

How to efficiently learn large-scale models? In certain healthcare applications, both the model size and the data size are large, incurring substantial computation overhead that exceeds the capacity of a single machine. It is necessary to design and build distributed systems to efficiently train such models.

To solve the first four problems, we study diversity-promoting learning (DPL) [61, 106, 107, 109, 110, 112, 113, 114, 115, 116, 117]. Many ML models are equipped with components, each aiming at capturing a latent pattern and parameterized by a weight vector. For instance, in a topic model [15], the components are referred to as topics, aiming at discovering the semantics underlying documents. Each topic is associated with a weight vector.
DPL aims at encouraging the component vectors to be diverse. First, regarding better capturing infrequent patterns, diversified components are expected to be less aggregated over frequent patterns, and part of

them would be spared to cover the infrequent patterns [107]. Second, regarding performance-lossless model compression, diversified components bear less redundancy and are mutually complementary, making it possible to capture information sufficiently well with a small set of components [106]. Third, regarding alleviating overfitting, promoting diversity imposes a structural constraint on model parameters, which reduces the model capacity and therefore improves generalization performance on unseen data [110]. Fourth, regarding interpretability, if components are encouraged to be distinct from each other and non-overlapping, then it is cognitively easy for a human to associate each component with an object or concept in the physical world [97].

To address the fifth problem, we design efficient distributed ML systems [105, 111, 119, 126] by exploiting a system-algorithm co-design approach: system design should be tailored to the unique mathematical properties of ML algorithms, and algorithms can be re-designed to better exploit the system architecture. We apply the developed diversity-promoting learning techniques and distributed ML systems to the aforementioned healthcare applications [45, 82, 85, 121, 128].

1.2 Contributions

Overall, the contributions of this thesis are made in three areas: diversity-promoting learning, large-scale distributed learning, and ML-based healthcare.

1.2.1 Diversity-Promoting Learning

This thesis is the first to systematically study this new learning paradigm: diversity-promoting learning. The study is performed in the context of both frequentist statistics and Bayesian statistics. In the former, we develop diversity-promoting regularizers that are empirically effective, theoretically analyzable and computationally efficient. In the latter, we develop Bayesian priors that effectively entail an inductive bias of diversity among a finite or infinite number of components and facilitate the development of efficient posterior inference algorithms.

Diversity-Promoting Regularization

In diversity-promoting regularization, we made the following contributions.

We propose to characterize diversity from two perspectives, uncorrelation and evenness, based on which we define a uniform eigenvalue regularizer (UER). Compared with previous diversity-promoting regularizers, the UER measures diversity in a global way, is insensitive to vector scaling and is amenable to computation. We apply UER to distance metric learning and long short-term memory networks and develop an efficient projected gradient descent algorithm. In various experiments, we demonstrate the effectiveness of UER in better capturing infrequent patterns, reducing model size without sacrificing modeling power and improving generalization performance.

Considering that UER is nonconvex, which presents great challenges for optimization, we develop a family of convex diversity-promoting regularizers based on Bregman matrix divergence

(BMD), where the global optimum is guaranteed to be achievable. We apply these regularizers to distance metric learning and develop an efficient proximal gradient algorithm. In experiments, we demonstrate the advantages of the convex BMD (CBMD) regularizers over their nonconvex counterparts. First, because the global optimal solution is achievable, CBMD obtains better modeling performance. Second, unlike nonconvex regularizers that need multiple (random) restarts to reach a better local optimum, CBMD runs only once and hence is computationally more efficient.

While UER and the convex BMD regularizers are empirically effective in alleviating overfitting, a theoretical analysis of their effectiveness is difficult to establish. In light of this, we propose a new regularization approach, angular constraints (ACs), that is both empirically effective and theoretically analyzable. The analysis reveals that properly manipulating the ACs can achieve the lowest generalization errors. We develop an efficient ADMM-based algorithm and demonstrate the empirical effectiveness of ACs in alleviating overfitting of deep neural networks and sparse coding.

Diversity-Promoting Bayesian Learning

In diversity-promoting Bayesian learning, we made the following contributions.

We define a mutual angular process (MAP), which is a Bayesian prior biased toward components that have large mutual angles. This prior facilitates the development of variational inference, an efficient posterior inference algorithm. We apply this prior to a Bayesian mixture of experts model and demonstrate its effectiveness and efficiency in experiments.

To promote diversity in Bayesian nonparametric models, we extend the MAP to an infinite MAP (IMAP), which encourages infinitely many components to have large mutual angles. We apply the IMAP to an infinite latent feature model and develop a sampling algorithm based on slice sampling and Riemann manifold Hamiltonian Monte Carlo. Experiments demonstrate the effectiveness of IMAP.

Theoretical Analysis

We performed various analyses to formally understand the effectiveness of promoting diversity and made the following contributions.

We analyze why the nonconvex Bregman matrix divergence (BMD) regularizers can better capture infrequent patterns. In the context of distance metric learning, we define an imbalance factor (the lower, the better) to characterize the performance on infrequent patterns. The analysis shows that decreasing the BMD regularizers reduces the upper bound of the imbalance factor and hence achieves better performance on infrequent patterns.

We analyze how the angular constraints (ACs) affect the generalization error (which is the sum of estimation error and approximation error) of neural networks. The analysis reveals that a stronger regularization reduces the estimation error and increases the approximation error. Properly tuning the regularization strength achieves the best tradeoff between these two types of errors and accordingly the optimal generalization performance on unseen data.

We analyze how the log-determinant divergence (LDD) regularizer, which is a specific nonconvex BMD regularizer, affects the estimation error of distance metric learning. The analysis shows that decreasing the LDD regularizer effectively reduces the estimation error bound.

1.2.2 Large-Scale Distributed Learning

The second part of this thesis studies large-scale learning. We design efficient distributed ML systems by exploiting a system-algorithm co-design approach. Specifically, inspired by a mathematical property of many ML models that are parameterized by matrices, the sufficient factor (SF) property, we design a peer-to-peer system that significantly reduces communication and computation costs. We made the following contributions.

For efficient communication, we propose (1) sufficient factor broadcasting (SFB), which transfers small-sized vectors among machines for the synchronization of matrix-form parameters; (2) random multicast, where each machine randomly selects a subset of machines to communicate with in each clock; and (3) SF selection, which selects a subset of the most representative SFs to communicate. These techniques greatly reduce the number of network messages and the size of each message.

For efficient computation, we propose to represent the parameter matrix using SFs and propose an SF-aware approach for matrix-vector multiplication, which reduces the cost from quadratic in the matrix dimensions down to linear (a short sketch follows this list of contributions).

We conduct a convergence analysis of the SFB computation model. The analysis shows that, though synchronized in a decentralized manner, the parameter replicas on different machines converge to the same optimum.

We evaluate our system on three representative ML models and show that (1) it achieves high efficiency: it is up to 13.1x, 5.6x, 5.9x and 5.6x faster than Spark [125], Bosen PS [101], TensorFlow [4], and MXNet [20], respectively, and with 34 or 40 machines it is able to train ML models with 1-7 billion parameters in 1-4 hours; and (2) it scales well with more machines, achieving a 30.4x speedup with 34 CPU machines and a 35.4x speedup with 40 GPU machines.
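To illustrate the SF-aware idea, here is a minimal sketch (with made-up sizes; not the Orpheus implementation). A parameter matrix maintained as a sum of sufficient factors, $\mathbf{W} = \sum_i \mathbf{u}_i\mathbf{v}_i^\top$, can be multiplied with a vector without ever being materialized:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, n_sf = 2000, 1000, 32          # matrix dimensions and number of sufficient factors
U = rng.standard_normal((n_sf, d1))     # u_i vectors (rows)
V = rng.standard_normal((n_sf, d2))     # v_i vectors (rows)
x = rng.standard_normal(d2)

# Naive: materialize W = sum_i u_i v_i^T, then multiply -- O(d1 * d2) memory and time.
W = U.T @ V
y_naive = W @ x

# SF-aware: compute the scalars v_i^T x first, then combine the u_i -- O(n_sf * (d1 + d2)).
y_sf = U.T @ (V @ x)

assert np.allclose(y_naive, y_sf)
```

The two results coincide; the SF-aware path simply exploits the associativity of the product, which is where the quadratic-to-linear cost reduction comes from.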

1.2.3 ML for Healthcare

In the third part of this thesis, we design ML models and apply the diversity-promoting and large-scale learning techniques developed in the first two parts to address critical problems in healthcare. We made the following contributions.

We study the prediction of discharge medications at admission time for treatment planning, by proposing a deep determinantal point process (DDPP) model. This model seamlessly integrates a determinantal point process for capturing high-order correlations among medications and a deep neural network for learning representations of electronic health records. To incorporate prior knowledge of drug interactions for better prediction, we propose a relational DDPP model. Evaluations on 8 antihypertensive medications and 52K patient-visits demonstrate the effectiveness of the proposed methods.

We study how to automatically generate textual reports from medical images to help physicians improve the quality and efficiency of writing reports, by designing a multi-task hierarchical model with co-attention. The evaluation is conducted on 7.5K radiology and 7.4K pathology images, where we demonstrate that the proposed model can effectively find the abnormal regions and generate high-quality reports.

We study the automatic filling of ICD codes based on physicians' free-form diagnosis descriptions, to reduce coding errors and costs, by designing an attentional matching model. Evaluation on 12K patient-visits and 50 ICD codes demonstrates the effectiveness of our method.

We apply the diversity-promoting distance metric learning model to achieve fast and accurate retrieval of similar patients, which aids timely and informative clinical decision-making. By promoting diversity, the dimension of the latent representations (and accordingly the time complexity of retrieval) can be reduced without sacrificing retrieval accuracy. On two electronic health record datasets, we demonstrate that, compared with no regularization, diversity-promoting regularization achieves better retrieval performance with faster retrieval speed.

We implement a distributed topic model (TM) on the Orpheus system and apply it to large-scale medical-topic discovery. Leveraging the sufficient factor (SF) property of the TM, Orpheus performs SF broadcasting and SF-aware multiplication to significantly reduce communication and computation costs. Using 34 CPU machines, Orpheus-TM is able to learn 50K topics from 8.2M PubMed documents (vocabulary size 141K) in 4.2 hours.

1.2.4 Proposed

In our proposed work, we plan to achieve the following.

The (relational) DDPP model designed for discharge medication prediction performs well on frequent medications but less well on infrequent medications. To improve the performance on infrequent drugs, we apply diversity-promoting regularization, which has the ability to better capture infrequent patterns, to the DDPP model.

In the automatic generation of medical-imaging reports, we apply diversity-promoting regularization to enhance the interpretability of the model.

In automatic ICD coding, to alleviate overfitting, we apply diversity-promoting regularization to the coding model.

1.3 Timeline

The tentative timeline is as follows.

Jan 2018: develop and evaluate the diversity-promoting DDPP model;
Feb 2018: develop and evaluate the diversity-promoting report-generation model;
Mar 2018: develop and evaluate the diversity-promoting ICD-coding model;
Jan-Mar 2018: thesis writing;
Mar 2018: defense.

Chapter 2  Diversity-Promoting Learning I: Regularization

In this chapter, we study diversity-promoting learning in the context of frequentist statistics, by developing regularization techniques.

2.1 Uncorrelation and Evenness: a Diversity-Promoting Regularizer

We start with formally defining diversity [113]. Drawing inspiration from principal component analysis [47], biological diversity [64] and information theory [23], we characterize diversity by considering two factors: uncorrelation and evenness. Uncorrelation measures how uncorrelated the components are: less correlation means more diversity. Evenness is borrowed from biological diversity [64], where it measures how equally important different species are in maintaining the ecological balance within an ecosystem; if no species dominates another, the ecosystem is deemed more diverse. Likewise, in latent space modeling, we desire the components to play equally important roles, with no one dominating another, such that each component contributes significantly to the modeling of data.

2.1.1 Uniform Eigenvalue Regularizer

We characterize the uncorrelation among components from a statistical perspective: treating the components as random variables and measuring their covariance, which is proportional to their correlation. Let $\mathbf{A} \in \mathbb{R}^{d\times m}$ denote the component matrix whose $k$-th column is the parameter vector $\mathbf{a}_k$ of component $k$. Alternatively, we can take a row view of $\mathbf{A}$: each component is treated as a random variable and each row vector $\tilde{\mathbf{a}}_i$ can be seen as a sample drawn from the random vector formed by the $m$ components. Let $\boldsymbol{\mu} = \frac{1}{d}\sum_{i=1}^{d}\tilde{\mathbf{a}}_i = \frac{1}{d}\mathbf{A}^\top\mathbf{1}$ be the sample mean, where the elements of $\mathbf{1}\in\mathbb{R}^d$ are all 1. We compute the empirical covariance matrix of the components as $\mathbf{G} = \frac{1}{d}\sum_{i=1}^{d}(\tilde{\mathbf{a}}_i - \boldsymbol{\mu})(\tilde{\mathbf{a}}_i - \boldsymbol{\mu})^\top = \frac{1}{d}\mathbf{A}^\top\mathbf{A} - (\frac{1}{d}\mathbf{A}^\top\mathbf{1})(\frac{1}{d}\mathbf{A}^\top\mathbf{1})^\top$. Imposing the constraint $\mathbf{A}^\top\mathbf{1} = \mathbf{0}$, we have $\mathbf{G} = \frac{1}{d}\mathbf{A}^\top\mathbf{A}$. Suppose $\mathbf{A}$ is a full-rank matrix and $m < d$; then $\mathbf{G}$ is a full-rank matrix with rank $m$.
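As a quick illustration (a minimal NumPy sketch with my own variable names and random data), the snippet below forms a component matrix, enforces the centering constraint $\mathbf{A}^\top\mathbf{1} = \mathbf{0}$ by subtracting column means, and checks that $\frac{1}{d}\mathbf{A}^\top\mathbf{A}$ matches the empirical covariance of the row samples.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 100, 10                        # ambient dimension d, number of components m (m < d)
A = rng.standard_normal((d, m))       # component matrix; column k is component a_k

# Enforce A^T 1 = 0 by centering each column (component).
A = A - A.mean(axis=0, keepdims=True)

# Row view: each of the d rows is a sample of the m component "random variables".
G = (A.T @ A) / d                     # empirical covariance G = (1/d) A^T A  (m x m)

# Sanity check against NumPy's covariance of the row samples (normalized by d).
G_np = np.cov(A, rowvar=False, bias=True)
assert np.allclose(G, G_np)
print("rank of G:", np.linalg.matrix_rank(G))   # m, if A is full rank and m < d
```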

For the next step, we show that the eigenvalues of $\mathbf{G}$ play important roles in characterizing the uncorrelation and evenness of components. We start with uncorrelation. Let $\mathbf{G} = \sum_{k=1}^{m}\lambda_k\mathbf{u}_k\mathbf{u}_k^\top$ be the eigendecomposition, where $\lambda_k$ is an eigenvalue and $\mathbf{u}_k$ is the associated eigenvector. As is well known from principal component analysis [47], an eigenvector $\mathbf{u}_k$ of the covariance matrix $\mathbf{G}$ represents a principal direction of the data points and the associated eigenvalue $\lambda_k$ tells the variability of points along that direction: the larger $\lambda_k$ is, the more spread out the points are along the direction $\mathbf{u}_k$. When the eigenvectors (principal directions) are not aligned with the coordinate axes, the level of disparity among the eigenvalues indicates the level of correlation among the $m$ components (random variables): the more different the eigenvalues are, the higher the correlation is. In light of this, we utilize the uniformity among the eigenvalues of $\mathbf{G}$ to measure how uncorrelated the components are.

Secondly, we relate the eigenvalues to the other factor of diversity: evenness. When the eigenvectors are aligned with the coordinate axes, the components are uncorrelated. In this case, we bring in evenness to measure diversity. As stated earlier, we first need to assign each component an importance score. Since the eigenvectors are parallel to the coordinate axes, the eigenvalues reflect the variances of the components. Analogous to PCA, which posits that random variables with larger variance are more important, we use variance to measure importance. According to the evenness criterion, the components are more diverse if their importance scores match, which motivates us to encourage the eigenvalues to be uniform.

To sum up, we desire the eigenvalues to be even in both cases: (1) when the eigenvectors are not aligned with the coordinate axes, even eigenvalues reduce the correlation of components; (2) when the eigenvectors are aligned with the coordinate axes, even eigenvalues ensure that different components contribute equally to modeling the data.

Next, we discuss how to promote uniformity among the eigenvalues. The basic idea is to normalize the eigenvalues onto a probability simplex and encourage the discrete distribution parameterized by the normalized eigenvalues to have small Kullback-Leibler (KL) divergence from the uniform distribution. Given the eigenvalues $\{\lambda_k\}_{k=1}^m$, we first normalize them onto a probability simplex, $\hat{\lambda}_k = \lambda_k/\sum_{j=1}^m\lambda_j$, based on which we define a distribution on a discrete random variable $X\in\{1,\ldots,m\}$ with $p(X = k) = \hat{\lambda}_k$. In addition, to guarantee that the eigenvalues are strictly positive, we require $\mathbf{A}^\top\mathbf{A}$ to be positive definite. To encourage $\{\hat{\lambda}_k\}_{k=1}^m$ to be uniform, we encourage the distribution $p(X)$ to be close to the uniform distribution $q(X = k) = \frac{1}{m}$, where the closeness is measured using the KL divergence $\mathrm{KL}(p\|q) = \sum_{k=1}^m\hat{\lambda}_k\log\frac{\hat{\lambda}_k}{1/m} = \frac{\sum_{k=1}^m\lambda_k\log\lambda_k}{\sum_{j=1}^m\lambda_j} - \log\sum_{j=1}^m\lambda_j + \log m$. In this expression, $\sum_{k=1}^m\lambda_k\log\lambda_k$ equals $\mathrm{tr}\big((\frac{1}{d}\mathbf{A}^\top\mathbf{A})\log(\frac{1}{d}\mathbf{A}^\top\mathbf{A})\big)$, where $\log(\cdot)$ denotes the matrix logarithm. To see this, note that $\log(\frac{1}{d}\mathbf{A}^\top\mathbf{A}) = \sum_{k=1}^m\log(\lambda_k)\mathbf{u}_k\mathbf{u}_k^\top$ according to the property of the matrix logarithm. Then $\mathrm{tr}\big((\frac{1}{d}\mathbf{A}^\top\mathbf{A})\log(\frac{1}{d}\mathbf{A}^\top\mathbf{A})\big) = \mathrm{tr}\big((\sum_{k=1}^m\lambda_k\mathbf{u}_k\mathbf{u}_k^\top)(\sum_{k=1}^m\log(\lambda_k)\mathbf{u}_k\mathbf{u}_k^\top)\big) = \sum_{k=1}^m\lambda_k\log\lambda_k$. According to the property of the trace, we also have $\mathrm{tr}(\frac{1}{d}\mathbf{A}^\top\mathbf{A}) = \sum_{k=1}^m\lambda_k$. Dropping the constant $\log m$, the KL divergence can thus be turned into a diversity-promoting uniform eigenvalue regularizer (UER):

$$\frac{\mathrm{tr}\big((\frac{1}{d}\mathbf{A}^\top\mathbf{A})\log(\frac{1}{d}\mathbf{A}^\top\mathbf{A})\big)}{\mathrm{tr}(\frac{1}{d}\mathbf{A}^\top\mathbf{A})} - \log\mathrm{tr}\big(\tfrac{1}{d}\mathbf{A}^\top\mathbf{A}\big), \tag{2.1}$$

subject to $\mathbf{A}^\top\mathbf{A}\succ 0$ and $\mathbf{A}^\top\mathbf{1} = \mathbf{0}$.
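The following sketch (illustrative only; the variable names are my own) evaluates the UER of Eq. (2.1) for a centered component matrix, both directly from the eigenvalues of $\frac{1}{d}\mathbf{A}^\top\mathbf{A}$ and via the KL-divergence view; the two agree up to the dropped constant $\log m$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 100, 10
A = rng.standard_normal((d, m))
A -= A.mean(axis=0, keepdims=True)          # enforce A^T 1 = 0

G = (A.T @ A) / d                           # (1/d) A^T A, assumed positive definite
lam = np.linalg.eigvalsh(G)                 # eigenvalues lambda_1..lambda_m (> 0)

# UER of Eq. (2.1): sum_k lam_k log lam_k / sum_k lam_k - log sum_k lam_k
uer = (lam * np.log(lam)).sum() / lam.sum() - np.log(lam.sum())

# Equivalent view: KL(normalized eigenvalues || uniform) minus log m
p = lam / lam.sum()
kl = (p * np.log(p * m)).sum()
assert np.isclose(uer, kl - np.log(m))
print("UER =", uer)                         # minimized (= -log m) when all eigenvalues are equal
```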

We apply the UER to promote diversity. Let $L(\mathbf{A})$ denote the objective function of an ML model; then a UE-regularized ML problem can be defined as

$$\min_{\mathbf{A}}\; L(\mathbf{A}) + \lambda\Big(\frac{\mathrm{tr}\big((\frac{1}{d}\mathbf{A}^\top\mathbf{A})\log(\frac{1}{d}\mathbf{A}^\top\mathbf{A})\big)}{\mathrm{tr}(\frac{1}{d}\mathbf{A}^\top\mathbf{A})} - \log\mathrm{tr}\big(\tfrac{1}{d}\mathbf{A}^\top\mathbf{A}\big)\Big)\quad \text{s.t. } \mathbf{A}^\top\mathbf{1} = \mathbf{0},\; \mathbf{A}^\top\mathbf{A}\succ 0,$$

where $\lambda$ is the regularization parameter.

2.1.2 Case Study: Distance Metric Learning

In this section, we apply the uniform eigenvalue regularizer to promote diversity in a specific model: distance metric learning (DML). Given data pairs labeled as either similar or dissimilar, DML [25, 38, 118] aims to learn a distance metric under which similar pairs are placed close to each other and dissimilar pairs are separated apart. The learned distance can benefit a wide range of tasks, including retrieval, clustering and classification. Following [102], we define the distance metric between $\mathbf{x}, \mathbf{y}\in\mathbb{R}^d$ as $\|\mathbf{A}^\top\mathbf{x} - \mathbf{A}^\top\mathbf{y}\|_2^2$, where $\mathbf{A}\in\mathbb{R}^{d\times m}$ is a parameter matrix whose column vectors are the components. Built upon the DML formulation in [106], a uniform-eigenvalue-regularized DML (UE-DML) problem can be formulated as

$$\min_{\mathbf{A}} \sum_{(\mathbf{x},\mathbf{y})\in S}\|\mathbf{A}^\top\mathbf{x} - \mathbf{A}^\top\mathbf{y}\|_2^2 + \sum_{(\mathbf{x},\mathbf{y})\in D}\max\big(0, 1 - \|\mathbf{A}^\top\mathbf{x} - \mathbf{A}^\top\mathbf{y}\|_2^2\big) + \lambda\Big(\frac{\mathrm{tr}\big((\frac{1}{d}\mathbf{A}^\top\mathbf{A})\log(\frac{1}{d}\mathbf{A}^\top\mathbf{A})\big)}{\mathrm{tr}(\frac{1}{d}\mathbf{A}^\top\mathbf{A})} - \log\mathrm{tr}\big(\tfrac{1}{d}\mathbf{A}^\top\mathbf{A}\big)\Big)$$

subject to $\mathbf{A}^\top\mathbf{1} = \mathbf{0}$, $\mathbf{A}^\top\mathbf{A}\succ 0$, where $S$ and $D$ are the sets of similar and dissimilar pairs respectively. The first and second terms in the objective function encourage similar pairs to have small distances and dissimilar pairs to have large distances, respectively.

2.1.3 Evaluation

We applied UE-DML to an electronic health record dataset, MIMIC-III [46], and two image datasets, Stanford-Cars [49] and Caltech-UCSD-Birds [103]. Two samples are labeled as similar if they belong to the same class and dissimilar otherwise. The learned distance metrics are applied to retrieval, whose performance is evaluated using precision@K. We compare with two sets of regularizers: (1) diversity-promoting regularizers based on the determinant of covariance (DC) [65], cosine similarity (CS) [123], determinantal point process (DPP) [50, 129], InCoherence (IC) [8], mutual angles (MA) [107], and decorrelation (DeCov) [22]; (2) regularizers designed for other purposes, including the L2 norm for small norms, the L1 norm for sparsity, low-rankness [80] and Dropout [87].

Table 2.1 shows the retrieval precision (K = 10) on the three datasets, where we observe: (1) UE-DML achieves much better precision than DML, showing that UER is an effective regularizer for improving generalization performance; (2) UER outperforms the other diversity-promoting regularizers, possibly due to its capability to capture global relations among all components and its insensitivity to vector scaling; (3) diversity-promoting regularizers perform better than other types of regularizers such as L2, L1, low-rank, and Dropout, demonstrating the efficacy of inducing diversity.

Table 2.2 shows the number of components used to achieve the precision in Table 2.1. Compared with DML, UE-DML uses much fewer components to achieve better precision.

Table 2.1: Retrieval precision@10 (%) of DML and the regularized variants (L2, L1, LowRank, Dropout, DC, CS, DPP, IC, MA, DeCov, UE) on MIMIC, Cars and Birds. The Cars dataset has a single train/test split, hence the standard error is 0.

For example, on the Cars dataset, UE-DML achieves a 58.2% precision with 100 components. In contrast, with more components (300), DML achieves a much lower precision (53.1%). This demonstrates that by encouraging the components to be diverse, UER is able to reduce model size without sacrificing modeling power. UER encourages equal importance among components such that each component plays a significant role in modeling data. As a result, it suffices to use a small number of components to achieve large modeling power. Compared with other diversity-promoting regularizers, UER achieves better precision with fewer components, demonstrating its ability to better promote diversity.

Next, we verify whether diversifying the components in DML can better capture infrequent patterns. In the MIMIC-III dataset, we treat diseases as patterns and consider a disease frequent if more than 1000 hospital admissions are diagnosed with it, and infrequent otherwise. Table 2.3 shows the retrieval precision on frequent and infrequent diseases. As can be seen, compared with the baselines, UE-DML achieves more improvement on infrequent diseases than on frequent diseases. This indicates that by encouraging the components to diversely spread out, UER is able to better capture infrequent patterns (diseases in this case) without compromising the performance on frequent patterns. On infrequent diseases, UE-DML outperforms the other diversity-promoting methods, showing the advantage of UER over other diversity-promoting regularizers.

2.2 Convex Diversity-Promoting Regularizers

The UE regularizer is nonconvex and is difficult to convexify. As a result, UE-regularized ML problems are nonconvex, and achieving the global optimum is NP-hard. In this section, we design new diversity-promoting regularizers that make convex relaxation easy. We begin by defining nonconvex regularizers based on Bregman matrix divergence, then discuss how to convexify them.

Table 2.2: The optimal number of components for each method on MIMIC, Cars and Birds, and their average.

2.2.1 Nonconvex Bregman Matrix Divergence Regularizers

We first introduce another measure of diversity, near-orthogonality [115]: the component vectors are deemed more diverse if they are closer to being orthogonal. To encourage near-orthogonality between two vectors $\mathbf{a}_i$ and $\mathbf{a}_j$, one way is to make their inner product $\mathbf{a}_i^\top\mathbf{a}_j$ close to zero and their $\ell_2$ norms $\|\mathbf{a}_i\|_2$, $\|\mathbf{a}_j\|_2$ close to one. For a set of vectors $\{\mathbf{a}_i\}_{i=1}^m$, near-orthogonality can be achieved in the following manner: compute the Gram matrix $\mathbf{G}$ where $G_{ij} = \mathbf{a}_i^\top\mathbf{a}_j$, then encourage $\mathbf{G}$ to be close to the identity matrix $\mathbf{I}$. Off the diagonal, the entries of $\mathbf{G}$ and $\mathbf{I}$ are $\mathbf{a}_i^\top\mathbf{a}_j$ and zero respectively; on the diagonal, they are $\|\mathbf{a}_i\|_2^2$ and one respectively. Making $\mathbf{G}$ close to $\mathbf{I}$ effectively encourages $\mathbf{a}_i^\top\mathbf{a}_j$ to be close to zero and $\|\mathbf{a}_i\|_2$ close to one, which therefore encourages $\mathbf{a}_i$ and $\mathbf{a}_j$ to be close to orthogonal.

We use Bregman matrix divergence (BMD) [52] to measure the closeness between two matrices. Let $\mathbb{S}^n$ denote the real symmetric $n\times n$ matrices. Given a strictly convex, differentiable function $\phi:\mathbb{S}^n\to\mathbb{R}$, the BMD is defined as $D_\phi(\mathbf{X},\mathbf{Y}) = \phi(\mathbf{X}) - \phi(\mathbf{Y}) - \mathrm{tr}\big((\nabla\phi(\mathbf{Y}))^\top(\mathbf{X}-\mathbf{Y})\big)$, where $\mathrm{tr}(\mathbf{A})$ denotes the trace of matrix $\mathbf{A}$. Different choices of $\phi(\mathbf{X})$ lead to different divergences. When $\phi(\mathbf{X}) = \|\mathbf{X}\|_F^2$, the BMD specializes to the squared Frobenius norm (SFN) $\|\mathbf{X}-\mathbf{Y}\|_F^2$. If $\phi(\mathbf{X}) = \mathrm{tr}(\mathbf{X}\log\mathbf{X} - \mathbf{X})$, where $\log\mathbf{X}$ denotes the matrix logarithm of $\mathbf{X}$, the divergence becomes $D_{vn}(\mathbf{X},\mathbf{Y}) = \mathrm{tr}(\mathbf{X}\log\mathbf{X} - \mathbf{X}\log\mathbf{Y} - \mathbf{X} + \mathbf{Y})$, which is referred to as the von Neumann divergence (VND) [91]. If $\phi(\mathbf{X}) = -\log\det\mathbf{X}$, where $\det(\mathbf{X})$ denotes the determinant of $\mathbf{X}$, we get the log-determinant divergence (LDD) [52]: $D_{ld}(\mathbf{X},\mathbf{Y}) = \mathrm{tr}(\mathbf{X}\mathbf{Y}^{-1}) - \log\det(\mathbf{X}\mathbf{Y}^{-1}) - n$.

To encourage near-orthogonality among the components, we encourage the BMD between their Gram matrix $\mathbf{A}\mathbf{A}^\top$ and the identity matrix $\mathbf{I}$ to be small, which results in a family of BMD regularizers: $\Omega_\phi(\mathbf{A}) = D_\phi(\mathbf{A}\mathbf{A}^\top,\mathbf{I})$. $\Omega_\phi(\mathbf{A})$ specializes to different instances according to the choice of $D_\phi(\cdot,\cdot)$. Under SFN, $\Omega_\phi(\mathbf{A})$ becomes $\Omega_{Fro}(\mathbf{A}) = \|\mathbf{A}\mathbf{A}^\top - \mathbf{I}\|_F^2$. Under VND, $\Omega_\phi(\mathbf{A})$ becomes $\Omega_{vn}(\mathbf{A}) = \mathrm{tr}\big(\mathbf{A}\mathbf{A}^\top\log(\mathbf{A}\mathbf{A}^\top) - \mathbf{A}\mathbf{A}^\top\big) + m$. Under LDD, $\Omega_\phi(\mathbf{A})$ becomes $\Omega_{ld}(\mathbf{A}) = \mathrm{tr}(\mathbf{A}\mathbf{A}^\top) - \log\det(\mathbf{A}\mathbf{A}^\top) - m$.
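As a rough illustration (my own variable names; not the thesis code), the sketch below evaluates the three nonconvex regularizers $\Omega_{Fro}$, $\Omega_{vn}$ and $\Omega_{ld}$ for a random component matrix, using the eigenvalues of the Gram matrix $\mathbf{A}\mathbf{A}^\top$ so that the matrix logarithm and log-determinant reduce to scalar operations.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 10, 100                        # m components of dimension d (rows of A)
A = rng.standard_normal((m, d)) / np.sqrt(d)    # roughly unit-norm, nearly orthogonal rows

G = A @ A.T                           # Gram matrix of the components (m x m)
lam = np.linalg.eigvalsh(G)           # its eigenvalues (positive if A is full rank)

sfn = np.sum((G - np.eye(m)) ** 2)                 # Omega_Fro = ||AA^T - I||_F^2
vnd = np.sum(lam * np.log(lam) - lam) + m          # Omega_vn  = tr(G log G - G) + m
ldd = np.sum(lam) - np.sum(np.log(lam)) - m        # Omega_ld  = tr(G) - logdet(G) - m

print(f"SFN={sfn:.4f}  VND={vnd:.4f}  LDD={ldd:.4f}")   # each equals 0 iff AA^T = I
```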

Table 2.3: Retrieval precision@10 (%) on frequent and infrequent diseases of the MIMIC-III dataset.

Applying these regularizers to distance metric learning (DML), we define the following BMD-regularized DML (BMD-DML) problem:

$$\min_{\mathbf{A}}\;\frac{1}{|S|}\sum_{(\mathbf{x},\mathbf{y})\in S}\|\mathbf{A}\mathbf{x} - \mathbf{A}\mathbf{y}\|_2^2 + \frac{1}{|D|}\sum_{(\mathbf{x},\mathbf{y})\in D}\max\big(0, 1 - \|\mathbf{A}\mathbf{x} - \mathbf{A}\mathbf{y}\|_2^2\big) + \lambda\,\Omega_\phi(\mathbf{A}),$$

which is nonconvex.

2.2.2 Convex Bregman Matrix Divergence Regularizers

Next, we discuss how to relax the nonconvex BMD regularizers into convex functions. The relaxations are based on properties of eigenvalues. Given a full-rank matrix $\mathbf{A}\in\mathbb{R}^{m\times d}$ ($m < d$), we know that $\mathbf{A}\mathbf{A}^\top\in\mathbb{R}^{m\times m}$ is a full-rank matrix with $m$ positive eigenvalues (denoted by $\lambda_1,\ldots,\lambda_m$) and $\mathbf{A}^\top\mathbf{A}\in\mathbb{R}^{d\times d}$ is a rank-deficient matrix with $d-m$ zero eigenvalues and $m$ positive eigenvalues equal to $\lambda_1,\ldots,\lambda_m$. For a general positive definite matrix $\mathbf{Z}\in\mathbb{R}^{m\times m}$ whose eigenvalues are $\gamma_1,\ldots,\gamma_m$, we have $\|\mathbf{Z}\|_F^2 = \sum_{j=1}^m\gamma_j^2$, $\mathrm{tr}(\mathbf{Z}) = \sum_{j=1}^m\gamma_j$ and $\log\det\mathbf{Z} = \sum_{j=1}^m\log\gamma_j$. Next, we leverage these facts to seek convex approximations of the BMD regularizers.

Convex SFN Regularizer. The eigenvalues of $\mathbf{A}\mathbf{A}^\top - \mathbf{I}_m$ are $\lambda_1-1,\ldots,\lambda_m-1$ and those of $\mathbf{A}^\top\mathbf{A} - \mathbf{I}_d$ are $\lambda_1-1,\ldots,\lambda_m-1,-1,\ldots,-1$. According to the fact $\|\mathbf{Z}\|_F^2 = \sum_j\gamma_j^2$, we have $\|\mathbf{A}^\top\mathbf{A} - \mathbf{I}_d\|_F^2 = \sum_{j=1}^m(\lambda_j-1)^2 + \sum_{j=m+1}^d(-1)^2 = \|\mathbf{A}\mathbf{A}^\top - \mathbf{I}_m\|_F^2 + d - m$. Let $\mathbf{M}$ denote $\mathbf{A}^\top\mathbf{A}$; then the SFN regularizer $\|\mathbf{A}\mathbf{A}^\top - \mathbf{I}_m\|_F^2$ equals $\|\mathbf{A}^\top\mathbf{A} - \mathbf{I}_d\|_F^2 - d + m = \|\mathbf{M} - \mathbf{I}_d\|_F^2 - d + m$, where $m = \mathrm{rank}(\mathbf{A}^\top\mathbf{A}) = \mathrm{rank}(\mathbf{M})$. It is well known that the trace norm of a matrix is a convex envelope of its rank [86]. We use $\mathrm{tr}(\mathbf{M})$ to approximate $\mathrm{rank}(\mathbf{M})$ and get $\|\mathbf{A}\mathbf{A}^\top - \mathbf{I}_m\|_F^2 \approx \|\mathbf{M} - \mathbf{I}_d\|_F^2 + \mathrm{tr}(\mathbf{M}) - d$, where the right-hand side is a convex function. Dropping the constant, we get the convex SFN (CSFN) regularizer defined over $\mathbf{M}$:

$$\hat{\Omega}_{Fro}(\mathbf{M}) = \|\mathbf{M} - \mathbf{I}_d\|_F^2 + \mathrm{tr}(\mathbf{M}). \tag{2.2}$$

Convex VND Regularizer. Given $\mathbf{A}\mathbf{A}^\top = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top$ where $\Lambda_{jj} = \lambda_j$, according to the property of the matrix logarithm we have $\log(\mathbf{A}\mathbf{A}^\top) = \mathbf{U}\hat{\boldsymbol{\Lambda}}\mathbf{U}^\top$ where $\hat{\Lambda}_{jj} = \log\Lambda_{jj}$. Then $(\mathbf{A}\mathbf{A}^\top)\log(\mathbf{A}\mathbf{A}^\top) - (\mathbf{A}\mathbf{A}^\top) = \mathbf{U}(\boldsymbol{\Lambda}\hat{\boldsymbol{\Lambda}} - \boldsymbol{\Lambda})\mathbf{U}^\top$, whose eigenvalues are $\{\Lambda_{jj}\log\Lambda_{jj} - \Lambda_{jj}\}_{j=1}^m$. It follows that $\Omega_{vn}(\mathbf{A}) = \sum_{j=1}^m(\Lambda_{jj}\log\Lambda_{jj} - \Lambda_{jj}) + m$. Now consider the matrix $\mathbf{A}^\top\mathbf{A} + \epsilon\mathbf{I}_d$, where $\epsilon > 0$ is a small scalar. Using a similar calculation, we have $D_{vn}(\mathbf{A}^\top\mathbf{A} + \epsilon\mathbf{I}_d, \mathbf{I}_d) = \sum_{j=1}^m\big((\lambda_j+\epsilon)\log(\lambda_j+\epsilon) - (\lambda_j+\epsilon)\big) + (d-m)(\epsilon\log\epsilon - \epsilon) + d$. Performing certain algebra, we get $\Omega_{vn}(\mathbf{A}) \approx D_{vn}(\mathbf{A}^\top\mathbf{A} + \epsilon\mathbf{I}_d, \mathbf{I}_d) + m - d$. Replacing $\mathbf{A}^\top\mathbf{A}$ with $\mathbf{M}$, approximating $m$ with $\mathrm{tr}(\mathbf{M})$ and dropping the constant $d$, we get the convex VND (CVND) regularizer:

$$\hat{\Omega}_{vn}(\mathbf{M}) = D_{vn}(\mathbf{M} + \epsilon\mathbf{I}_d, \mathbf{I}_d) + \mathrm{tr}(\mathbf{M}), \tag{2.3}$$

which equals $\mathrm{tr}\big((\mathbf{M} + \epsilon\mathbf{I}_d)\log(\mathbf{M} + \epsilon\mathbf{I}_d)\big)$ up to an additive constant; its convexity is shown in [71].

Convex LDD Regularizer. Since $\mathrm{tr}(\mathbf{M}) = \sum_{j=1}^m\lambda_j$ and $\log\det\mathbf{M} = \sum_{j=1}^m\log\lambda_j$, we have $\Omega_{ld}(\mathbf{A}) = \sum_{j=1}^m\lambda_j - \sum_{j=1}^m\log\lambda_j - m$ and $D_{ld}(\mathbf{A}^\top\mathbf{A} + \epsilon\mathbf{I}_d, \mathbf{I}_d) = \sum_{j=1}^m\lambda_j + d\epsilon - (d-m)\log\epsilon - \sum_{j=1}^m\log(\lambda_j+\epsilon) - d$. Certain algebra over these two equations shows that, up to an additive constant, $\Omega_{ld}(\mathbf{A}) \approx D_{ld}(\mathbf{A}^\top\mathbf{A} + \epsilon\mathbf{I}_d, \mathbf{I}_d) - (1 + \log\epsilon)m + d\log\epsilon$. Replacing $\mathbf{A}^\top\mathbf{A}$ with $\mathbf{M}$, approximating $m$ with $\mathrm{tr}(\mathbf{M})$ and discarding constants, we obtain the convex LDD (CLDD) regularizer:

$$\hat{\Omega}_{ld}(\mathbf{M}) = D_{ld}(\mathbf{M} + \epsilon\mathbf{I}_d, \mathbf{I}_d) - (1 + \log\epsilon)\,\mathrm{tr}(\mathbf{M}) = -\log\det(\mathbf{M} + \epsilon\mathbf{I}_d) + \big(\log\tfrac{1}{\epsilon}\big)\mathrm{tr}(\mathbf{M}) + \text{const}, \tag{2.4}$$

where the convexity of $-\log\det(\mathbf{M} + \epsilon\mathbf{I}_d)$ is proved in [17]. In [25, 78], an information-theoretic regularizer based on the log-determinant divergence, $D_{ld}(\mathbf{M},\mathbf{I}) = -\log\det(\mathbf{M}) + \mathrm{tr}(\mathbf{M})$, is applied to encourage the Mahalanobis matrix to be close to the identity matrix. That regularizer requires $\mathbf{M}$ to be full rank, while our convex LDD regularizer encourages $\mathbf{M}$ to be low-rank by associating a large weight $\log\frac{1}{\epsilon}$ with the trace norm $\mathrm{tr}(\mathbf{M})$. Since $\mathbf{M} = \mathbf{A}^\top\mathbf{A}$, reducing the rank of $\mathbf{M}$ effectively reduces the number of projection vectors in $\mathbf{A}$.

DML with Convex BMD Regularization. Given these convex BMD (CBMD) regularizers (denoted uniformly by $\hat{\Omega}_\phi(\mathbf{M})$), we can relax the nonconvex BMD-DML problems into convex CBMD-DML formulations by replacing $\|\mathbf{A}\mathbf{x} - \mathbf{A}\mathbf{y}\|_2^2$ with $(\mathbf{x}-\mathbf{y})^\top\mathbf{M}(\mathbf{x}-\mathbf{y})$ and replacing the nonconvex BMD regularizers $\Omega_\phi(\mathbf{A})$ with $\hat{\Omega}_\phi(\mathbf{M})$:

$$\min_{\mathbf{M}\succeq 0}\;\frac{1}{|S|}\sum_{(\mathbf{x},\mathbf{y})\in S}(\mathbf{x}-\mathbf{y})^\top\mathbf{M}(\mathbf{x}-\mathbf{y}) + \frac{1}{|D|}\sum_{(\mathbf{x},\mathbf{y})\in D}\max\big(0, 1 - (\mathbf{x}-\mathbf{y})^\top\mathbf{M}(\mathbf{x}-\mathbf{y})\big) + \lambda\,\hat{\Omega}_\phi(\mathbf{M}).$$

This convex problem facilitates optimization: the global optimum is guaranteed to be achievable.

2.2.3 A Proximal Gradient Descent Algorithm

We use a stochastic proximal subgradient descent algorithm [75] to solve the CBMD-DML problems. The algorithm iteratively performs the following steps until convergence: (1) a stochastic subgradient descent step $\hat{\mathbf{M}} = \mathbf{M} - \eta\triangledown\mathbf{M}$, where $\triangledown\mathbf{M}$ is a stochastic subgradient of the data-dependent loss; (2) a proximal operation. The proximal operator associated with the regularizer $\hat{\Omega}_\phi(\mathbf{M})$ is derived by minimizing $\frac{1}{2\eta}\|\mathbf{M} - \hat{\mathbf{M}}\|_F^2 + \lambda\hat{\Omega}_\phi(\mathbf{M})$ subject to $\mathbf{M}\succeq 0$.
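For concreteness, here is a small sketch (my own naming, not the thesis code) that evaluates the three convex regularizers $\hat{\Omega}_{Fro}$, $\hat{\Omega}_{vn}$ and $\hat{\Omega}_{ld}$ on a positive semidefinite matrix $\mathbf{M}$, dropping the additive constants noted above and using eigenvalues so that the matrix logarithm and log-determinant reduce to scalar operations.

```python
import numpy as np

def convex_bmd_regularizers(M, eps=1e-3):
    """Return (CSFN, CVND, CLDD) for a symmetric PSD matrix M (constants dropped)."""
    d = M.shape[0]
    lam = np.linalg.eigvalsh(M)                          # eigenvalues of M (>= 0)
    csfn = np.sum((M - np.eye(d)) ** 2) + lam.sum()      # ||M - I||_F^2 + tr(M)
    cvnd = np.sum((lam + eps) * np.log(lam + eps))       # tr((M + eps I) log(M + eps I))
    cldd = -np.sum(np.log(lam + eps)) + np.log(1.0 / eps) * lam.sum()
    return csfn, cvnd, cldd                              # -logdet(M + eps I) + log(1/eps) tr(M)

# Example: M = A^T A for a random wide A, hence rank-deficient in R^{d x d}.
rng = np.random.default_rng(3)
A = rng.standard_normal((10, 50)) / np.sqrt(50)
M = A.T @ A
print(convex_bmd_regularizers(M))
```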

Let $\{\hat{\lambda}_j\}_{j=1}^d$ be the eigenvalues of $\hat{\mathbf{M}}$ and $\{x_j\}_{j=1}^d$ be the eigenvalues of $\mathbf{M}$; then this problem can be equivalently written as

$$\min_{\{x_j\}_{j=1}^d}\;\frac{1}{2\eta}\sum_{j=1}^d(x_j - \hat{\lambda}_j)^2 + \lambda\sum_{j=1}^d h_\phi(x_j)\quad \text{s.t. } x_j \geq 0,\; j = 1,\ldots,d, \tag{2.5}$$

where $h_\phi(x_j)$ is a regularizer-specific scalar function. Further, this problem decomposes into $d$ independent problems: (a) $\min_{x_j} f(x_j) = \frac{1}{2\eta}(x_j - \hat{\lambda}_j)^2 + \lambda h_\phi(x_j)$ subject to $x_j \geq 0$, for $j = 1,\ldots,d$, which can be solved individually.

SFN. For SFN, where $\hat{\Omega}_\phi(\mathbf{M}) = \|\mathbf{M} - \mathbf{I}_d\|_F^2 + \mathrm{tr}(\mathbf{M})$ and $h_{sfn}(x_j) = (x_j - 1)^2 + x_j$, problem (a) is simply a quadratic programming problem. The optimal solution is $x_j^* = \max\big(0, \frac{\hat{\lambda}_j + \eta\lambda}{1 + 2\eta\lambda}\big)$.

VND. For VND, where $\hat{\Omega}_\phi(\mathbf{M}) = \mathrm{tr}\big((\mathbf{M} + \epsilon\mathbf{I}_d)\log(\mathbf{M} + \epsilon\mathbf{I}_d)\big)$ and $h_\phi(x_j) = (x_j + \epsilon)\log(x_j + \epsilon)$, taking the derivative of the objective function $f(x_j)$ in problem (a) w.r.t. $x_j$ and setting it to zero gives $\eta\lambda\log(x_j + \epsilon) + x_j + \eta\lambda - \hat{\lambda}_j = 0$. The root of this equation is $\eta\lambda\,\omega\big(\frac{\epsilon - \eta\lambda + \hat{\lambda}_j}{\eta\lambda} - \log(\eta\lambda)\big) - \epsilon$, where $\omega(\cdot)$ is the Wright omega function [36]. If this root is negative, the optimal $x_j$ is 0; if it is positive, the optimal $x_j$ is either this root or 0, and we pick the one that yields the lower $f(x_j)$. Formally, $x_j^* = \mathrm{argmin}_{x_j\in\mathcal{X}} f(x_j)$, where $\mathcal{X} = \big\{\max\big(\eta\lambda\,\omega\big(\frac{\epsilon - \eta\lambda + \hat{\lambda}_j}{\eta\lambda} - \log(\eta\lambda)\big) - \epsilon,\, 0\big),\, 0\big\}$.

LDD. For LDD, where $\hat{\Omega}_\phi(\mathbf{M}) = -\log\det(\mathbf{M} + \epsilon\mathbf{I}_d) + (\log\frac{1}{\epsilon})\mathrm{tr}(\mathbf{M})$ and $h_\phi(x_j) = -\log(x_j + \epsilon) + x_j\log\frac{1}{\epsilon}$, taking the derivative of $f(x_j)$ w.r.t. $x_j$ and setting it to zero yields a quadratic equation $x_j^2 + a x_j + b = 0$, where $a = \epsilon - \hat{\lambda}_j - \eta\lambda\log\epsilon$ and $b = -\epsilon\hat{\lambda}_j - \eta\lambda(1 + \epsilon\log\epsilon)$. The optimal solution is attained either at a positive root (if any) of this equation or at 0; we pick the value that yields the lowest $f(x_j)$. Formally, $x_j^* = \mathrm{argmin}_{x_j\in\mathcal{X}} f(x_j)$, where $\mathcal{X} = \big\{\max\big(\frac{-a + \sqrt{a^2 - 4b}}{2}, 0\big),\, \max\big(\frac{-a - \sqrt{a^2 - 4b}}{2}, 0\big),\, 0\big\}$.
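To make the proximal step concrete, here is a sketch (illustrative only; the function name and values are my own) of the eigenvalue-wise proximal operator for the CSFN regularizer: eigendecompose $\hat{\mathbf{M}}$, shrink each eigenvalue with the closed form above, clip at zero, and reassemble the matrix.

```python
import numpy as np

def prox_csfn(M_hat, eta, lam):
    """Proximal operator of lam * (||M - I||_F^2 + tr(M)) with step size eta,
    restricted to the PSD cone and solved eigenvalue-wise as in Eq. (2.5)."""
    eigval, eigvec = np.linalg.eigh(M_hat)          # eigenvalues lambda_hat_j of M_hat
    x = np.maximum(0.0, (eigval + eta * lam) / (1.0 + 2.0 * eta * lam))
    return (eigvec * x) @ eigvec.T                  # reassemble M = U diag(x) U^T

# One proximal step on a toy symmetric (possibly indefinite) matrix.
rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
M_hat = B @ B.T - 0.5 * np.eye(5)
M_new = prox_csfn(M_hat, eta=0.1, lam=1.0)
print(np.linalg.eigvalsh(M_new))                    # all eigenvalues >= 0
```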

2.2.4 Evaluation

In this section, we present experimental results on regularized distance metric learning, which demonstrate that, compared with the nonconvex BMD regularizers, the proposed convex regularizers are computationally more efficient and better capture infrequent patterns. We used seven datasets in the experiments: two electronic health record datasets, MIMIC (version III) [46] and EICU (version 1.1) [35]; two text datasets, Reuters and 20-Newsgroups (News); two image datasets, Stanford-Cars (Cars) [49] and Caltech-UCSD-Birds (Birds) [103]; and one sensory dataset, 6-Activities (Act) [6]. The learned distance metrics are applied to retrieval, whose performance is evaluated using AUC [66], the area under the precision-recall curve (the higher, the better).

Table 2.4: Training time (hours) of the nonconvex BMD-DML methods (SFN, VND, LDD) and the convex CBMD-DML methods (CSFN, CVND, CLDD) on the seven datasets.

Table 2.5: Retrieval AUC on all classes (All), infrequent classes (IFC) and frequent classes (FC).

Table 2.4 shows the training time taken by different methods to reach convergence. For the nonconvex BMD-DML methods, we report the total time taken by the following computation: tuning the regularization parameter (4 choices) and the number of component vectors (6 choices) on a two-dimensional grid via 3-fold cross-validation (4 × 6 × 3 = 72 experiments in total); for each of the 72 experiments, the algorithm restarts 5 times, each with a different initialization, and picks the run yielding the lowest objective value. In total, the number of runs is 72 × 5 = 360. For the CBMD-DML methods, there is no need to tune the number of component vectors or to restart; the total number of runs is 4 × 3 = 12. As can be seen from the table, the CBMD-DML methods are much faster than the BMD-DML methods. CBMD-DML requires no multiple restarts or tuning of the number of component vectors, hence greatly reducing the number of experimental runs. For every single run, CBMD-DML is less efficient than BMD-DML due to the overhead of eigendecomposition.

We next verify whether the proposed convex BMD regularizers are able to better capture infrequent patterns. On the three datasets MIMIC, EICU and Reuters, where the classes (treated as patterns) are imbalanced, we label a class as frequent if it contains more than 1000 examples and infrequent otherwise. We measure AUC on all classes (AUC-All), infrequent classes (AUC-IFC) and frequent classes (AUC-FC). Table 2.5 shows the AUC-All, AUC-IFC and AUC-FC on these three datasets, from which we make the following observations. First, on the infrequent classes (IFC) of the three imbalanced datasets, CVND and CLDD achieve better AUCs than the nonconvex BMD regularizers, which demonstrates that the two convex regularizers are better at capturing infrequent patterns, possibly because they achieve the global optimal solution whereas the nonconvex ones only reach a local (hence inferior) optimum. Second, though convex, CSFN performs less well than CVND and CLDD, possibly because the latter two measure divergence in a global manner while CSFN does so in a pairwise fashion. Third, in terms of AUC on all classes (All), CVND and CLDD also outperform the nonconvex regularizers.

2.3 Angular Constraints for Improving Generalization Performance

In this section, we study how diversity-promoting regularization can alleviate overfitting, both theoretically and empirically. In theory, we need to analyze how such regularization affects the generalization error (which is the sum of the estimation and approximation errors) on unseen data. The uniform eigenvalue regularizer and the Bregman matrix divergence regularizers studied in the previous two sections are not amenable to such analysis, especially for the approximation error. In light of this, we propose a new diversity-promoting regularization approach, the angular constraint [110], which is empirically effective and theoretically analyzable.

2.3.1 Angular Constraints

Similar to the BMD regularizers, angular constraints (ACs) use near-orthogonality to characterize diversity and encourage the angles between component vectors to be close to $\pi/2$. The ACs require the absolute value of the cosine similarity between each pair of components to be less than or equal to a small value $\tau$, which leads to the following angle-constrained problem:

$$\min_{\mathcal{A}}\; L(\mathcal{A})\quad \text{s.t. } \frac{|\mathbf{a}_i^\top\mathbf{a}_j|}{\|\mathbf{a}_i\|_2\|\mathbf{a}_j\|_2} \leq \tau,\;\; 1 \leq i < j \leq m, \tag{2.6}$$

where $\mathcal{A} = \{\mathbf{a}_i\}_{i=1}^m$ denotes the component vectors and $L(\mathcal{A})$ is the objective function of the problem. The parameter $\tau$ controls the level of near-orthogonality (or diversity): a smaller $\tau$ indicates that the vectors are closer to being orthogonal, and hence more diverse. As will be shown later, representing diversity using angular constraints facilitates theoretical analysis and is empirically effective as well.

Case Study: Neural Networks. In a neural network (NN) with $L$ hidden layers, each hidden layer $l$ is equipped with $m^{(l)}$ units and each unit $i$ is connected with all units in layer $l-1$. Hidden unit $i$ at layer $l$ is parameterized by a weight vector $\mathbf{a}_i^{(l)}$. These hidden units aim at capturing the latent features underlying the data. Applying ACs to the weight vectors of the hidden units, we obtain the AC-NN problem.

2.3.2 An ADMM-based Algorithm

In this section, we develop an ADMM-based algorithm to solve the AC-regularized problem. To make it amenable to optimization, we first factorize each weight vector $\mathbf{a}$ into its $\ell_2$ norm $g = \|\mathbf{a}\|_2$ and direction $\tilde{\mathbf{a}} = \mathbf{a}/\|\mathbf{a}\|_2$. Under such a factorization, $\mathbf{a}$ can be reparameterized as $\mathbf{a} = g\tilde{\mathbf{a}}$, where $g > 0$ and $\|\tilde{\mathbf{a}}\|_2 = 1$. Then the problem defined in Eq.(2.6) can be transformed into: (a) $\min_{\tilde{\mathcal{A}},\mathbf{G}} L(\tilde{\mathcal{A}},\mathbf{G})$ subject to $g_i \geq 0$ for all $i$, $\|\tilde{\mathbf{a}}_j\|_2 = 1$ for all $j$, and $|\tilde{\mathbf{a}}_i^\top\tilde{\mathbf{a}}_j| \leq \tau$ for all $i \neq j$, where $\tilde{\mathcal{A}} = \{\tilde{\mathbf{a}}_j\}_{j=1}^m$ and $\mathbf{G} = \{g_j\}_{j=1}^m$. We solve this new problem by alternating between $\tilde{\mathcal{A}}$ and $\mathbf{G}$. Fixing $\tilde{\mathcal{A}}$, the problem defined over $\mathbf{G}$ can be solved using projected gradient descent. Fixing $\mathbf{G}$, the subproblem defined over $\tilde{\mathcal{A}}$ can be solved using an ADMM algorithm.
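As a minimal illustration (my own code, not the thesis implementation), the following sketch computes the pairwise absolute cosine similarities of a set of component vectors and checks whether the angular constraints of Eq. (2.6) hold for a given $\tau$.

```python
import numpy as np

def violates_angular_constraints(A, tau):
    """A: (m, d) array whose rows are component vectors a_1..a_m.
    Returns True if any pair i < j has |cos(a_i, a_j)| > tau."""
    A_dir = A / np.linalg.norm(A, axis=1, keepdims=True)   # unit directions a_tilde
    cos = A_dir @ A_dir.T                                   # pairwise cosine similarities
    iu = np.triu_indices(A.shape[0], k=1)                   # indices of pairs i < j
    return bool(np.any(np.abs(cos[iu]) > tau))

rng = np.random.default_rng(5)
W = rng.standard_normal((8, 64))            # e.g., weight vectors of 8 hidden units
print(violates_angular_constraints(W, tau=0.3))
```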

There are $R = \frac{m(m-1)}{2}$ pairwise constraints $|\tilde{\mathbf{a}}_i^\top\tilde{\mathbf{a}}_j| \leq \tau$. For the $r$-th constraint, let $p(r)$ and $q(r)$ be the indices of the first and second vector respectively, i.e., the $r$-th constraint is $|\tilde{\mathbf{a}}_{p(r)}^\top\tilde{\mathbf{a}}_{q(r)}| \leq \tau$. First, we introduce auxiliary variables $\{\mathbf{v}_1^{(r)}\}_{r=1}^R$ and $\{\mathbf{v}_2^{(r)}\}_{r=1}^R$ to rewrite problem (a) into an equivalent form. For each pairwise constraint $|\tilde{\mathbf{a}}_{p(r)}^\top\tilde{\mathbf{a}}_{q(r)}| \leq \tau$, we introduce two auxiliary vectors $\mathbf{v}_1^{(r)}$ and $\mathbf{v}_2^{(r)}$ and let $\tilde{\mathbf{a}}_{p(r)} = \mathbf{v}_1^{(r)}$, $\tilde{\mathbf{a}}_{q(r)} = \mathbf{v}_2^{(r)}$, $\|\mathbf{v}_1^{(r)}\|_2 = 1$, $\|\mathbf{v}_2^{(r)}\|_2 = 1$, $|\mathbf{v}_1^{(r)\top}\mathbf{v}_2^{(r)}| \leq \tau$. We thus obtain the following problem: $\min_{\tilde{\mathcal{A}},\mathcal{V}} L(\tilde{\mathcal{A}})$, subject to $\|\tilde{\mathbf{a}}_j\|_2 = 1$ for all $j$; $\tilde{\mathbf{a}}_{p(r)} = \mathbf{v}_1^{(r)}$ and $\tilde{\mathbf{a}}_{q(r)} = \mathbf{v}_2^{(r)}$ for all $r$; and $\|\mathbf{v}_1^{(r)}\|_2 = 1$, $\|\mathbf{v}_2^{(r)}\|_2 = 1$, $|\mathbf{v}_1^{(r)\top}\mathbf{v}_2^{(r)}| \leq \tau$ for all $r$, where $\mathcal{V} = \{(\mathbf{v}_1^{(r)}, \mathbf{v}_2^{(r)})\}_{r=1}^R$. Then we define the augmented Lagrangian, with Lagrange multipliers $\mathcal{Y} = \{(\mathbf{y}_1^{(r)}, \mathbf{y}_2^{(r)})\}_{r=1}^R$ and parameter $\rho$:

$$L(\tilde{\mathcal{A}}) + \sum_{r=1}^{R}\Big(\mathbf{y}_1^{(r)\top}(\tilde{\mathbf{a}}_{p(r)} - \mathbf{v}_1^{(r)}) + \mathbf{y}_2^{(r)\top}(\tilde{\mathbf{a}}_{q(r)} - \mathbf{v}_2^{(r)}) + \frac{\rho}{2}\|\tilde{\mathbf{a}}_{p(r)} - \mathbf{v}_1^{(r)}\|_2^2 + \frac{\rho}{2}\|\tilde{\mathbf{a}}_{q(r)} - \mathbf{v}_2^{(r)}\|_2^2\Big),$$

subject to $\|\tilde{\mathbf{a}}_j\|_2 = 1$ for all $j$ and $\|\mathbf{v}_1^{(r)}\|_2 = 1$, $\|\mathbf{v}_2^{(r)}\|_2 = 1$, $|\mathbf{v}_1^{(r)\top}\mathbf{v}_2^{(r)}| \leq \tau$ for all $r$, which can be solved by alternating among $\tilde{\mathcal{A}}$, $\mathcal{V}$ and $\mathcal{Y}$.

Solve $\tilde{\mathcal{A}}$. The subproblem defined over $\tilde{\mathcal{A}}$ can be solved using projected gradient descent.

Solve $\mathbf{v}_1^{(r)}$, $\mathbf{v}_2^{(r)}$. We minimize $-\mathbf{y}_1^{(r)\top}\mathbf{v}_1^{(r)} - \mathbf{y}_2^{(r)\top}\mathbf{v}_2^{(r)} + \frac{\rho}{2}\|\tilde{\mathbf{a}}_{p(r)} - \mathbf{v}_1^{(r)}\|_2^2 + \frac{\rho}{2}\|\tilde{\mathbf{a}}_{q(r)} - \mathbf{v}_2^{(r)}\|_2^2$ under the constraints $\|\mathbf{v}_1^{(r)}\|_2 = 1$, $\|\mathbf{v}_2^{(r)}\|_2 = 1$, $\mathbf{v}_1^{(r)\top}\mathbf{v}_2^{(r)} \leq \tau$ and $-\mathbf{v}_1^{(r)\top}\mathbf{v}_2^{(r)} \leq \tau$. Let $\gamma_1$, $\gamma_2$, $\lambda_1 \geq 0$, $\lambda_2 \geq 0$ be the KKT multipliers associated with the four constraints in this subproblem. According to the KKT conditions, we have

$$-\mathbf{y}_1^{(r)} + \rho(\mathbf{v}_1^{(r)} - \tilde{\mathbf{a}}_{p(r)}) + 2\gamma_1\mathbf{v}_1^{(r)} + (\lambda_1 - \lambda_2)\mathbf{v}_2^{(r)} = \mathbf{0}, \tag{2.7}$$
$$-\mathbf{y}_2^{(r)} + \rho(\mathbf{v}_2^{(r)} - \tilde{\mathbf{a}}_{q(r)}) + 2\gamma_2\mathbf{v}_2^{(r)} + (\lambda_1 - \lambda_2)\mathbf{v}_1^{(r)} = \mathbf{0}. \tag{2.8}$$

We solve these equations by examining four cases: (1) $\lambda_1 = 0, \lambda_2 = 0$; (2) $\lambda_1 > 0, \lambda_2 = 0$; (3) $\lambda_1 = 0, \lambda_2 > 0$; (4) $\lambda_1 > 0, \lambda_2 > 0$. We discuss how to handle case (2) here. Since $\lambda_1 > 0$ and $\lambda_2 = 0$, we have (a) $(\rho + 2\gamma_1)\mathbf{v}_1^{(r)} + \lambda_1\mathbf{v}_2^{(r)} = \mathbf{y}_1^{(r)} + \rho\tilde{\mathbf{a}}_{p(r)}$ and (b) $(\rho + 2\gamma_2)\mathbf{v}_2^{(r)} + \lambda_1\mathbf{v}_1^{(r)} = \mathbf{y}_2^{(r)} + \rho\tilde{\mathbf{a}}_{q(r)}$. According to the complementary slackness condition, we know $\mathbf{v}_1^{(r)\top}\mathbf{v}_2^{(r)} = \tau$. Taking the squared $\ell_2$ norm of the vectors on both sides of equation (a), we get (c) $(\rho + 2\gamma_1)^2 + \lambda_1^2 + 2(\rho + 2\gamma_1)\lambda_1\tau = \|\mathbf{y}_1^{(r)} + \rho\tilde{\mathbf{a}}_{p(r)}\|_2^2$. Similarly, from equation (b) we get (d) $(\rho + 2\gamma_2)^2 + \lambda_1^2 + 2(\rho + 2\gamma_2)\lambda_1\tau = \|\mathbf{y}_2^{(r)} + \rho\tilde{\mathbf{a}}_{q(r)}\|_2^2$. Taking the inner product of the two vectors on the left-hand sides of equations (a) and (b), and of those on the right-hand sides, we get (e) $(2\rho + 2\gamma_1 + 2\gamma_2)\lambda_1 + \big((\rho + 2\gamma_1)(\rho + 2\gamma_2) + \lambda_1^2\big)\tau = (\mathbf{y}_1^{(r)} + \rho\tilde{\mathbf{a}}_{p(r)})^\top(\mathbf{y}_2^{(r)} + \rho\tilde{\mathbf{a}}_{q(r)})$. Solving the system of equations (c)-(e), we obtain the optimal values of $\gamma_1$, $\gamma_2$ and $\lambda_1$. Plugging them into equations (a) and (b), we obtain a solution for $\mathbf{v}_1^{(r)}$ and $\mathbf{v}_2^{(r)}$. Then we check whether this solution satisfies the remaining constraints. If so, this is an optimal solution.
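For intuition, case (1) ($\lambda_1 = \lambda_2 = 0$, i.e., the angle constraint is inactive) has a simple closed form: each auxiliary vector is the normalization of $\mathbf{y} + \rho\tilde{\mathbf{a}}$, and one only needs to check the angle constraint afterwards. A small sketch under that assumption (my own naming):

```python
import numpy as np

def solve_v_case1(y1, y2, a_p, a_q, rho, tau):
    """Case (1) of the v-subproblem: ignore the angle constraint, solve the
    norm-constrained problems in closed form, then check feasibility."""
    v1 = y1 + rho * a_p                      # maximizing (y1 + rho a_p)^T v1 on the unit sphere
    v1 /= np.linalg.norm(v1)
    v2 = y2 + rho * a_q
    v2 /= np.linalg.norm(v2)
    feasible = abs(v1 @ v2) <= tau           # if True, (v1, v2) is optimal for this case
    return v1, v2, feasible

rng = np.random.default_rng(6)
a_p = rng.standard_normal(16); a_p /= np.linalg.norm(a_p)
a_q = rng.standard_normal(16); a_q /= np.linalg.norm(a_q)
y1, y2 = rng.standard_normal(16), rng.standard_normal(16)
print(solve_v_case1(y1, y2, a_p, a_q, rho=1.0, tau=0.2)[2])
```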

2.3.3 Evaluation

We evaluate the ACs on two types of neural networks.

Feedforward NN for phone recognition. The NN architecture follows that in the Kaldi [77] toolkit. The experiments were conducted on the TIMIT dataset. We compared with four diversity-promoting regularizers: CS [123], IC [8], MA [107] and DeCorrelation (DC) [22]. Table 2.6 shows the phone error rate (PER) on the TIMIT core test set. Without regularization (No-Reg), the error is 18.53%. With AC, the error is reduced to 18.41%. AC outperforms the other regularizers.

Table 2.6: Phone error rate (%) of No-Reg, CS, IC, MA, DC and AC on the TIMIT core test set.

Convolutional NN for image classification. The experiments were performed on the CIFAR-10 dataset. We used the wide residual network [124] architecture. We compared with CS [123], IC [8], MA [107], DC [65] and an orthogonality-promoting (OP) regularizer [81]. Table 2.7 shows the classification error on the test set. Compared with no regularization (No-Reg), which achieves an error of 3.89%, applying AC reduces the error to 3.63%. AC achieves a lower error than the other regularizers.

Table 2.7: Classification error (%) on the CIFAR-10 test set.
  Regularization   Error
  No-Reg           3.89
  CS               3.81
  IC               3.85
  MA               3.68
  DC               3.77
  OP               3.69
  AC               3.63

Chapter 3  Diversity-Promoting Learning II: Bayesian Inference

In the last chapter, we studied diversity-promoting learning under a frequentist-style regularization framework, where the component vectors are learned via point estimation [98]. In this chapter, we study how to promote diversity under an alternative learning paradigm, Bayesian inference [13, 44, 69], where the components are considered as random variables for which a posterior distribution shall be computed from data under certain priors. Compared with point estimation, Bayesian learning offers complementary benefits. First, it offers a model-averaging [13, 44] effect when ML models are used for decision-making and prediction, because the parameters are integrated out under the posterior distribution, which potentially alleviates overfitting on training data. Second, it provides a natural way to quantify the uncertainty of model parameters, and of the downstream decisions and predictions made thereupon [13, 44, 69].

Affandi et al. [5] investigated the diversification of Bayesian models using the determinantal point process (DPP) [50] prior. The DPP has two drawbacks. First, it is not applicable to Bayesian nonparametric models, where the number of components is infinite. Second, it is not amenable to developing variational-inference-based [94] posterior inference algorithms. To address these issues, we develop a mutual angular process [109] that encourages the component vectors to have large mutual angles and extend it to an infinite mutual angular process [117] that encourages infinitely many components to be diverse. These stochastic processes facilitate the development of both variational inference and Markov chain Monte Carlo sampling algorithms.

3.1 Diversity-Promoting Learning of Bayesian Parametric Models

We start with promoting diversity in Bayesian parametric models, where the number of components is finite and fixed. In the next section, we extend the study to Bayesian nonparametric models, which have infinitely many components.

3.1.1 Mutual Angular Process

We first define a prior that has an inductive bias towards components that are more diverse and use it to affect the posterior via Bayes' rule. Following [107], we adopt the notion that a set of component vectors is considered more diverse if the pairwise angles between them are larger. We desire the prior to have two traits. First, to favor diversity, it should assign a higher density to components having larger mutual angles. Second, it should facilitate posterior inference: in Bayesian learning, the ease of posterior inference relies heavily on the prior [14, 95]. Here we define a mutual angular process that possesses these two traits, based on Bayesian networks [48] and the von Mises-Fisher [67] distribution.

For technical convenience, we decompose each real-valued component vector $\mathbf{a}$ into $\mathbf{a} = g\tilde{\mathbf{a}}$, where $g = \|\mathbf{a}\|_2$ is the magnitude and $\tilde{\mathbf{a}}$ is the direction ($\|\tilde{\mathbf{a}}\|_2 = 1$). Let $\tilde{\mathcal{A}} = \{\tilde{\mathbf{a}}_i\}_{i=1}^K$ denote the directional vectors. Note that the angle between two vectors is invariant to their magnitudes; thereby, the mutual angles of the component vectors in $\mathcal{A}$ are the same as the angles of the directional vectors in $\tilde{\mathcal{A}}$. We first construct a random process that prefers the vectors in $\tilde{\mathcal{A}}$ to possess large angles. The basic idea is to use a Bayesian network (BN) to characterize the dependency among the directional vectors and to design local probabilities that entail an inductive bias towards large mutual angles. In the BN shown in Figure 3.1, each node $i$ represents a directional vector $\tilde{\mathbf{a}}_i$ and its parents $\mathrm{pa}(\tilde{\mathbf{a}}_i)$ are nodes $1,\ldots,i-1$. We define a local probability at node $i$ to encourage $\tilde{\mathbf{a}}_i$ to have large mutual angles with $\tilde{\mathbf{a}}_1,\ldots,\tilde{\mathbf{a}}_{i-1}$. Since these directional vectors lie on a sphere, we use the von Mises-Fisher (vMF) distribution to model them. The probability density function of the vMF distribution is $f(\mathbf{x}) = C_p(\kappa)\exp(\kappa\boldsymbol{\mu}^\top\mathbf{x})$, where the random variable $\mathbf{x}\in\mathbb{R}^p$ lies on a $(p-1)$-dimensional sphere ($\|\mathbf{x}\|_2 = 1$), $\boldsymbol{\mu}$ is the mean direction with $\|\boldsymbol{\mu}\|_2 = 1$, $\kappa > 0$ is the concentration parameter and $C_p(\kappa)$ is the normalization constant. The local probability $p(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i))$ at node $i$ is defined as a vMF distribution whose density is

$$p(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i)) = C_p(\kappa)\exp\Big(\kappa\Big(-\frac{\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j}{\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\|_2}\Big)^\top\tilde{\mathbf{a}}_i\Big) \tag{3.1}$$

with mean direction $-\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j/\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\|_2$. Now we explain why this local probability favors large mutual angles. Since $\tilde{\mathbf{a}}_i$ and $\tilde{\mathbf{a}}_j$ are unit-length vectors, $\tilde{\mathbf{a}}_j^\top\tilde{\mathbf{a}}_i$ is the cosine of the angle between $\tilde{\mathbf{a}}_i$ and $\tilde{\mathbf{a}}_j$. If $\tilde{\mathbf{a}}_i$ has larger angles with $\{\tilde{\mathbf{a}}_j\}_{j=1}^{i-1}$, then the aggregated negative cosine similarity $(-\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j)^\top\tilde{\mathbf{a}}_i$ is larger, and accordingly $p(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i))$ is larger. This holds for all $i > 1$. As a result, $p(\tilde{\mathcal{A}}) = p(\tilde{\mathbf{a}}_1)\prod_{i=2}^K p(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i))$ is larger if the directional vectors have larger mutual angles. For the magnitudes $\{g_i\}_{i=1}^K$ of the components, which have nothing to do with the mutual angles, we sample each $g_i$ independently from a gamma distribution with shape parameter $\alpha_1$ and rate parameter $\alpha_2$. The generative process of $\mathcal{A}$ is summarized as follows:

  Draw $\tilde{\mathbf{a}}_1 \sim \mathrm{vMF}(\boldsymbol{\mu}_0, \kappa)$.
  For $i = 2,\ldots,K$, draw $\tilde{\mathbf{a}}_i \sim \mathrm{vMF}\big(-\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j/\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\|_2,\; \kappa\big)$.
  For $i = 1,\ldots,K$, draw $g_i \sim \mathrm{Gamma}(\alpha_1, \alpha_2)$.
  For $i = 1,\ldots,K$, let $\mathbf{a}_i = g_i\tilde{\mathbf{a}}_i$.
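A rough simulation of this generative process is sketched below (assuming SciPy >= 1.11, which provides scipy.stats.vonmises_fisher; all parameter values here are made up for illustration).

```python
import numpy as np
from scipy.stats import vonmises_fisher, gamma

def sample_vmf(mu, kappa, rng):
    # One vMF draw, flattened to a 1-D array regardless of SciPy's output shape.
    return np.reshape(vonmises_fisher(mu, kappa).rvs(1, random_state=rng), -1)

rng = np.random.default_rng(7)
K, p, kappa = 5, 10, 50.0            # number of components, dimension, concentration
alpha1, alpha2 = 2.0, 1.0            # gamma shape and rate for the magnitudes

mu0 = np.zeros(p); mu0[0] = 1.0      # mean direction for the first component
A_dir = [sample_vmf(mu0, kappa, rng)]
for i in range(1, K):
    s = -np.sum(A_dir, axis=0)       # mean direction pushes a_i away from its parents
    A_dir.append(sample_vmf(s / np.linalg.norm(s), kappa, rng))

g = gamma.rvs(alpha1, scale=1.0 / alpha2, size=K, random_state=rng)   # magnitudes g_i
A = g[:, None] * np.array(A_dir)     # components a_i = g_i * a_tilde_i (rows)

cos = np.array(A_dir) @ np.array(A_dir).T
angles = np.degrees(np.arccos(np.clip(cos[np.triu_indices(K, 1)], -1, 1)))
print("pairwise angles (deg):", angles)   # tend to be large, i.e., diverse directions
```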

Figure 3.1: A Bayesian network representation of the mutual angular process, with nodes $\tilde{a}_1, \tilde{a}_2, \tilde{a}_3, \ldots, \tilde{a}_K$.

The probability distribution over $A$ can be written as

$p(A) = C_p(\kappa)\exp(\kappa\mu_0^\top \tilde{a}_1)\prod_{i=2}^{K} C_p(\kappa)\exp\Big(\kappa\Big(-\tfrac{\sum_{j=1}^{i-1}\tilde{a}_j}{\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2}\Big)^\top \tilde{a}_i\Big)\prod_{i=1}^{K} \tfrac{\alpha_2^{\alpha_1} g_i^{\alpha_1-1} e^{-g_i\alpha_2}}{\Gamma(\alpha_1)}.   (3.2)

According to the factorization theorem [48] of Bayesian networks, it is easy to verify that $\int p(A)\,dA = 1$, so $p(A)$ is a proper prior.

When inferring the posterior of the model components using a variational inference method, we need to compute the expectation of $1/\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2$ appearing in the local probability $p(\tilde{a}_i \mid \mathrm{pa}(\tilde{a}_i))$, which is extremely difficult. To address this issue, we define an alternative local probability that achieves a similar modeling effect as $p(\tilde{a}_i \mid \mathrm{pa}(\tilde{a}_i))$ but greatly facilitates variational inference. We re-parametrize the local probability defined in Eq. (3.1) using a Gibbs measure:

$\hat{p}(\tilde{a}_i \mid \mathrm{pa}(\tilde{a}_i)) = C_p\big(\kappa\|\textstyle\sum_{j=1}^{i-1}\tilde{a}_j\|_2\big)\exp\big(\kappa(-\textstyle\sum_{j=1}^{i-1}\tilde{a}_j)^\top \tilde{a}_i\big),   (3.3)

which is another vMF distribution with mean direction $-\sum_{j=1}^{i-1}\tilde{a}_j/\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2$ and concentration parameter $\kappa\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2$. The exponent of this re-parameterized local probability is proportional to $(-\sum_{j=1}^{i-1}\tilde{a}_j)^\top \tilde{a}_i$, which measures the summed negative cosine similarity between $\tilde{a}_i$ and its parent vectors; thereby, $\hat{p}(\tilde{a}_i \mid \mathrm{pa}(\tilde{a}_i))$ still encourages large mutual angles between vectors, as $p(\tilde{a}_i \mid \mathrm{pa}(\tilde{a}_i))$ does. The difference is that in $\hat{p}(\tilde{a}_i \mid \mathrm{pa}(\tilde{a}_i))$ the term $\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2$ is moved from the denominator of the mean direction into the normalizer, so we avoid computing the expectation of $1/\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2$. This incurs a new problem: we need to compute the expectation of $\log C_p(\kappa\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2)$, which is also hard due to the complex form of the $C_p(\cdot)$ function. We manage to resolve this problem as detailed in Section 3.1.3. We refer to the mutual angular process (MAP) defined in Eq. (3.2) as the type I MAP and that with the local probability defined in Eq. (3.3) as the type II MAP.

3.1.2 Case Study: Bayesian Mixture of Experts Model

In this section, we apply the mutual angular process to promote diversity in the Bayesian mixture of experts model (MEM). MEM assumes that the input data inherently belongs to multiple latent groups and that a single expert is allocated to each group to handle the data therein. Here we consider a binary classification task and assume there are $K$ latent experts, where each expert is a classifier with coefficient vector $\beta_k$. Given a test example $x$, it first goes through a gate

function that decides which expert is best suited to classify this example; the decision is made in a probabilistic way. A discrete variable $z$ indicates the selected expert, and the probability that $z = k$ (assigning example $x$ to expert $k$) is $\exp(\eta_k^\top x)/\sum_{j=1}^K \exp(\eta_j^\top x)$, where $\eta_k$ is a coefficient vector characterizing the selection of expert $k$. Given the selected expert, the example is classified using the coefficient vector $\beta_k$ corresponding to that expert. As of now, the model parameters $B = \{\beta_k\}_{k=1}^K$ and $H = \{\eta_k\}_{k=1}^K$ are deterministic variables. Next, we place a prior over them to enable Bayesian learning [99] and desire this prior to promote diversity among the experts, so as to retain the advantages of diversification stated before. The mutual angular process can be applied to achieve this goal.

3.1.3 A Variational Inference Algorithm

In this section, we develop a variational inference (VI) algorithm for inferring the posteriors of the components under the mutual angular process. The basic idea of VI [94] is to use a simpler variational distribution $q(A)$ to approximate the true posterior by minimizing the Kullback-Leibler divergence between the two distributions, which is equivalent to maximizing the variational lower bound $\mathbb{E}_{q(A)}[\log p(\mathcal{D}\mid A)] + \mathbb{E}_{q(A)}[\log p(A)] - \mathbb{E}_{q(A)}[\log q(A)]$ w.r.t. $q(A)$, where $p(A)$ is the mutual angular process and $p(\mathcal{D}\mid A)$ is the data likelihood. We choose $q(A)$ to be a mean-field variational distribution $q(A) = \prod_{k=1}^K q(\tilde{a}_k)q(g_k)$, where $q(\tilde{a}_k) = \mathrm{vMF}(\tilde{a}_k \mid \hat{a}_k, \hat{\kappa})$ and $q(g_k) = \mathrm{Gamma}(g_k \mid r_k, s_k)$. Given the variational distribution, we first compute the analytical expression of the variational lower bound; in particular, we discuss how to compute $\mathbb{E}_{q(A)}[\log p(A)]$. If $p(A)$ is chosen as MAP-I (Eq. (3.2)), we need to compute $\mathbb{E}\big[\big(-\tfrac{\sum_{j=1}^{i-1}\tilde{a}_j}{\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2}\big)^\top \tilde{a}_i\big]$, which is very difficult to deal with due to the presence of $1/\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2$. Instead we choose MAP-II for the convenience of deriving the variational lower bound. Under MAP-II, we need to compute $\mathbb{E}_{q(A)}[\log Z_i]$ for all $i$, where $Z_i = 1/C_p(\kappa\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2)$ is the partition function of $\hat{p}(\tilde{a}_i \mid \mathrm{pa}(\tilde{a}_i))$. The analytical form of this expectation is also difficult to derive due to the complexity of the $C_p(x)$ function: $C_p(x) = \frac{x^{p/2-1}}{(2\pi)^{p/2} I_{p/2-1}(x)}$, where $I_{p/2-1}(x)$ is the modified Bessel function of the first kind at order $p/2-1$. To address this issue, we derive an upper bound of $\log Z_i$ and compute the expectation of the upper bound, which is relatively easy to perform. Consequently, we obtain a further lower bound of the variational lower bound and learn the variational and model parameters w.r.t. this new bound.

We now derive the upper bound of $\log Z_i$, which equals $\log \int \exp\big(\kappa(-\sum_{j=1}^{i-1}\tilde{a}_j)^\top \tilde{a}_i\big)\,d\tilde{a}_i$. Applying the inequality $\log \int \exp(x)\,dx \le \gamma + \int \log(1 + \exp(x - \gamma))\,dx$ [16], where $\gamma$ is a variational parameter, we have $\log Z_i \le \gamma + \int \log\big(1 + \exp\big(\kappa(-\sum_{j=1}^{i-1}\tilde{a}_j)^\top \tilde{a}_i - \gamma\big)\big)\,d\tilde{a}_i$. Then applying the inequality $\log(1+e^{x}) \le \log(1+e^{\xi}) + \frac{x-\xi}{2} + \frac{g(\xi)-1/2}{2\xi}(x^2 - \xi^2)$ [16], where $\xi$ is another variational parameter and $g(\xi) = 1/(1+\exp(-\xi))$, and evaluating the resulting integrals over the unit sphere (integrals of constants, linear forms and quadratic forms, all of which have closed forms), we obtain a closed-form upper bound on $\log Z_i$ whose only dependence on $\tilde{A}$ is through $\kappa^2\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2^2$; the remaining terms depend only on $\gamma$, $\xi$ and $p$. The expectation of this upper bound is much easier to compute. Specifically, we only need to tackle $\mathbb{E}_{q(A)}[\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2^2]$, which can be computed as $\mathbb{E}_{q(A)}[\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2^2] = \sum_{j=1}^{i-1}\mathrm{tr}(\mathrm{cov}(\tilde{a}_j)) + \sum_{j=1}^{i-1}\sum_{k=1}^{i-1}\mathbb{E}_{q(\tilde{a}_j)}[\tilde{a}_j]^\top \mathbb{E}_{q(\tilde{a}_k)}[\tilde{a}_k]$, where $\mathbb{E}_{q(\tilde{a}_j)}[\tilde{a}_j] = A_p(\hat{\kappa})\hat{a}_j$, $\mathrm{cov}(\tilde{a}_j) = \frac{h(\hat{\kappa})}{\hat{\kappa}} I + \big(1 - \frac{2(\nu+1)}{\hat{\kappa}}h(\hat{\kappa}) - h^2(\hat{\kappa})\big)\hat{a}_j\hat{a}_j^\top$, $h(\hat{\kappa}) = \frac{I_{\nu+1}(\hat{\kappa})}{I_{\nu}(\hat{\kappa})}$, $A_p(\hat{\kappa}) = \frac{I_{p/2}(\hat{\kappa})}{I_{p/2-1}(\hat{\kappa})}$, and $\nu = p/2 - 1$.
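As a sanity check on these moment formulas, the snippet below evaluates $\mathbb{E}_{q(A)}[\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2^2]$ numerically using the modified Bessel functions from SciPy. It is a hedged illustration with our own function names, not the thesis implementation; for large $\hat{\kappa}$, the exponentially scaled Bessel function scipy.special.ive should be used to avoid overflow.

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind

def vmf_moments(a_hat, kappa_hat):
    """Mean and covariance of vMF(a_hat, kappa_hat) via h(kappa) = I_{nu+1}/I_nu."""
    p = a_hat.shape[0]
    nu = p / 2.0 - 1.0
    h = iv(nu + 1.0, kappa_hat) / iv(nu, kappa_hat)
    mean = h * a_hat
    cov = (h / kappa_hat) * np.eye(p) \
        + (1.0 - 2.0 * (nu + 1.0) * h / kappa_hat - h ** 2) * np.outer(a_hat, a_hat)
    return mean, cov

def expected_sq_norm_of_sum(A_hat, kappa_hat):
    """E[|| sum_j a_j~ ||^2] under independent mean-field vMF factors."""
    moments = [vmf_moments(a, kappa_hat) for a in A_hat]
    total_mean = np.sum([m for m, _ in moments], axis=0)
    return float(sum(np.trace(c) for _, c in moments) + total_mean @ total_mean)
```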

3.1.4 Evaluation

Using the Bayesian mixture of experts model as an instance, we conducted experiments to verify the effectiveness and efficiency of the proposed approach. We used two binary-classification datasets, Adult-9 [76] and SUN-Building [104], and compared with two Bayesian priors: a Gaussian prior, which imposes no diversity, and a DPP [129] prior.

Table 3.1: Classification accuracy (%) on the Adult-9 dataset under different numbers of experts K, for the Gaussian, DPP, MAP-I and MAP-II priors.

Table 3.2: Classification accuracy (%) on the SUN-Building dataset under different numbers of experts K, for the Gaussian, DPP, MAP-I and MAP-II priors.

Tables 3.1 and 3.2 show the classification accuracy under different numbers of experts on the Adult-9 and SUN-Building datasets, respectively. From these two tables, we observe the following. First, MAP-(I, II) outperform Gaussian, demonstrating the effectiveness of promoting diversity. Second, MAP-(I, II) outperform DPP, showing their better capability in promoting diversity. We also see that MAP-(I, II) with a smaller K can achieve accuracy comparable to or even better than Gaussian with a large K. For example, on the Adult-9 dataset (Table 3.1), with 5 experts MAP-I achieves an accuracy of 87.1%, which Gaussian cannot reach even with 30 experts. This corroborates the effectiveness of diversification in reducing model size (hence computational complexity) without compromising performance.

To check whether the mutual angular process can better capture infrequent patterns, from the RCV1 [59] dataset we picked a subset of documents (for binary classification) such that the frequencies of the categories (patterns) follow a power-law distribution. Specifically, we chose documents from 9 subcategories (the 1st row of Table 3.3) of the CCAT category as positive instances and randomly sampled 15K documents from non-CCAT categories as negative instances.

Table 3.3: Accuracy on 9 subcategories (C18, C17, C12, C14, C22, C34, C23, C32, C16) of the CCAT category in the RCV1.Binary dataset, reporting for each subcategory the number of documents, the accuracy (%) of Gaussian and MAP-I, and the relative improvement (%).

Table 3.4: Training time (hours) of DPP, MAP-I and MAP-II on Adult-9 and SUN-Building with K = 30.

The 2nd row of Table 3.3 shows the number of documents in each of the 9 categories. The distribution of document frequency follows a power law: frequent categories (such as C18 and C17) have many documents, while infrequent categories (such as C32 and C16) have few. The 3rd and 4th rows show the accuracy achieved by Gaussian and MAP-I on each category, and the 5th row shows the relative improvement of MAP-I over Gaussian. While achieving accuracy comparable to Gaussian on the frequent categories, MAP-I obtains much better performance on the infrequent ones. For example, the relative improvements on the infrequent categories C32 and C16 are 18.1% and 21.7%. This demonstrates that the mutual angular process can effectively capture infrequent patterns. Table 3.4 compares the time (hours) taken by each method to converge, with K set to 30. MAP-II inferred with variational inference (VI) is more efficient than DPP, for which VI is not applicable.

3.2 Diversity-Promoting Learning of Bayesian Nonparametric Models

In this section, we study how to promote diversity in Bayesian nonparametric (BNP) models [31, 32, 41]. In BNP models, the number of components is unlimited and can in principle reach infinity; as more data accumulates, new components are dynamically added. Compared with parametric models, BNP models possess the following advantages: (1) they are highly flexible and adaptive: if new data cannot be well modeled by the existing components, new components are automatically invoked; (2) the best number of components is determined according to the fitness to the data, rather than set manually, which is a challenging task even for domain experts.

3.2.1 Infinite Mutual Angular Process

In the mutual angular process, the components are added one by one, and each new component is encouraged to have large angles with the previous ones. This adding process can be repeated infinitely many times, resulting in an infinite mutual angular process (IMAP) [117] that encourages an infinite number of components to have large mutual angles:

$p(\{\tilde{a}_i\}_{i=1}^{\infty}) = p(\tilde{a}_1)\prod_{i=2}^{\infty} p(\tilde{a}_i \mid \mathrm{pa}(\tilde{a}_i)).   (3.4)$

The factorization theorem [48] of Bayesian networks ensures that $p(\{\tilde{a}_i\}_{i=1}^{\infty})$ integrates to one. The magnitudes $\{g_i\}_{i=1}^{\infty}$ do not affect the angles (hence the diversity) and can be generated independently from a gamma distribution. To this end, the generative process of $\{a_i\}_{i=1}^{\infty}$ can be summarized as follows:

Sample $\tilde{a}_1 \sim \mathrm{vMF}(\mu_0, \kappa)$
For $i = 2, \ldots, \infty$, sample $\tilde{a}_i \sim \mathrm{vMF}\big(-\sum_{j=1}^{i-1}\tilde{a}_j / \|\sum_{j=1}^{i-1}\tilde{a}_j\|_2,\ \kappa\big)$
For $i = 1, \ldots, \infty$, sample $g_i \sim \mathrm{Gamma}(\alpha_1, \alpha_2)$
For $i = 1, \ldots, \infty$, let $a_i = \tilde{a}_i g_i$

The probability distribution over $\{a_i\}_{i=1}^{\infty}$ can be written as

$p(\{a_i\}_{i=1}^{\infty}) = C_p(\kappa)\exp(\kappa\mu_0^\top \tilde{a}_1)\prod_{i=2}^{\infty} C_p(\kappa)\exp\Big(\kappa\Big(-\tfrac{\sum_{j=1}^{i-1}\tilde{a}_j}{\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2}\Big)^\top \tilde{a}_i\Big)\prod_{i=1}^{\infty} \tfrac{\alpha_2^{\alpha_1} g_i^{\alpha_1-1} e^{-g_i\alpha_2}}{\Gamma(\alpha_1)}.   (3.5)$

3.2.2 Case Study: Infinite Latent Feature Model

In this section, on a specific Bayesian nonparametric model, the infinite latent feature model (ILFM) [37], we showcase how to promote diversity among the components with the IMAP. Given a set of data examples $X = \{x_n\}_{n=1}^N$ where $x_n \in \mathbb{R}^D$, ILFM invokes a finite subset of features from an infinite feature collection $A = \{a_k\}_{k=1}^{\infty}$ to construct these examples. Each feature (a component) is parameterized by a vector $a_k \in \mathbb{R}^D$. For each data example $x_n$, a subset of features is selected to construct it. The selection is denoted by a binary vector $z_n \in \{0,1\}^{\infty}$, where $z_{nk} = 1$ indicates that the $k$-th feature is invoked to construct the $n$-th example and $z_{nk} = 0$ otherwise. Given the feature parameter vectors $\{a_k\}_{k=1}^{\infty}$ and the selection vector $z_n$, the example is modeled as $x_n \sim \mathcal{N}(\sum_{k=1}^{\infty} z_{nk}a_k, \sigma^2 I)$. The binary selection vectors $Z = \{z_n\}_{n=1}^N$ can be drawn either from an Indian buffet process (IBP) [32] or via a stick-breaking construction [90]. Let $\mu_k$ be the prior probability that feature $k$ is present in a data example, and let the features be permuted such that their prior probabilities are in decreasing order: $\mu_{(1)} > \mu_{(2)} > \cdots$. According to the stick-breaking construction, these prior probabilities are generated as $\nu_k \sim \mathrm{Beta}(\alpha, 1)$, $\mu_{(k)} = \nu_k \mu_{(k-1)} = \prod_{l=1}^{k}\nu_l$. Given $\mu_{(k)}$, the binary indicator is generated as $z_{nk} \mid \mu_{(k)} \sim \mathrm{Bernoulli}(\mu_{(k)})$. To reduce the redundancy among the features, we impose the IMAP over their parameter vectors $A$ to encourage them to be mutually different, which results in the IMAP-LFM model.
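For illustration, the following sketch simulates the stick-breaking construction of the feature-inclusion probabilities and the binary selection matrix Z described above, truncated at a finite K_max for simulation purposes. The truncation and all names are our own choices, not part of the model definition.

```python
import numpy as np

def stick_breaking_selection(N, alpha=2.0, K_max=50, seed=0):
    """mu_(k) = prod_{l<=k} nu_l with nu_l ~ Beta(alpha, 1); z_nk ~ Bernoulli(mu_(k))."""
    rng = np.random.default_rng(seed)
    nu = rng.beta(alpha, 1.0, size=K_max)
    mu = np.cumprod(nu)                        # decreasing inclusion probabilities
    Z = rng.binomial(1, mu, size=(N, K_max))   # binary feature-selection matrix
    return mu, Z

# Given a (truncated) feature matrix A of shape (K_max, D), each example is then
# modeled as x_n ~ N(Z[n] @ A, sigma^2 I).
```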

Table 3.5: L2 test error

Dataset          IBP-LFM    PYP-LFM    IMAP-LFM
Yale             447±7      432±3      419±4
Block (x10^-2)   6.3
AR               926±4      939±11     871±7
EEG              5382                  575±21
Piano (x10^4)    5.3

3.2.3 An MCMC Sampling Algorithm

In this section, we develop a sampling algorithm to infer the posteriors of $A$ and $Z$ in the IMAP-LFM model. Two major challenges need to be addressed. First, the prior over $A$ is not conjugate to the likelihood function $p(X)$. Second, the parameter vectors $A$ are usually high-dimensional, which renders slow mixing. To address the first challenge, we adopt the slice sampling algorithm [90]. This algorithm introduces an auxiliary slice variable $s \mid Z, \mu_{(1:\infty)} \sim \mathrm{Uniform}[0, \mu^{*}]$, where $\mu^{*} = \min\{1, \min_{k:\,\exists n,\,z_{nk}=1}\mu_k\}$ is the prior probability of the last active feature. A feature $k$ is active if there exists an example $n$ such that $z_{nk} = 1$ and is inactive otherwise. In the sequel, we discuss the sampling of the other variables.

Sample new features. Let $K^{*}$ be the maximal feature index with $\mu_{(K^{*})} > s$ and $K^{+}$ be the index such that all active features have index $k < K^{+}$ ($K^{+}$ itself is an inactive feature). If the new value of $s$ makes $K^{*} \ge K^{+}$, then we draw $K^{*} - K^{+} + 1$ new (inactive) features, including their parameter vectors and prior probabilities. The prior probabilities $\{\mu_{(k)}\}$ are drawn sequentially from $p(\mu_{(k)} \mid \mu_{(k-1)}) \propto \exp\big(\alpha\sum_{n=1}^{N}\frac{1}{n}(1-\mu_{(k)})^{n}\big)\mu_{(k)}^{\alpha-1}(1-\mu_{(k)})^{N}\,\mathbb{I}(0 \le \mu_{(k)} \le \mu_{(k-1)})$ using adaptive rejection sampling (ARS) [33]. The parameter vectors are drawn sequentially from $p(a_k \mid \{a_j\}_{j=1}^{k-1}) = p(\tilde{a}_k \mid \{\tilde{a}_j\}_{j=1}^{k-1})p(g_k) = C_p(\kappa)\exp\big(\kappa\big(-\tfrac{\sum_{j=1}^{k-1}\tilde{a}_j}{\|\sum_{j=1}^{k-1}\tilde{a}_j\|_2}\big)^\top \tilde{a}_k\big)\cdot\tfrac{\alpha_2^{\alpha_1}g_k^{\alpha_1-1}e^{-g_k\alpha_2}}{\Gamma(\alpha_1)}$; that is, we draw $\tilde{a}_k$ from $p(\tilde{a}_k \mid \{\tilde{a}_j\}_{j=1}^{k-1})$, which is a von Mises-Fisher distribution, draw $g_k$ from a gamma distribution, and multiply $\tilde{a}_k$ and $g_k$ together, since they are independent. For each new feature $k$, the corresponding binary selection variables $z_{:,k}$ are initialized to zero.

Sample $a_k$ ($k = 1, \ldots, K^{+}$). We draw $a_k = \tilde{a}_k g_k$ from

(a)  $p(\tilde{a}_k g_k \mid \mathrm{rest}) \propto p(\tilde{a}_k g_k \mid \{a_j\}_{j\ne k}^{K^{+}})\prod_{n=1}^{N}p(x_n \mid z_{n,1:K^{+}}, \{a_j\}_{j\ne k}^{K^{+}}, \tilde{a}_k g_k)$,

where $p(\tilde{a}_k g_k \mid \{a_j\}_{j\ne k}^{K^{+}}) \propto p(\tilde{a}_k g_k \mid \{a_i\}_{i=1}^{k-1})\prod_{j=k+1}^{K^{+}}p(a_j \mid \{a_i\}_{i\le j-1, i\ne k}, \tilde{a}_k g_k)$ and $p(x_n \mid z_{n,1:K^{+}}, \{a_j\}_{j\ne k}^{K^{+}}, \tilde{a}_k g_k) = \mathcal{N}(x_n \mid \tilde{a}_k g_k + \sum_{j\ne k}^{K^{+}} z_{nj}a_j, \sigma^2 I)$. In the vanilla IBP latent feature model [32], the prior over $a_k$ is a Gaussian distribution, which is conjugate to the Gaussian likelihood; in that case the posterior $p(a_k \mid \mathrm{rest})$ is again a Gaussian, from which samples can easily be drawn. In equation (a), however, the posterior does not have a closed-form expression, since the prior $p(\tilde{a}_k g_k \mid \{a_j\}_{j\ne k}^{K^{+}})$ is no longer conjugate, which makes the sampling very challenging. We sample $\tilde{a}_k$ and $g_k$ separately. $g_k$ can be efficiently sampled using the Metropolis-Hastings (MH) [39] algorithm. For $\tilde{a}_k$, which is a random vector, the sampling is much more difficult: the random-walk-based MH algorithm mixes slowly when the dimension of $\tilde{a}_k$ is large.

In addition, $\tilde{a}_k$ lies on the unit sphere, and the sampling algorithm should preserve this geometric constraint. To address these two issues, we use a Riemann manifold Hamiltonian Monte Carlo (RM-HMC) method [18, 34]. HMC leverages Hamiltonian dynamics to produce distant proposals for the Metropolis-Hastings algorithm, enabling faster exploration of the state space and faster mixing. RM-HMC introduces an auxiliary vector $v \in \mathbb{R}^d$ and defines the Hamiltonian function $H(\tilde{a}_k, v) = -\log p(\tilde{a}_k \mid \mathrm{rest}) + \tfrac{1}{2}\log|G(\tilde{a}_k)| + \tfrac{1}{2}v^\top G(\tilde{a}_k)^{-1}v$, where $G$ is the metric tensor associated with the Riemann manifold, which in our case is the unit sphere. After a transformation of the coordinate system, $H(\tilde{a}_k, v)$ can be re-written as $H(\tilde{a}_k, v) = -\log p(\tilde{a}_k \mid \mathrm{rest}) + \tfrac{1}{2}v^\top v$. Here $p(\tilde{a}_k \mid \mathrm{rest})$ need not be normalized; up to an additive constant, $\log p(\tilde{a}_k \mid \mathrm{rest})$ is the sum of (i) the prior term $\kappa\big(-\tfrac{\sum_{j=1}^{k-1}\tilde{a}_j}{\|\sum_{j=1}^{k-1}\tilde{a}_j\|_2}\big)^\top \tilde{a}_k$, (ii) the terms contributed by the conditionals of the later features $j = k+1, \ldots, K^{+}$, whose mean directions involve $\tilde{a}_k g_k$, and (iii) the likelihood terms $\sum_{n=1}^{N}\big[\tfrac{1}{\sigma^2}(x_n - \sum_{j\ne k}^{K^{+}} z_{nj}a_j)^\top \tilde{a}_k g_k - \tfrac{1}{2\sigma^2}\|\tilde{a}_k g_k\|_2^2\big]$. A new sample of $\tilde{a}_k$ can be generated by approximately solving a system of differential equations characterizing the Hamiltonian dynamics on the manifold [34]. Following [18], we solve this problem based upon geodesic flow.

3.2.4 Evaluation

We evaluate the effectiveness of the IMAP in alleviating overfitting, reducing model size without sacrificing modeling power, and capturing infrequent patterns, on ten datasets from different domains including text, images, sound and EEG signals. For each dataset, we use IMAP-LFM to learn the latent features A on the training set, then use A to reconstruct the test data. Reconstruction performance is measured by L2 error (the smaller, the better). Meanwhile, we use A to infer the representations Z of the test data and perform data clustering on Z; clustering performance is measured by accuracy [19]. We compare with two baselines: the Indian buffet process LFM (IBP-LFM) [37] and the Pitman-Yor process LFM (PYP-LFM) [89].

Table 3.6: Clustering accuracy (%)

Dataset      IBP-LFM    PYP-LFM    IMAP-LFM
Reuters      45.4
TDT          48.3
20-News      21.5
15-Scenes    22.7
Caltech

Table 3.5 presents the L2 error on the test sets of the first five datasets. IMAP-LFM achieves much lower L2 error than IBP-LFM and PYP-LFM. Table 3.6 shows the clustering accuracy on the last five datasets, which have class labels. IMAP-LFM outperforms the two baseline methods by a large margin. We conjecture the reasons are two-fold. First, IMAP places a diversity-biased structure over the latent features, which alleviates overfitting. In both IBP-LFM and PYP-LFM, the weight vectors of the latent features are drawn independently from a Gaussian distribution, which cannot characterize the relations among features. In contrast, IMAP-LFM imposes a structure over the features, encouraging them to be diverse and less redundant.

Table 3.7: Number of features

Dataset       IBP-LFM    PYP-LFM    IMAP-LFM
Yale          201±5      220±8      165±4
Block-Image   8±2        9±4        11±4
AR            257±11     193±5      176±8
EEG           14±2       9±2        12±1
Piano         37±4       34±6       28±3
Reuters       354±12     326±5      294±7
TDT           297±6      311±9      274±3
20-News       442±8      408±3      369±5
15-Scenes     192±3      218±5      171±8
Caltech                  113±6      96±6

Table 3.8: Per-category precision@100 (%) on the Reuters dataset, reporting each category's frequency, the precision@100 of IBP-LFM and IMAP-LFM, and the relative improvement (%).

This structural constraint reduces the model complexity of LFM, thereby alleviating overfitting on the training data and achieving better reconstruction of the test data. Second, diversified features presumably have higher representational power and are able to capture richer information and subtler aspects of the data, thus achieving a better modeling effect.

Table 3.7 shows the number of features (mean ± standard error) obtained by each model at convergence. Analyzing Tables 3.5-3.7 together, we see that IMAP-LFM uses many fewer features to achieve better performance than the baselines. For instance, on the Reuters dataset, with 294 features IMAP-LFM achieves a 48.2% clustering accuracy, whereas IBP-LFM uses 60 more features but achieves 2.8% (absolute) lower accuracy. This suggests that IMAP is able to reduce the size of the LFM (the number of features) without sacrificing modeling power. Because of IMAP's diversity-promoting mechanism, the learned features bear less redundancy and are highly complementary to each other; each feature captures a significant amount of information, so a small number of such features is sufficient to model the data well. In contrast, the features in IBP-LFM and PYP-LFM are drawn independently from a base distribution, which lacks a mechanism to reduce redundancy. IMAP achieves a more significant reduction in the number of features on datasets with larger dimensions, possibly because higher-dimensional data contains more redundancy, giving IMAP more room to improve.

To verify whether IMAP helps to better capture infrequent patterns, we perform a retrieval task on the learned features of the Reuters dataset and measure the precision@100 on each category (treated as a pattern). A category with more than 1000 documents is labeled as frequent. Table 3.8 shows the per-category precision; the last row shows the relative improvement of IMAP-LFM over IBP-LFM, defined as (P_imap - P_ibp)/P_ibp. As can be seen, on the infrequent categories 3-9, IMAP-LFM achieves much better precision than IBP-LFM, while on the frequent categories 1 and 2, their performances are comparable.

This demonstrates that IMAP is able to better capture infrequent patterns without losing modeling power on the frequent patterns. IMAP promotes diversity among the components, which pushes some of them away from the frequent patterns toward the infrequent patterns, giving the infrequent ones a better chance of being captured.

On the 20-News dataset, we visualize the learned features. For a latent feature with parameter vector w, we pick the top 10 words corresponding to the largest values in w. Table 3.9 shows 5 exemplar features learned by IBP-LFM and IMAP-LFM. As can be seen, the features learned by IBP-LFM have much overlap and redundancy and are hard to distinguish, whereas those learned by IMAP-LFM are more diverse.

Table 3.9: Visualization of features learned on the 20-News dataset (top-10 words per feature)

IBP-LFM
Feature 1: government, house, baghdad, weapons, tax, years, white, united, state, bill
Feature 2: saddam, white, clinton, united, time, president, baghdad, iraq, un, lewinsky
Feature 3: nuclear, iraqi, weapons, united, work, spkr, president, people, baghdad, state
Feature 4: turkish, soviet, government, weapons, number, enemy, good, don, citizens, due
Feature 5: game, gold, team, season, april, man, number, work, baseball, years

IMAP-LFM
Feature 1: president, clinton, legal, years, state, baghdad, church, white, united, iraqi
Feature 2: clinton, government, lewinsky, nuclear, work, minister, weapons, india, years, white
Feature 3: olympic, team, hockey, good, baseball, gold, ball, medal, april, winter
Feature 4: school, great, institute, program, study, japanese, office, reading, level, number
Feature 5: space, operation, shuttle, research, life, satellite, launch, lunar, device, program

Chapter 4 Diversity-Promoting Learning III: Analysis

In the previous two chapters, we demonstrated empirically that the proposed regularizers and Bayesian priors are effective at (1) better capturing infrequent patterns, (2) achieving better generalization performance and (3) reducing model size without sacrificing modeling power. In this chapter, we provide theoretical analysis of why these regularizers and priors achieve such effects.

4.1 Analysis of Better Capturing of Infrequent Patterns

On one study case, distance metric learning (DML) under nonconvex Bregman matrix divergence regularization, we analyze why promoting diversity can improve performance on infrequent patterns [115]. The analysis focuses on the following component matrix:

$A^{*} = \arg\min_{A}\ \mathbb{E}_{S,D}\Big[\tfrac{1}{|S|}\sum_{(x,y)\in S}\|Ax - Ay\|_2^2 + \tfrac{1}{|D|}\sum_{(x,y)\in D}\max(0,\ t - \|Ax - Ay\|_2^2) + \lambda\Omega_{\phi}(A)\Big]   (4.1)$

We assume there are $K$ classes (patterns), where class $k$ has a distribution $p_k$ with expectation $\mu_k$. Each data sample in $S$ and $D$ is drawn from the distribution of one specific class. We define $\xi_k = \mathbb{E}_{x\sim p_k}[\sup_{\|v\|_2=1}|v^\top x|]$ and $\xi = \max_k \xi_k$. Further, we assume $A^{*}$ has full rank $R$ (the number of projection vectors), and let $U\Lambda U^\top$ denote the eigen-decomposition of $A^{*\top}A^{*}$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_R)$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_R$.

We define an imbalance factor (IF) to characterize the performance on infrequent classes. Each class $k$ is characterized by its expectation $\mu_k$, and the Mahalanobis distance between two classes $j$ and $k$ is defined as $d_{jk} = (\mu_j - \mu_k)^\top A^{*\top}A^{*}(\mu_j - \mu_k)$. The IF among all classes is then defined as

$\eta = \frac{\max_{j\ne k} d_{jk}}{\min_{j\ne k} d_{jk}}.   (4.2)$
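To make the definition concrete, the toy snippet below computes the imbalance factor for a given projection matrix and a set of class means; the random values are purely illustrative and not drawn from our experiments.

```python
import numpy as np

def imbalance_factor(A, means):
    """eta = max_{j!=k} d_jk / min_{j!=k} d_jk with d_jk = ||A mu_j - A mu_k||^2."""
    M = A @ means.T                       # project the class means
    K = means.shape[0]
    d = [np.sum((M[:, j] - M[:, k]) ** 2)
         for j in range(K) for k in range(K) if j != k]
    return max(d) / min(d)

rng = np.random.default_rng(0)
means = rng.normal(size=(5, 20))          # K = 5 classes in 20 dimensions
A = rng.normal(size=(10, 20))             # R = 10 projection vectors
print(imbalance_factor(A, means))
```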

The motivation for this definition is as follows. Two frequent classes have more training examples and hence contribute more to learning $A^{*}$ in Eq. (4.1), so DML tends to make their distance $d_{jk}$ large; two infrequent classes contribute less to learning (and Eq. (4.1) is constrained by similar pairs, which must have small distances), so their distance may end up being small. Consequently, if the classes are imbalanced, some between-class distances can be large while others are small, resulting in a large IF. In the following theorem, we derive upper bounds on the IF.

Theorem 1. Let $C$ denote the ratio between $\max_{j\ne k}\|\mu_j - \mu_k\|_2^2$ and $\min_{j\ne k}\|\mu_j - \mu_k\|_2^2$ and assume $\max_{j,k}\|\mu_j - \mu_k\|_2 \le B_0$. If $R \ge K-1$ and $\xi$ is smaller than an explicit threshold determined by $B_0$, $\lambda_{K-1}$ and $\mathrm{tr}(\Lambda)$, then the following bounds on the IF $\eta$ hold.

For the VND regularizer $\Omega_{\mathrm{vnd}}(A^{*})$: $\eta \le C\,g(\Omega_{\mathrm{vnd}}(A^{*}))$, where $g(\cdot)$ is an increasing function defined as follows. Let $f(c) = c^{1/(c+1)}(1 + 1/c)$, which is strictly increasing on $(0, 1]$ and strictly decreasing on $[1, \infty)$, and let $f^{-1}(c)$ be the inverse of $f(c)$ on $[1, \infty)$; then $g(c) = f^{-1}(2 - c)$ for $c < 1$.

For the LDD regularizer $\Omega_{\mathrm{ldd}}(A^{*})$: $\eta \le 4C\,e^{\Omega_{\mathrm{ldd}}(A^{*})}$.

As can be seen, the bounds are increasing functions of the BMD regularizers $\Omega_{\mathrm{vnd}}(A^{*})$ and $\Omega_{\mathrm{ldd}}(A^{*})$. Decreasing these regularizers reduces the upper bounds on the imbalance factor, thereby bringing the performance on infrequent classes closer to that on frequent classes.

4.2 Analysis of Generalization Errors

In this section, we analyze how diversity-promoting regularization can improve the generalization performance on unseen data.

4.2.1 Generalization Error Analysis for Angular Constraints

Using neural networks as a study case, we analyze how the angular constraints affect the generalization performance. The generalization error of a hypothesis $f$ represented by a neural network is defined as $L(f) = \mathbb{E}_{(x,y)\sim p^{*}}[\ell(f(x), y)]$, where $p^{*}$ is the distribution of the input-output pair $(x, y)$ and $\ell(\cdot)$ is the loss function. The training error is $\hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n}\ell(f(x^{(i)}), y^{(i)})$, where $n$ is the number of training samples. Let $f^{*} \in \arg\min_{f\in\mathcal{F}} L(f)$ be the true risk minimizer and $\hat{f} \in \arg\min_{f\in\mathcal{F}} \hat{L}(f)$ be the empirical risk minimizer. We aim to analyze the generalization error $L(\hat{f})$ of the empirical risk minimizer, which can be decomposed as $L(\hat{f}) = [L(\hat{f}) - L(f^{*})] + L(f^{*})$, where $L(\hat{f}) - L(f^{*})$ is the estimation error and $L(f^{*})$ is the approximation error.

For simplicity, we start with a simple fully connected network with one hidden layer of $m$ units, used for univariate regression (one output unit) with squared loss; the analysis extends readily to fully connected networks with multiple hidden layers, multiple output units, and classification losses. Let $x \in \mathbb{R}^d$ be the input vector

with $\|x\|_2 \le C_1$, and let $y$ be the response value with $|y| \le C_2$. Let $w_j \in \mathbb{R}^d$ be the weights connecting the $j$-th hidden unit to the input units, with $\|w_j\|_2 \le C_3$. Let $\alpha$ be the vector whose entry $\alpha_j$ is the weight connecting hidden unit $j$ to the output unit, with $\|\alpha\|_2 \le C_4$. We assume the activation function $h(t)$ applied at the hidden units is Lipschitz continuous with constant $L$; commonly used activation functions such as the rectified linear unit $h(t) = \max(0, t)$, tanh $h(t) = (e^t - e^{-t})/(e^t + e^{-t})$, and sigmoid $h(t) = 1/(1 + e^{-t})$ are Lipschitz continuous with $L = 1$, $1$, and $0.25$, respectively. Let $\mathcal{F}$ denote the hypothesis set $\{f \mid f(x) = \sum_{j=1}^{m}\alpha_j h(w_j^\top x)\}$ and $\mathcal{A}$ denote the loss function set $\{\ell \mid \ell(f(x), y) = (f(x) - y)^2\}$. The estimation error represents how well the algorithm is able to learn and is bounded in Theorem 2.

Theorem 2. With probability at least $1 - \delta$,

$L(\hat{f}) - L(f^{*}) \le 8(\sqrt{J} + C_2)(2LC_1C_3C_4 + C_4|h(0)|)\sqrt{\tfrac{m}{n}} + (\sqrt{J} + C_2)^2\sqrt{\tfrac{2\log(2/\delta)}{n}},   (4.3)$

where $J = mC_4^2h^2(0) + L^2C_1^2C_3^2C_4^2((m-1)\tau + 1) + 2\sqrt{m}\,C_1C_3C_4^2L|h(0)|\sqrt{(m-1)\tau + 1}$.

Note that the right-hand side is an increasing function of $\tau$: a smaller $\tau$ induces a lower estimation error bound. The bound goes to zero as the sample size $n$ goes to infinity. The rate of our bound is $O(\sqrt{m/n})$; since $n \gg m$, we can omit $m$, and the bound becomes $O(1/\sqrt{n})$, which matches existing bounds [10].

The approximation error represents how well the hypothesis set $\mathcal{F}$ can approximate a target function $g = \mathbb{E}[y \mid x]$, where the error is measured by $\min_{f\in\mathcal{F}}\|f - g\|_{L_2}$ with $\|f - g\|_{L_2}^2 = \int_x (f(x) - g(x))^2\,dx$. Following [9], we assume the target function $g$ satisfies a smoothness condition expressed through the first moment of its Fourier representation: $\int \|\omega\|_2|\tilde{g}(\omega)|\,d\omega \le C_g$, where $\tilde{g}(\omega)$ is the Fourier representation of $g(x)$. The following theorem states the approximation error.

Theorem 3. Given $C_g > 0$, if $C_1C_3 \ge 1$, $C_4 \ge 2\sqrt{m}\,C_g$, and $m$ exceeds a threshold determined by $\arccos(\tau)$, then there is a function $f \in \mathcal{F}$ such that

$\|f - g\|_{L_2} \le 2C_g\Big(\tfrac{1}{\sqrt{m}} + \tfrac{\ln(C_1C_3)}{C_1C_3}\Big) + 4mC_gC_1C_3\sin\Big(\tfrac{\min(3m\arccos(\tau),\ \pi)}{2}\Big).$

This theorem implies that whether the angular constraints (ACs) are used has a significant influence on the approximation error bound: without ACs ($\tau = 1$), the bound is a decreasing function of $m$ (the number of hidden units); with ACs ($\tau < 1$), the bound increases with $m$. This striking phase change indicates the impact of ACs. For a fixed $m$, the bound decreases with $\tau$, which implies that a stronger regularization (smaller $\tau$) incurs a larger approximation error. Our bound generalizes that of [9]: when $\tau = 1$, the second term vanishes and the bound reduces to the one in [9], which is a decreasing function of $m$. When $\tau < 1$, the second term increases with $m$ at rate $O(m)$, which dominates the first term $O(1/\sqrt{m})$, so the overall bound increases with $m$. This is because a larger number of hidden units makes it harder to satisfy the pairwise ACs, causing the function space $\mathcal{F}$ to shrink rapidly and its approximation power to decrease quickly.

The two theorems together show that $\tau$ incurs a tradeoff between estimation error and approximation error: decreasing $\tau$ reduces the estimation error but enlarges the approximation error. Since the generalization error is the sum of the two, $\tau$ has an optimal value that yields the minimal generalization error.

4.2.2 Estimation Error Analysis for Nonconvex Bregman Matrix Divergence Regularizers

Among the three nonconvex Bregman matrix divergence (BMD) regularizers, based on the squared Frobenius norm, the von Neumann divergence, and the log-determinant divergence (LDD), we choose LDD to perform the analysis, on a distance metric learning (DML) model. The analysis proceeds in two steps. First, we prove that decreasing the LDD regularizer amounts to decreasing the absolute value of the cosine similarity (AVCS) between component vectors. Then we show that the upper bound of the estimation error is an increasing function of the AVCS. Combining the two, we conclude that reducing LDD decreases the estimation error bound (EEB).

We begin with the first step. Given the $m$ component vectors $A = \{a_i\}_{i=1}^m$ in DML, let $s_{ij} = \frac{|\langle a_i, a_j\rangle|}{\|a_i\|_2\|a_j\|_2}$ be the AVCS between two components $a_i$ and $a_j$, and let $s(A) = \max_{1\le i<j\le m}s_{ij}$ be the maximal AVCS among all pairs of components. We prove that the gradient of the LDD regularizer $\Omega_{\mathrm{ldd}}(A)$ is an ascent direction of $s(A)$, which is formalized in the following lemma.

Lemma 1. Let $\hat{A} = \{\hat{a}_i\}_{i=1}^m$ be another set of vectors where $\hat{a}_i = a_i + \eta g_i$ and $g_i$ is the gradient of $\Omega_{\mathrm{ldd}}(A)$ w.r.t. $a_i$. Then there exists $\delta > 0$ such that for all $\eta \in (0, \delta)$, $s(\hat{A}) \ge s(A)$.

This implies that $\Omega_{\mathrm{ldd}}(A)$ and $s(A)$ are closely aligned: decreasing $\Omega_{\mathrm{ldd}}(A)$ effectively decreases $s(A)$. Next, we show that the EEB of DML is an increasing function of $s(A)$. We assume the norm bounds $\|a\|_2 \le B$ for each component vector $a$ and $\|x\|_2 \le C$ for each data example $x$. Let $u$ denote the hypothesis of DML and $n$ the number of training examples. Then we have:

Theorem 4. With probability at least $1 - \delta$,

$L(u) - \hat{L}(u) \le \frac{8B^2C^2m}{(1 + \exp(-J))\sqrt{n}} + \log(1 + \exp(J))\sqrt{\frac{2\log(2/\delta)}{n}},   (4.4)$

where $J = 4B^2C^2((m-1)s(A) + 1)$.

Combining the two steps, namely (1) decreasing LDD decreases $s(A)$ and (2) decreasing $s(A)$ decreases the EEB, we conclude that decreasing LDD decreases the EEB.
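As a small illustration of the quantities appearing in Lemma 1 and Theorem 4, the snippet below computes the maximal absolute cosine similarity s(A) of a component matrix and the resulting J; the matrix and the constants B, C are arbitrary toy values, not those used in our experiments.

```python
import numpy as np

def max_abs_cosine(A):
    """s(A): maximal absolute cosine similarity over all pairs of rows of A."""
    U = A / np.linalg.norm(A, axis=1, keepdims=True)
    G = np.abs(U @ U.T)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 50))          # m = 10 component vectors
B, C = 1.0, 1.0
s = max_abs_cosine(A)
J = 4 * B**2 * C**2 * ((A.shape[0] - 1) * s + 1)
print(s, J)
```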

Chapter 5 Large-Scale Learning via System and Algorithm Co-design

In this chapter, we present the other line of research in this thesis: large-scale machine learning, where we design efficient distributed systems [105, 111] for healthcare applications at scale. We adopt a system-algorithm co-design approach: (1) the system design is tailored to the unique mathematical properties of ML algorithms, and (2) the algorithms are re-designed to better exploit the system architecture.

5.1 Sufficient Factor Property

We first introduce a mathematical property of a large family of ML models that admit the following optimization formulation:

(P)   $\min_{W}\ \frac{1}{N}\sum_{i=1}^{N}f_i(Wa_i) + h(W),   (5.1)$

where the model is parameterized by a matrix $W \in \mathbb{R}^{J\times D}$. The loss function $f_i(\cdot)$ is typically defined over a set of training samples $\{(a_i, b_i)\}_{i=1}^N$, with the dependence on $b_i$ suppressed. We allow $f_i(\cdot)$ to be either convex or nonconvex, smooth or nonsmooth (with a subgradient everywhere); examples include the $\ell_2$ loss and the multiclass logistic loss, among others. The regularizer $h(W)$ is assumed to admit an efficient proximal operator $\mathrm{prox}_h(\cdot)$; for example, $h(\cdot)$ could be an indicator function of convex constraints, the $\ell_1$ norm, the $\ell_2$ norm, or the trace norm, to name a few. The vectors $a_i$ and $b_i$ can represent observed features, supervised information (e.g., class labels in classification, response values in regression), or even unobserved auxiliary information (such as sparse codes in sparse coding [72]) associated with data sample $i$. The key property we exploit below stems from the matrix-vector multiplication $Wa_i$. The optimization problem (P) can represent a rich set of ML models [21, 56, 72, 118]. To solve (P), it is common to employ either (proximal) stochastic gradient descent (SGD) [21, 26, 42, 60] or stochastic dual coordinate ascent (SDCA) [43, 84], both of which are popular and well-established parallel optimization techniques.

Proximal SGD: In proximal SGD, a stochastic estimate of the gradient, $\triangle W$, is first computed over one data sample (or a mini-batch of samples) in order to update $W$ via $W \leftarrow W - \eta\triangle W$ (where $\eta$ is the learning rate). Following this, the proximal operator $\mathrm{prox}_{\eta h}(\cdot)$ is applied to $W$. Notably, the stochastic gradient $\triangle W$ in (P) can be written as the outer product of two vectors, $\triangle W = uv^\top$, where $u = \frac{\partial f(Wa_i, b_i)}{\partial(Wa_i)}$ and $v = a_i$, according to the chain rule. Later, we will show that this low-rank structure of $\triangle W$ can greatly reduce inter-worker communication.

Stochastic DCA: SDCA applies to problems (P) where $f_i(\cdot)$ is convex and $h(\cdot)$ is strongly convex (e.g., when $h(\cdot)$ contains the squared $\ell_2$ norm); it solves the dual problem of (P) via stochastic coordinate ascent on the dual variables. Introducing the dual matrix $U = [u_1, \ldots, u_N] \in \mathbb{R}^{J\times N}$ and the data matrix $A = [a_1, \ldots, a_N] \in \mathbb{R}^{D\times N}$, the dual problem of (P) can be written as

(D)   $\min_{U}\ \frac{1}{N}\sum_{i=1}^{N}f_i^{*}(-u_i) + h^{*}\big(\tfrac{1}{N}UA^\top\big),   (5.2)$

where $f_i^{*}(\cdot)$ and $h^{*}(\cdot)$ are the Fenchel conjugate functions of $f_i(\cdot)$ and $h(\cdot)$, respectively. The primal and dual matrices $W$ and $U$ are connected by $W = \nabla h^{*}(Z)$, where the auxiliary matrix $Z := \frac{1}{N}UA^\top$. Algorithmically, we need to update the dual matrix $U$, the primal matrix $W$, and the auxiliary matrix $Z$: in each iteration, we pick a random data sample $i$ and compute the stochastic update $\triangle u_i$ by minimizing (D) while holding $\{u_j\}_{j\ne i}$ fixed. The dual variable is updated via $u_i \leftarrow u_i + \triangle u_i$, the auxiliary variable via $Z \leftarrow Z + \frac{1}{N}\triangle u_i a_i^\top$, and the primal variable via $W \leftarrow \nabla h^{*}(Z)$. Similar to SGD, the update of $Z$ is the outer product of two vectors, $\triangle u_i$ and $a_i$, which can be exploited to reduce communication cost.

Sufficient factor property in SGD and SDCA: In both SGD and SDCA, the update of the parameter matrix can be computed as the outer product of two vectors, which we call sufficient factors (SFs). The SFs that are generated with respect to one data example and that atomically produce one parameter update are referred to as a sufficient factor group (SFG). This property can be leveraged to improve the communication efficiency of distributed ML systems: instead of communicating parameter or update matrices among machines, we can communicate the SFs and reconstruct the update matrices locally at each machine. Because the SFs are much smaller in size, synchronization costs can be dramatically reduced.
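To illustrate the sufficient factor property on a concrete instance of (P), the sketch below computes the per-example stochastic gradient of multiclass logistic regression as an outer product of two sufficient factors. It is a toy illustration with our own naming, not the Orpheus implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sufficient_factors(W, a, label):
    """Return (u, v) such that the gradient of the cross-entropy loss is u v^T."""
    u = softmax(W @ a)          # length J: predicted class probabilities
    u[label] -= 1.0             # minus the one-hot target
    v = a                       # length D: the input features
    return u, v

# Each worker sends (u, v): O(J + D) numbers instead of the O(J * D) update matrix.
# A receiver reconstructs the update locally, e.g. W -= lr * np.outer(u, v).
```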

5.2 Orpheus: a Light-Weight Peer-to-Peer System

In this section, we present a peer-to-peer framework whose system design is driven by the sufficient factor property.

5.2.1 Communication

We leverage the sufficient factor property to reduce communication cost. To ensure consistency among the different parameter replicas, the updates computed at different machines need to be exchanged. One popular system architecture that enables this is the parameter server (PS) [21, 24, 26, 62, 100], which conceptually consists of a server machine that maintains the shared state of the parameters and a set of worker machines, each holding a local cache of the parameters. In PS, the updates computed at the worker machines are aggregated at the server and applied to the shared state; the shared state is subsequently sent back to the workers to refresh their local caches. When PS is used to train matrix-parameterized models (MPMs), the update and parameter matrices, which can contain billions of elements [55], are transferred, incurring substantial communication overhead.

Sufficient factor broadcasting. Since an update matrix (UM) can be computed from a few SFs, sending a UM from machine A to machine B can equivalently be done by first transferring the SFs from A to B and then producing the UM from the received SFs at B. The communication cost of transmitting SFs is O(J + K), which is linear in the matrix dimensions, whereas that of transmitting UMs is O(JK), which is quadratic in the matrix dimensions. Hence SF transfer can greatly reduce communication overhead, and the transformation from SFs to a UM is mathematically exact, without compromising computational correctness. In PS, the one-sided communication cost from workers to the server can be reduced by transmitting SFs [21]: each worker sends its new SFGs to the server, where the received SFGs are transformed into UMs to update the shared parameter state. However, since the parameter matrix cannot be computed from a few SFs, the newly updated parameters still need to be sent from the server to the workers as a matrix, which incurs high communication overhead. To avoid transmitting parameter matrices, and inspired by [108], Orpheus adopts a decentralized peer-to-peer architecture in which worker machines synchronize their parameter replicas by exchanging updates in the form of SFs. In each clock, each worker computes SFGs and broadcasts them to the other workers; meanwhile, each worker converts the SFGs received from remote workers into UMs, which are then added to its parameter replica. We refer to this computation model as sufficient factor broadcasting. Unlike PS, the P2P architecture does not maintain a shared parameter state and completely avoids transmitting any type of matrix. While SF transfer greatly reduces communication cost, it increases computation overhead: each SFG is transformed into the same update multiple times (once on each receiver). However, in-memory computation is usually much more efficient than inter-machine network communication, especially with the advent of GPU computing, so the reduction in communication cost outweighs the increase in computation overhead.

Random multicast. While the P2P transfer of SFs greatly reduces the size of each message (from a matrix to a few vectors), its limitation is that SFGs need to be sent from each machine to every other machine, which makes the number of messages per clock quadratic in the number of machines P. To address this issue, Orpheus adopts random multicast: in each clock, each machine randomly selects Q (Q < P - 1) machines to send its SFs to. This cuts the number of messages per clock from O(P^2) to O(PQ). Though communicating with only a subset of machines causes synchronization delays, the correctness of execution can still be guaranteed, thanks to the tolerance of ML programs to errors [42].
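A toy sketch of the peer selection underlying random multicast is given below. The actual Orpheus communication layer is a distributed system component, so this only illustrates the topology chosen at each clock; names are ours.

```python
import numpy as np

def random_multicast_targets(P, Q, rng):
    """targets[p] lists the Q peers that machine p sends its SFGs to this clock."""
    targets = []
    for p in range(P):
        peers = [q for q in range(P) if q != p]
        targets.append(rng.choice(peers, size=Q, replace=False).tolist())
    return targets

rng = np.random.default_rng(0)
print(random_multicast_targets(P=8, Q=3, rng=rng))
```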

Unlike a deterministic multicast topology [60, 108], where each machine communicates with a fixed set of machines throughout the application run, random multicast provides several benefits. First, dynamically changing the topology in each clock gives every two machines a chance to communicate directly, which facilitates more symmetric synchronization. Second, random multicast is more robust to network connection failures, since the failure of a network connection between two machines does not affect their communication with other machines. Third, random multicast makes resource elasticity simpler to implement: adding and removing machines requires minimal coordination with the existing ones, unlike a deterministic topology, which must be modified every time a worker joins or leaves.

SF selection. In ML practice, parameter updates are usually computed over a small batch (whose size typically ranges from tens to hundreds) of examples. At each clock, a batch of B training examples is selected and an update is generated with respect to each example. When represented as matrices, these B updates can be aggregated into a single matrix to communicate, so the communication cost is independent of B. This is not the case for sufficient factor transfer: the B SFGs cannot be aggregated into one single SFG and must be transferred individually, so the communication cost grows linearly with B. To alleviate this cost, Orpheus provides SF selection (SFS), which chooses a subset of C SFGs (where C < B) that best represent the entire batch to communicate. We design an efficient sampling-based algorithm called joint matrix column subset selection (JMCSS) to perform SFS. Given the K matrices M_0, ..., M_{K-1}, where M_k stores the k-th SF of all SFGs, JMCSS selects a subset of non-redundant column vectors from each matrix to approximate the entire matrix. The selection of columns in different matrices is tied together: if the i-th column is selected in one matrix, then the i-th column of every other matrix must be selected as well, so that the selected columns atomically form SFGs. Under JMCSS, the aggregated update generated from the C SFGs is close to that computed from the entire batch, so SFS does not compromise parameter-synchronization quality.

5.2.2 Computation

In this section, we leverage the sufficient factor property to speed up computation.

SF-based representation of parameters. We first present an SF-based representation (SFR) of the parameters. At clock T, the parameter state W_T is mathematically equal to W_0 + sum_{t=1}^{T} dW_t, where dW_t is the update matrix computed at clock t and W_0 is the initialization of the parameters. As noted earlier, dW_t can be computed from an SFG G_t, i.e., dW_t = h(G_t), using a transformation h. To initialize the parameters, we can randomly generate an SFG G_0 and let W_0 = h(G_0). To this end, the parameter state can be represented as W_T = sum_{t=0}^{T} h(G_t), i.e., using a set of SFGs. The SFR can be leveraged to reduce computation cost. First of all, since no parameter matrix needs to be maintained, we do not need to explicitly compute the update matrix in each clock, which would otherwise incur O(JK) cost.

Second, in most matrix-parameterized models, a major computational workload is multiplying the parameter matrix by a vector, whose cost is quadratic in the matrix dimensions. We aim to reduce this cost by executing the multiplication in an SF-aware way, as detailed in the following subsection.

SF-aware multiplication and tree rewriting. We first define what SF-aware multiplication is, starting with a simple example. In MLR, each SFG contains two SFs u, v, whose outer product uv^T produces a parameter update. Consequently, the SFR of W_T is sum_{t=0}^{T} u_t v_t^T. The multiplication between W_T and a vector x can be computed as W_T x = (sum_{t=0}^{T} u_t v_t^T) x = sum_{t=0}^{T} u_t (v_t^T x), which first calculates the inner product v_t^T x between v_t and x, then multiplies the inner product with u_t. The computation cost is O(T(J + K)), which is linear in the matrix dimensions and grows with T. As another example, in BFGS [11], each SFG contains two SFs and the update is computed as dW = uu^T - vv^T. Then W_T is represented as sum_{t=0}^{T} (u_t u_t^T - v_t v_t^T), and W_T x can be computed as (sum_{t=0}^{T} (u_t u_t^T - v_t v_t^T)) x = sum_{t=0}^{T} (u_t (u_t^T x) - v_t (v_t^T x)), whose cost is O(T(J + K)) as well. When T is small, SF-aware multiplication is highly efficient.

Orpheus uses a multiplication tree (MT) to perform SF-aware multiplication. An MT is rewritten from an updating tree (UT) built by parsing the compute-update function, which is either defined by users or automatically identified by the system. The leaf nodes of the UT are SFs and the internal nodes are operations. An in-order traversal of the UT transforms the SFs into an update matrix: at each internal node, the associated operation is applied to the data objects (either SFs or matrices) at its two children, and the update matrix is obtained at the root. Given this UT, it is rewritten into an MT as follows. For each subtree in the UT, if the operation at the root is a vector outer product and the children of the root are two SFs sv0 and sv1, then the subtree is transformed into a new tree with three layers: at the root is scalar-vector multiplication; the two children of the root are sv0 and a vector inner product node; and the two children of the inner product node are sv1 and x (the vector involved in W_T x). The addition and subtraction operations representing matrix addition/subtraction in the UT are replaced with vector addition/subtraction in the MT. To compute W_T x, where W_T is represented with T + 1 SFGs, we feed the SFs in each SFG together with x into the leaf nodes of the MT and perform an in-order traversal to obtain a vector at the root; W_T x is obtained by adding up the vectors generated from all SFGs.
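The following sketch illustrates SF-aware matrix-vector multiplication for the MLR case, where each SFG is a pair (u_t, v_t); function names are ours and the snippet is an illustration rather than the Orpheus multiplication-tree implementation.

```python
import numpy as np

def sf_aware_matvec(sfgs, x):
    """Compute (sum_t u_t v_t^T) @ x without materializing the parameter matrix."""
    result = np.zeros_like(sfgs[0][0], dtype=float)
    for u, v in sfgs:
        result += u * (v @ x)   # inner product first, then scale u: O(J + K) per SFG
    return result

# Sanity check against the dense computation:
#   W = sum(np.outer(u, v) for u, v in sfgs)
#   assert np.allclose(W @ x, sf_aware_matvec(sfgs, x))
```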

5.3 Convergence Analysis

In this section, we provide theoretical analysis of the convergence of algorithms under the sufficient factor broadcasting computation model; specifically, we study the convergence of minibatch SGD. Since SFB is a peer-to-peer decentralized computation model, we need to show that the parameter copies on different workers converge to the same limiting point without centralized coordination, even under delays in communication due to bounded-asynchronous execution. In this respect, we differ from analyses of centralized parameter server systems [42], which instead show the convergence of the global parameters on the central server.

We wish to solve the optimization problem $\min_{W}\sum_{m=1}^{M}f_m(W)$, where $M$ is the number of training data minibatches and $f_m$ is the loss function on the $m$-th minibatch. Assume the training data minibatches $\{1, \ldots, M\}$ are divided into $P$ disjoint subsets $\{S_1, \ldots, S_P\}$, with $|S_p|$ denoting the number of minibatches in $S_p$. Denote $F = \sum_{m=1}^{M}f_m$ as the total loss and, for $p = 1, \ldots, P$, $F_p := \sum_{j\in S_p}f_j$ as the loss on $S_p$ (on the $p$-th machine). Consider a distributed system with $P$ machines. Each machine $p$ keeps a local variable $W_p$ and the training data in $S_p$. At each iteration, machine $p$ draws one minibatch $I_p$ uniformly at random from the partition $S_p$ and computes the partial gradient $\sum_{j\in I_p}\nabla f_j(W_p)$. Each machine updates its local variable by accumulating the partial updates from all machines. Denote $\eta_c$ as the learning rate at the $c$-th iteration on every machine. The partial update generated by machine $p$ at its $c$-th iteration is $U_p(W_p^c, I_p^c) = -\eta_c|S_p|\sum_{j\in I_p^c}\nabla f_j(W_p^c)$. Note that $I_p^c$ is random and the factor $|S_p|$ restores unbiasedness in expectation. The local update rule of machine $p$ is then

$W_p^c = W^0 + \sum_{q=1}^{P}\sum_{t=0}^{\tau_p^q(c)}U_q(W_q^t, I_q^t),\qquad 0 \le (c-1) - \tau_p^q(c) \le s,   (5.3)$

where $W^0$ is the common initializer for all $P$ machines, and $\tau_p^q(c)$ is the number of iterations machine $q$ has transmitted to machine $p$ when machine $p$ conducts its $c$-th iteration. Clearly, $\tau_p^p(c) = c$. Note that we also require $\tau_p^q(c) \le c - 1$, i.e., machine $p$ will not use any partial updates of machine $q$ that are too far forward; this avoids correlation in the theoretical analysis. Hence, machine $p$ (at its $c$-th iteration) accumulates the updates generated by machine $q$ up to iteration $\tau_p^q(c)$, which is restricted to be at most $s$ iterations behind. This formulation, in which $s$ is the maximum staleness allowed between any update and any worker, covers bulk synchronous parallel (BSP) full broadcasting ($s = 0$) and bounded-asynchronous full broadcasting ($s > 0$). The following standard assumptions are needed for our analysis.

Assumption 1. (1) For all $j$, $f_j$ is continuously differentiable and $F$ is bounded from below; (2) $\nabla F$ and $\nabla F_p$ are Lipschitz continuous with constants $L_F$ and $L_p$, respectively, and let $L = \sum_{p=1}^{P}L_p$; (3) There exist $B$ and $\sigma^2$ such that for all $p$ and $c$, we have (almost surely) $\|W_p^c\| \le B$ and $\mathbb{E}\big\||S_p|\sum_{j\in I_p}\nabla f_j(W) - \nabla F_p(W)\big\|_2^2 \le \sigma^2$.

Our analysis is based on the following auxiliary update:

$W^c = W^0 + \sum_{q=1}^{P}\sum_{t=0}^{c-1}U_q(W_q^t, I_q^t).   (5.4)$

Compared to the local update (5.3) on machine $p$, this auxiliary update accumulates all $c-1$ updates generated by all machines, instead of only the $\tau_p^q(c)$ updates that machine $p$ has access to. We show that all local machine parameter sequences are asymptotically consistent with this auxiliary sequence.

Theorem 5. Let $\{W_p^c\}$, $p = 1, \ldots, P$, and $\{W^c\}$ be the local sequences and the auxiliary sequence generated by SFB for problem (P) (with $h \equiv 0$), respectively. Under Assumption 1 and with the learning rate set to $\eta_c = O\big(\sqrt{\tfrac{1}{L\sigma^2 Psc}}\big)$, we have:

$\liminf_{c\to\infty}\mathbb{E}\|\nabla F(W^c)\| = 0$; hence there exists a subsequence of $\nabla F(W^c)$ that almost surely vanishes;

$\lim_{c\to\infty}\max_p\|W^c - W_p^c\| = 0$, i.e., the maximal disagreement between all local sequences and the auxiliary sequence converges to 0 (almost surely);

Figure 5.1: Convergence time (hours) of Orpheus and baseline systems for MLR (left, varying numbers of CPU machines; baselines: Spark, Gopal, TensorFlow, Bosen, MXNet) and LSTM (right, varying numbers of GPU machines; baselines: TensorFlow, MXNet).

There exists a common subsequence of $\{W_p^c\}$ and $\{W^c\}$ that converges almost surely to a stationary point of $F$, with rate $\min_{c\le C}\mathbb{E}\big\|\sum_{p=1}^{P}\nabla F_p(W_p^c)\big\|^2 \le O\big(\sqrt{\tfrac{L\sigma^2 Ps}{C}}\big)$.

Intuitively, Theorem 5 says that, given a properly chosen learning rate, all local worker parameters $\{W_p^c\}$ eventually converge to stationary points (i.e., local minima) of the objective function $F$, despite the fact that SF transmission can be delayed by up to $s$ iterations. Thus, SFB learning is robust even under bounded-asynchronous communication (such as SSP). Our analysis differs from [12] in two ways: (1) Bertsekas and Tsitsiklis [12] explicitly maintain a consensus model, which would require transmitting the parameter matrix among the worker machines, a communication bottleneck that we avoid; (2) we allow subsampling in each worker machine. Accordingly, our theoretical guarantee is probabilistic, instead of the deterministic one in [12].

5.4 Evaluation

We evaluate Orpheus on two ML applications, multiclass logistic regression (MLR) and the long short-term memory network (LSTM), and compare with several state-of-the-art ML systems.

Convergence speed. We first evaluate the convergence speed of our system. By convergence, we mean that two systems reach the same converged loss value; a better system takes less time to converge. Figure 5.1 shows the convergence time of each system under different numbers of machines. Orpheus converges faster than parameter-server (PS) based systems, including Bosen [101], TensorFlow [4] and MXNet [20]. On MLR, with 34 CPU machines, the speedups of Orpheus over Bosen, TensorFlow and MXNet are 4.8x, 5.9x and 5.6x, respectively. On LSTM, with 40 GPU machines, Orpheus is 5.0x faster than TensorFlow and 4.1x faster than MXNet. This is because Orpheus is more efficient in both communication and computation. By transmitting small vectors using P2P SF transfer, reducing the number of network messages using random multicast, and reducing the number of sent SFs using SF selection, Orpheus greatly reduces network traffic and the time spent waiting for network communication. In contrast, PS-based systems transmit large matrices, which incurs substantial communication overhead. As for computation, the SF-aware multiplication in Orpheus significantly reduces computation time by bringing matrix-level computation down to the vector level, which is not explored in PS systems.

Figure 5.2: Scalability with more machines: (left) Orpheus-MLR, speedup vs. number of CPU machines; (middle) Orpheus-LSTM, speedup vs. number of GPU machines; (right) comparison of speedups with baseline systems (Spark, Gopal, TensorFlow, Bosen, MXNet, FlexiFaCT) on MLR and LSTM.

Figure 5.3: (Left) Convergence time in Orpheus under SF-aware and ordinary matrix-vector multiplication for the two applications. (Right) Network traffic (TB) in Spark, Bosen and Orpheus for MLR under different numbers of classes (30k, 100k, 325k).

Orpheus also outperforms other non-PS systems. Using 34 CPU machines, Orpheus is 13.1x and 10.2x faster than Spark and Gopal on MLR, and 10.1x faster than FlexiFaCT. Gopal outperforms Spark, possibly because it uses a better distributed algorithm based on model parallelism; however, it is slower than PS systems due to the disk-IO overhead of Hadoop, as is FlexiFaCT, which is also a Hadoop-based system. Spark is at least two times slower than the PS systems (implemented in C++) due to the overhead incurred by Resilient Distributed Datasets and the Java Virtual Machine.

Scalability. We evaluated how Orpheus scales as the number of machines increases. The results on MLR and LSTM are shown in Figure 5.2 (left and middle), where we observe close to linear (ideal) speedup. With 34 CPU machines, Orpheus achieves a 30.4x speedup on MLR; with 40 GPU machines, a 35.4x speedup is achieved on LSTM. The scalability of Orpheus is better than that of the baseline systems, as shown in Figure 5.2 (right), in which we measure the speedups for MLR when the number of CPU machines increases from 12 to 34 and for LSTM when the number of GPU machines increases from 12 to 40.

SF-aware computation. To evaluate the efficiency of SF-aware matrix-vector multiplication (MVM), we replaced it with ordinary MVM (in the first phase of the two-phase hybrid). Figure 5.3 (left) compares the convergence time under these two ways of multiplication for the two applications. As can be seen, SF-aware multiplication greatly speeds up convergence.

Figure 5.4: Comparison of Orpheus and Bosen: (left) number of clocks vs. running time (seconds); (middle) loss vs. number of clocks; (right) loss vs. running time (seconds).

Figure 5.5: (Left) How the system parameter Q in random multicast affects the running time of Orpheus for MLR. (Right) How the system parameter C in SF selection affects the running time of Orpheus for MLR.

At the early stage of execution, the computational complexity of SF-aware MVM is linear in the matrix dimensions while that of ordinary MVM is quadratic; hence SF-aware MVM is much more efficient.

P2P transfer of SFs. We compared the communication overhead of Orpheus with Spark and Bosen. The experiments were performed on MLR with 12 machines. To avoid the confounding effects of other system features, we switched SSP to BSP and multicast to broadcast, replaced SF-aware multiplication with ordinary multiplication, and did not use SF selection. Figure 5.3 (right) shows the network traffic (TB) of the three systems under different numbers of classes in MLR. Orpheus has significantly lower network traffic (more than 10 times lower) than Bosen and Spark, because Orpheus communicates small vectors while Bosen and Spark transmit large matrices. In Figure 5.4, the left panel shows that Orpheus executes more clocks per second than Bosen, thanks to its faster network message transfer. On the other hand, Orpheus and Bosen achieve the same decrease in loss value per clock (middle panel), because Orpheus converts the SFs into exactly the same update matrices as in Bosen. Combining the two aspects, Orpheus achieves faster convergence (right panel).

Random multicast (RM). We studied how the system parameter Q, the number of destinations each machine sends messages to in RM, affects the running time of Orpheus. Figure 5.5 (left) shows the convergence time of MLR under varying Q. As can be seen, multicast (e.g., Q = 4) reduces the running time compared with broadcast (Q = 33), because it decreases the number of network messages. However, if Q is too small (e.g., Q = 1), multicast performs worse than broadcast, due to severe synchronization delays.

Figure 5.6: (Left) Convergence time in Orpheus under deterministic and random multicast for MLR and LSTM. (Right) Relative increase of convergence time when network connections fail under deterministic and random multicast.

We compared RM with a deterministic multicast (DM) scheme [60, 108] where each sending machine sends messages to a fixed set of 4 receiving machines; the set of 4 receiving machines is different for each sending machine and is chosen to balance network load across the cluster and prevent network hot spots. Figure 5.6 (left) shows the convergence time of MLR and LSTM under RM and DM. In both applications, RM takes less time than DM to converge. This is because the randomly changing multicast topology gives each pair of machines some chance to communicate directly, thus facilitating more symmetric (hence faster) synchronization of parameter replicas. To examine the robustness of RM against network connection failures, we simulated the effect that in each iteration the connections between 10% of the machine pairs are broken randomly. Figure 5.6 (right) shows the relative increase of convergence time when failures happen. As can be seen, the relative increase under RM is much smaller than that under DM, confirming that RM is more robust.

SF Selection (SFS). We studied how the system parameter C, the number of selected SFs in SFS, affects the convergence of Orpheus applications. Figure 5.5 (right) shows the convergence time for MLR under varying C. Compared with communicating the entire batch (C = 100), selecting a subset (e.g., C = 25) of SFs to communicate significantly speeds up convergence, thanks to the reduced network traffic. On the other hand, C cannot be too small, as this incurs large approximation errors in the parameter updates. For example, the convergence time under C = 5 is worse than that under larger values of C.
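The two system parameters studied above can be summarized in a small sketch. The following is a simplified simulation of how a machine might pick its Q multicast destinations each clock and select C sufficient factors to transmit. It is illustrative only: the SF "importance" score used below (the product of vector norms) is an assumption made for the example, not Orpheus's actual selection criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

def multicast_destinations(my_rank, num_machines, Q):
    """Random multicast: each clock, pick Q distinct peers uniformly at random.
    Q = num_machines - 1 recovers full broadcast; small Q reduces message count."""
    peers = [r for r in range(num_machines) if r != my_rank]
    return rng.choice(peers, size=Q, replace=False)

def select_sfs(sfs, C):
    """SF selection: transmit only C of the locally generated sufficient factors.
    Here we keep those with the largest update magnitude ||u|| * ||v||; this
    scoring rule is an assumption for illustration only."""
    scores = [np.linalg.norm(u) * np.linalg.norm(v) for u, v in sfs]
    top = np.argsort(scores)[-C:]
    return [sfs[i] for i in top]

# Example: machine 3 in a 12-machine cluster, Q = 4, C = 25 out of 100 SFs.
dests = multicast_destinations(my_rank=3, num_machines=12, Q=4)
sfs = [(rng.standard_normal(64), rng.standard_normal(32)) for _ in range(100)]
sent = select_sfs(sfs, C=25)
print(dests, len(sent))
```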

Chapter 6
Applications to Healthcare

In this chapter, we apply the diversity-promoting learning techniques and large-scale learning systems to several applications in the healthcare domain.

6.1 Discharge Medication Prediction at Admission Time

For a newly admitted patient, it is important to predict the medications that will be placed on the patient at discharge time based on the information available at admission time. A successful prediction of discharge medications provides physicians with guidance on what type of medication regimen to plan for and what changes to the initial medications may occur during the inpatient stay. Specifically, in an inpatient setting, patients are admitted on their home medications, and due to various reasons, including the cause of admission, the condition of the patient, the diagnosis, and other co-morbidities, the patient's medications are changed throughout the inpatient stay and can be different at the time of discharge. For example, a chronic kidney disease patient with chronic heart failure and hypertension could have been admitted for a heart failure exacerbation and then require changes to their anti-hypertensive medication. In this case, it would be helpful for the physician to understand, through analysis of past cases, which medications are better to add or remove, given that in such situations one medication can improve one disease at the cost of exacerbating another. Sometimes it is difficult for human physicians to balance the pros and cons in such a situation, and they are left without a good way to make that decision. To help human physicians predict discharge medications more accurately and in a more timely manner, we investigate a machine learning approach.

Two issues make this task challenging. First, the information available upon admission is mostly documented in unstructured clinical notes (called admission notes), such as past medical history, family and social history, allergies, etc. Compared with structured information such as labs and vital signs, free-form text is more difficult for machines to process and understand. The notes contain synonyms, abbreviations, and misspellings. Distilling semantic patterns from such unstructured and noisy text is very challenging. Second, a typical pharmacological treatment usually involves multi-medication therapy, where medications are prescribed in combination because they have been shown (in clinical guidelines or medical consensus) to have a certain impact on mortality/disease progression when used together.

For example, for patients who have had a recent stroke while already on aspirin, dual antiplatelet therapy with aspirin and clopidogrel will be recommended for future stroke prevention. How to automatically discover and leverage such pharmacological correlations among medications is crucial for more accurate multiple-medication prediction and is highly non-trivial, as it requires consideration of the interactions between medications.

Methods

The prediction of discharge medications can be formulated as a subset selection problem: given the admission note (including admission medications) and the $K$ candidate medications $Y = \{1, \dots, K\}$, we aim at predicting a subset $S \subseteq Y$ that is most likely to be prescribed to the patient at discharge time. This is a hard combinatorial problem since $S$ has exponentially many possible choices. The selection of $S$ needs to consider two factors: (1) the medications in $S$ should be highly relevant to the admission note; (2) the relations among medications, including co-occurrence and adversarial and synergistic interactions, should be incorporated, which effectively eliminates clinically inconsistent medications and reduces the search space of $S$, but greatly complicates computation. We aim at designing methods that are able to capture correlations of any order while remaining computationally tractable.

To achieve this goal, we resort to the determinantal point process (DPP) [50], which defines a probability distribution over subsets. Given a set of items $\{a_i\}_{i=1}^K$, each represented with a vector $a_i$, the DPP computes a kernel matrix $L \in \mathbb{R}^{K \times K}$, where $L_{ij} = k(a_i, a_j)$ and $k(\cdot,\cdot)$ is a kernel function. The probability of a subset of items indexed by $S \subseteq \{1, \dots, K\}$ is then defined as
$$p(S) = \frac{\det(L_S)}{\det(L + I)} \qquad (6.1)$$
where $L_S \equiv [L_{ij}]_{i,j \in S}$ denotes the restriction of $L$ to the entries indexed by the elements of $S$, $\det(\cdot)$ denotes the determinant of a matrix, and $I$ is the identity matrix. The determinant enables the DPP to capture high-order relations among items. To understand this, we first present the geometric interpretation of $\det(L_S)$. According to the kernel trick [83], $k(a_i, a_j)$ can be written as $\phi(a_i)^\top \phi(a_j)$, where $\phi(\cdot)$ is a reproducing kernel feature map [83]. Then $\det(L_S)$ is essentially the (squared) volume of the parallelepiped formed by the vectors $\{\phi(a_i) \mid i \in S\}$ [51]. The size of this volume is collectively determined by all these vectors in a global way, which hence captures the high-order correlation among them. Another way to understand why the determinant entails high-order correlation is to expand $\det(L_S)$ as a sum of terms, each involving the multiplication of $|S|$ kernel function values, which hence captures correlations of $|S|$-th order. While able to represent high-order correlations, the DPP is computationally efficient: its normalizer $\det(L + I)$ can be computed in polynomial (cubic) time, as opposed to the exponential complexity in conditional random fields. (A small numerical sketch of Eq. (6.1) is given below.)

We apply the DPP to capture the correlations among medications: given the representations of the $K$ medications $\{a_i\}_{i=1}^K$, we compute the kernel matrix $L$ and define a probability over medication subsets according to Eq. (6.1). For the medication-medication kernel function $k(a_i, a_j)$, we parameterize it using a medication correlation network (MCN) whose inputs are $a_i$ and $a_j$ and whose output is a scalar indicating the correlation of the two medications. To represent medications, we leverage professional medical articles that describe various aspects of the medications, including what conditions/diseases the medication can treat, side effects, dosage, and so on.
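As a concrete illustration of Eq. (6.1), the sketch below computes p(S) for a small kernel matrix. The item embeddings and the RBF kernel are arbitrary choices for the example, not the learned medication kernel described above.

```python
import numpy as np
from itertools import combinations

def dpp_prob(L, S):
    """p(S) = det(L_S) / det(L + I), Eq. (6.1)."""
    Z = np.linalg.det(L + np.eye(L.shape[0]))
    if len(S) == 0:
        return 1.0 / Z                       # det of the empty submatrix is 1
    idx = np.array(sorted(S))
    return np.linalg.det(L[np.ix_(idx, idx)]) / Z

# Toy example with K = 4 items embedded in 2-D and an RBF kernel.
A = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
L = np.exp(-0.5 * np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1))

# Probabilities over all 2^K subsets sum to one, and near-duplicate items
# (items 0 and 1) are less likely to be co-selected than dissimilar ones.
subsets = [S for r in range(5) for S in combinations(range(4), r)]
print(sum(dpp_prob(L, S) for S in subsets))        # ~1.0
print(dpp_prob(L, (0, 1)), dpp_prob(L, (0, 2)))    # co-selection of similar items is rarer
```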

Given the article of a medication, we feed it into a text encoding network (TEN), which produces a representation vector $a$ that is later fed into the MCN. As stated earlier, the selection of medications relies not only on their correlations, but also on their dependency relations with the admission note, represented with a vector $x$. We use a medication-note dependency network (MNDN) to define a score function $g(a_i, x)$ measuring the dependency between $x$ and medication $i$. To learn $x$, we feed the admission note into the TEN, which shares the same architecture and parameters with the TEN used for embedding medication articles. To simultaneously capture medication-note dependency and medication-medication correlation, we incorporate $g(a_i, x)$ into the kernel function of the DPP. On top of the kernel function $k(a_i, a_j)$ measuring the correlation between medications $i$ and $j$, we define a new kernel
$$k(a_i, a_j \mid x) = g(a_i, x)\, k(a_i, a_j)\, g(a_j, x) \qquad (6.2)$$
which is conditioned on the input $x$. Under this conditional kernel parameterized by deep networks, we obtain a deep conditional DPP (DCDPP):
$$p(S \mid x) = \frac{\det(L_S(x))}{\det(L(x) + I)} \qquad (6.3)$$
where $L_{ij}(x) = k(a_i, a_j \mid x)$. Given training data $\{(d_n, S_n)\}_{n=1}^N$, where $d_n$ is the input note and $S_n$ is the subset of medications prescribed at discharge time, we learn the parameters $\Theta$ of the DCDPP, mainly the weight and bias parameters of the deep networks, by maximizing the data likelihood
$$\max_{\Theta} \; \mathcal{L}(\{(d_n, S_n)\}_{n=1}^N) = \prod_{n=1}^N p(S_n \mid d_n; \Theta). \qquad (6.4)$$
(A sketch of how Eqs. (6.2)-(6.4) fit together is given below.)

Incorporating Medication Interactions

Medical professionals have accumulated rich knowledge regarding the interactions between medications, and these interactions largely affect how medications are used. Specifically, we consider two types of interactions: antagonism and synergy. An antagonism interaction indicates that, when used together, two medications may produce a negative medical effect; medications with antagonism interactions should be prohibited from being used together. A synergy interaction suggests that two medications are frequently used simultaneously to treat a disease; their co-occurrence brings a positive medical effect and should be encouraged. We aim at incorporating this knowledge to make better predictions. Specifically, we impose a relational regularization over the DCDPP such that medications with a synergy interaction are encouraged to be co-selected and those with an antagonism interaction are penalized for co-selection. This regularization approach is designed according to a property of the DPP, which assigns larger probability mass $p(S)$ to a medication subset $S$ whose medications are more mutually different (Figure 6.1). The difference between two medications $a_i$ and $a_j$ is measured by the kernel function $k(a_i, a_j)$: the smaller $k(a_i, a_j)$ is, the more different $a_i$ and $a_j$ are. To encourage labels $i$ and $j$ to be simultaneously selected into $S$, we encourage $k(a_i, a_j)$ to be small, which increases $p(S)$.
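A minimal sketch of how Eqs. (6.2)-(6.4) fit together is given below. It treats the MCN and MNDN outputs as given arrays and uses plain numpy; in the actual model these quantities are produced by deep networks and Eq. (6.4) is optimized with gradient-based training, which is omitted here. The toy kernel, scores and subset are placeholders.

```python
import numpy as np

def conditional_kernel(K_med, g):
    """Eq. (6.2): L_ij(x) = g(a_i, x) * k(a_i, a_j) * g(a_j, x).
    K_med is the K x K medication-medication kernel (output of the MCN);
    g is the length-K vector of medication-note dependency scores (MNDN)."""
    return g[:, None] * K_med * g[None, :]

def neg_log_likelihood(K_med, g, S):
    """Negative log of Eq. (6.3): -log det(L_S(x)) + log det(L(x) + I).
    Summing this over the training notes and minimizing it is equivalent to
    maximizing the likelihood in Eq. (6.4)."""
    L = conditional_kernel(K_med, g)
    idx = np.array(sorted(S))
    _, logdet_S = np.linalg.slogdet(L[np.ix_(idx, idx)])
    _, logdet_Z = np.linalg.slogdet(L + np.eye(L.shape[0]))
    return -logdet_S + logdet_Z

# Toy instance: K = 5 candidate medications, ground-truth discharge set S = {0, 2, 3}.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))
K_med = B @ B.T + 0.1 * np.eye(5)     # stand-in for the learned MCN kernel (PSD)
g = np.abs(rng.standard_normal(5))    # stand-in for MNDN scores g(a_i, x) > 0
print(neg_log_likelihood(K_med, g, {0, 2, 3}))
```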

Figure 6.1: In a DPP, the probability of a subset $S$ is proportional to $\det(L_S)$, which is the (squared) volume of the parallelepiped formed by the vectors $\{\phi(a_i) \mid i \in S\}$, where $\phi(\cdot)$ is the reproducing kernel feature map. As we increase the inner product between $\phi(a_i)$ and $\phi(a_j)$ (which is essentially $k(a_i, a_j)$), the volume of the parallelepiped decreases and $p(S)$ decreases, which discourages the co-selection of labels $i$ and $j$. On the contrary, decreasing $k(a_i, a_j)$ encourages the two labels to be co-selected.

To discourage simultaneous selection, $k(a_i, a_j)$ is preferred to be large, which decreases $p(S)$. Denoting by $M$ and $C$ the sets of medication pairs possessing antagonistic and synergistic interactions respectively, we define the following relation-regularized DCDPP (RDCDPP) problem:
$$\max_{\Theta} \; \mathcal{L}(\{(d_n, S_n)\}_{n=1}^N) + \lambda \Big( \sum_{(i,j) \in M} k(a_i, a_j) - \sum_{(i,j) \in C} k(a_i, a_j) \Big) \qquad (6.5)$$
In the second term of the objective function, we encourage medication pairs $(i, j)$ with a synergistic interaction to have smaller $k(a_i, a_j)$ and those with an antagonistic interaction to have larger $k(a_i, a_j)$. (A code sketch of this regularized objective is given at the end of this section.)

Evaluation

We evaluate the effectiveness of our model on 8 antihypertensive medications and 25K patient visits. We compare our DDPP model with logistic regression (LR), support vector machine (SVM), random forest (RF), and a convolutional neural network (CNN). Table 6.1 shows the precision (P), recall (R) and F1-score (F) of the different methods for each antihypertensive medication. The last two lines show the macro-averaged scores over all medications. As can be seen, DDPP outperforms the baseline models. The major difference between DDPP and CNN is the determinantal point process used for capturing high-order correlations, which is the main reason that DDPP outperforms CNN.

Proposed: Diversity-Promoting Learning for More Accurate Prediction of Infrequent Medications

As can be seen from Table 6.1, the F1 scores on infrequent medications are much worse than those on frequent medications. A possible reason is that, due to the dominance of frequent medications over infrequent ones, the representation produced by the CNN is biased toward the frequent medications. We resort to diversity-promoting learning to address this issue.
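Referring back to Eq. (6.5), the sketch below shows how the relational regularizer could be added to the DCDPP training loss. The kernel values are treated as a plain matrix here, and the interaction sets M and C are toy examples; in the real model the gradient flows back into the networks that produce k(a_i, a_j).

```python
import numpy as np

def relational_regularizer(K_med, antagonistic, synergistic, lam):
    """Second term of Eq. (6.5): reward large k for antagonistic pairs (to block
    co-selection) and small k for synergistic pairs (to promote co-selection).
    Returned with a sign suitable for minimizing a loss = NLL - lam * (...)."""
    reg = sum(K_med[i, j] for (i, j) in antagonistic) \
        - sum(K_med[i, j] for (i, j) in synergistic)
    return lam * reg

# Toy kernel over 5 medications and example interaction sets (illustrative only).
rng = np.random.default_rng(1)
B = rng.standard_normal((5, 3))
K_med = B @ B.T
M = [(0, 1)]          # antagonistic pair: discourage co-selection
C = [(2, 3), (2, 4)]  # synergistic pairs: encourage co-selection
print(relational_regularizer(K_med, M, C, lam=0.1))
# Training then minimizes: sum_n NLL(S_n | d_n) - relational_regularizer(...).
```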

Table 6.1: Medication-wise precision (P), recall (R) and F1-score (F) for DDPP and the 4 baseline models (CNN, SVM, RF, LR) on the 8 antihypertensive medications (Metoprolol, Furosemide, Lisinopril, Amlodipine, Atenolol, Hctz, Diltiazem, Carvedilol). From top to bottom, medications are shown in descending order of their frequencies. The overall performance is measured using the macro-average.

6.2 Automatic Generation of Text Reports for Medical Images

Medical images, such as radiology and pathology images, are widely used in hospitals and clinics for the diagnosis and treatment of many diseases. The reading and interpretation of medical images are usually conducted by specialized medical professionals, who write textual reports to narrate the findings regarding each area of the body examined in the imaging study, specifically whether each area was found to be normal, abnormal or potentially abnormal. For less-experienced radiologists and pathologists, especially those working in rural areas where the quality of healthcare is relatively low, writing medical-imaging reports is demanding. For experienced radiologists and pathologists, writing imaging reports is tedious and time-consuming. In nations with a large population such as China, a radiologist may need to read hundreds of radiology images per day; typing the findings of each image into the computer takes about 5-10 minutes, which occupies most of their working time. In sum, for both inexperienced and experienced medical professionals, writing imaging reports is unpleasant. This motivates us to investigate whether it is possible to automatically generate medical-imaging reports.

Several challenges need to be addressed. First, a complete diagnostic report is comprised of multiple heterogeneous forms of information, including sentences, paragraphs, and keywords; generating this heterogeneous information in a unified framework is technically demanding. Second, an imaging report usually focuses on narrating the abnormal findings, since they directly indicate diseases and guide treatment; localizing the image regions that contain abnormalities and attaching the right descriptions to them is challenging. Third, the descriptions in imaging reports are usually long, containing multiple sentences or even multiple paragraphs; generating such long text is highly nontrivial.

Figure 6.2: Illustration of the proposed model. MLC denotes a multi-label classification network. Semantic features are the word embeddings of the predicted tags. The bold tags "calcified granuloma" and "granuloma" are attended by the co-attention network.

Methods

A complete diagnostic report for a medical image is comprised of both unstructured descriptions (in the form of sentences and paragraphs) and semi-structured tags (in the form of keyword lists). We propose a multi-task hierarchical model with co-attention for automatically predicting keywords and generating long paragraphs. Given an image, which is divided into regions, we use a CNN to learn visual features for these patches. These visual features are then fed into a multi-label classification (MLC) network to predict the relevant tags. In the tag vocabulary, each tag is represented by a word-embedding vector. Given the predicted tags for a specific image, their word-embedding vectors are retrieved to serve as the semantic features of this image. The visual features and semantic features are then fed into a co-attention model to generate a context vector that simultaneously captures the visual and semantic information of the image. This completes the encoding process. Next, starting from the context vector, the decoding process generates the text descriptions. The description of a medical image usually contains multiple sentences, and each sentence focuses on one specific topic. Our model leverages this compositional structure to generate reports in a hierarchical way: it first generates a sequence of high-level topic vectors representing sentences, then generates a sentence (a sequence of words) from each topic vector. Specifically, the context vector is fed into a sentence LSTM, which unrolls for a few steps, each producing a topic vector. A topic vector represents the semantics of a sentence to be generated. Given a topic vector, the word LSTM takes it as input and generates a sequence of words to form a sentence. The termination of the unrolling process is controlled by the sentence LSTM. (A simplified sketch of this decoding loop is given below.)

Evaluation

We evaluated our method on two datasets, IU X-Ray [1] and PEIR Gross [2], which contain images and reports of radiology and pathology respectively. We compare our method with several state-of-the-art image captioning models: CNN-RNN [93], LRCN [28], Soft ATT [120], and ATT-RK [122]. We also performed an ablation study of our method, comparing with the following ablation baselines: Ours-No-Attention, which does not use any attention mechanism; Ours-Semantic-Only, which only uses semantic attention; and Ours-Visual-Only, which only uses visual attention.
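The hierarchical decoding described above can be summarized in a short sketch. The code below is a simplified PyTorch-style skeleton, not the exact architecture: the co-attention encoder is omitted, the stop-control mechanism, hidden sizes, vocabulary size, unrolling limits and greedy decoding are all simplifying assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    """Sentence LSTM emits topic vectors (and a stop signal); a word LSTM then
    generates one sentence per topic vector. Sizes and vocabulary are illustrative."""
    def __init__(self, ctx_dim=256, hid=256, vocab_size=1000, embed_dim=128):
        super().__init__()
        self.sent_lstm = nn.LSTMCell(ctx_dim, hid)
        self.topic = nn.Linear(hid, ctx_dim)      # topic vector for each sentence
        self.stop = nn.Linear(hid, 1)             # controls termination of unrolling
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.word_lstm = nn.LSTMCell(embed_dim + ctx_dim, hid)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, context, max_sents=6, max_words=15):
        h = c = torch.zeros(context.size(0), self.stop.in_features)
        report = []
        for _ in range(max_sents):
            h, c = self.sent_lstm(context, (h, c))        # one unroll step -> one topic
            topic = self.topic(h)
            report.append(self.decode_sentence(topic, max_words))
            if torch.sigmoid(self.stop(h)).mean() > 0.5:  # sentence LSTM decides to stop
                break
        return report

    def decode_sentence(self, topic, max_words):
        hw = cw = torch.zeros(topic.size(0), self.out.in_features)
        word = torch.zeros(topic.size(0), dtype=torch.long)  # assume index 0 = <start>
        sentence = []
        for _ in range(max_words):
            inp = torch.cat([self.embed(word), topic], dim=1)
            hw, cw = self.word_lstm(inp, (hw, cw))
            word = self.out(hw).argmax(dim=1)                # greedy decoding for brevity
            sentence.append(word)
        return torch.stack(sentence, dim=1)

# Example: a batch of 2 context vectors from the (omitted) CNN + co-attention encoder.
decoder = HierarchicalDecoder()
paragraphs = decoder(torch.randn(2, 256))
print(len(paragraphs), paragraphs[0].shape)
```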

Table 6.2: Performance of paragraph generation on the IU X-Ray dataset (upper part) and single-sentence generation on the PEIR Gross dataset (lower part) for CNN-RNN [93], LRCN [28], Soft ATT [120], ATT-RK [122] and our ablation variants, measured by BLEU-1 through BLEU-4, METEOR, ROUGE and CIDER. BLEU-n denotes the BLEU score that uses up to n-grams.

Quantitative results

We measure the performance of paragraph generation (upper part of Table 6.2) and single-sentence generation (lower part of Table 6.2) using the following metrics: BLEU [74], METEOR [27], ROUGE [63] and CIDER [92]. We make two observations from this table. First, models with a single-layer LSTM decoder (e.g., CNN-RNN [93]) perform much worse than those with a hierarchical LSTM decoder (e.g., Ours-No-Attention). This result is not surprising, since it is well known that a single-layer LSTM cannot effectively model long sequences. Second, employing semantic attention alone (Ours-Semantic-Only) or visual attention alone (Ours-Visual-Only) performs less well than using both (Ours-CoAttention), demonstrating the effectiveness of the proposed co-attention mechanism.

Qualitative results

An illustration of the paragraphs generated by three models (Ours-CoAttention, Ours-No-Attention and Soft-Attention [120]) is shown in Figure 6.3. The underlined sentences are descriptions of abnormalities. In the paragraphs generated by Ours-CoAttention, the first sentence is usually a high-level description of the image, while each of the following sentences describes one region of the image (e.g., lung, heart, etc.). Besides, it is worth noting that the abnormalities detected by the Soft-Attention and Ours-No-Attention models are incorrect. In contrast, the Ours-CoAttention model is able to correctly describe the abnormalities (in the top three images).

Proposed: Diversity-Promoting Learning for More Interpretable Report Generation

To make physicians willing to use this automatic report-generation tool, it is important to make the model interpretable. We apply diversity-promoting regularization to encourage the weight vectors of the hidden units to be distinct for better interpretability.
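As a concrete (and deliberately simple) instantiation of the idea in the last paragraph, the sketch below penalizes pairwise cosine similarity among the weight vectors of a layer's hidden units. The diversity-promoting regularizers actually developed in this thesis (e.g., the uniform-eigenvalue and angle-based regularizers of earlier chapters) are more refined; the layer choice and the coefficient below are placeholders.

```python
import torch
import torch.nn.functional as F

def pairwise_cosine_penalty(W):
    """Mean squared cosine similarity between distinct rows of W (one row per
    hidden unit). Small values mean the units' weight vectors point in
    different directions, i.e., the units capture distinct patterns."""
    Wn = F.normalize(W, dim=1)               # unit-normalize each weight vector
    G = Wn @ Wn.t()                          # cosine-similarity (Gram) matrix
    off_diag = G - torch.diag(torch.diag(G)) # ignore self-similarity
    m = W.size(0)
    return (off_diag ** 2).sum() / (m * (m - 1))

# Example: regularize the input-to-hidden weights of some layer of the report
# generator (the layer choice and the coefficient 0.01 are placeholders).
layer = torch.nn.Linear(512, 256)
task_loss = torch.tensor(0.0)                # stand-in for the report-generation loss
loss = task_loss + 0.01 * pairwise_cosine_penalty(layer.weight)
loss.backward()
```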

Figure 6.3: Exemplar paragraphs generated by the Ours-CoAttention, Ours-No-Attention and Soft-Attention [120] models.

6.3 Automatic ICD Code Filling

The International Classification of Diseases (ICD) is a healthcare classification system maintained by the World Health Organization [73], which provides a hierarchy of diagnostic codes for diseases, disorders, injuries, signs, symptoms, etc. It is widely used for reporting diseases and health conditions, assisting in medical reimbursement decisions, and collecting morbidity and mortality statistics, to name a few applications. While ICD codes are important for making clinical and financial decisions, medical coding, which assigns proper ICD codes to a patient admission, is time-consuming, error-prone and expensive. The cost incurred by coding errors and the financial investment spent on improving coding quality are estimated to be $25 billion per year in the US [30, 54]. To reduce coding errors and cost, we aim at building an ICD coding machine which automatically and accurately translates free-text diagnosis descriptions into ICD codes. To achieve this goal, several technical challenges need to be addressed. First, the diagnosis descriptions written by physicians and the textual descriptions of ICD codes are written in quite different styles, even if they refer to the same disease. In particular, the definitions of ICD codes are formally and precisely worded, while diagnosis descriptions are usually written in an informal and ungrammatical way, with telegraphic phrases, abbreviations, and typos. Second, as stated earlier,
