FACTORIZATION MACHINES AS A TOOL FOR HEALTHCARE: CASE STUDY ON TYPE 2 DIABETES DETECTION


SunLab: Enlighten the World
FACTORIZATION MACHINES AS A TOOL FOR HEALTHCARE: CASE STUDY ON TYPE 2 DIABETES DETECTION
Ioakeim (Kimis) Perros and Jimeng Sun (perros@gatech.edu, jsun@cc.gatech.edu)
Computational Science and Engineering, Georgia Institute of Technology
Artificial Intelligence in Medicine 2015, Workshop on Matrix Computations for Biomedical Informatics

Talk Outline
THE PROBLEM: PREDICT TYPE 2 DIABETES USING ELECTRONIC HEALTH RECORDS
THE METHOD: FACTORIZATION MACHINES

Healthcare Data Challenges
HETEROGENEOUS, HIGH-DIMENSIONAL (diagnoses, medications, genomic data)
NOISY
SPARSE

Type 2 Diabetes Mellitus
CURRENTLY: MORE THAN 3M CASES PER YEAR IN THE US
EARLY DETECTION AND TREATMENT DECREASES THE RISK OF DEVELOPING DIABETES COMPLICATIONS
TARGET OF THIS WORK: ACCURATE CLASSIFICATION OF DIABETIC PATIENTS, GIVEN THEIR CLINICAL STATUS

Model learning from sparse data

LINEAR REGRESSION:
y(x) = w_0 + \sum_{i=1}^{p} w_i x_i, \quad w_0 \in \mathbb{R},\; w \in \mathbb{R}^{p}

We are often interested in feature interactions (e.g., how diagnostic conditions JOINTLY affect the target prediction).

POLYNOMIAL REGRESSION (d = 2):
y(x) = w_0 + \underbrace{\sum_{i=1}^{p} w_i x_i}_{\text{linear terms}} + \underbrace{\sum_{i=1}^{p} \sum_{j \geq i} W_{i,j}\, x_i x_j}_{\text{non-linear terms}}, \quad w_0 \in \mathbb{R},\; w \in \mathbb{R}^{p},\; W \in \mathbb{R}^{p \times p}

Two problems:
1) VERY LARGE NUMBER OF PARAMETERS: e.g., 1000 features => more than 1 million parameters!
2) SPARSE DATASET (=> few available features per patient): each interaction parameter W_{i,j} must be estimated INDEPENDENTLY from limited data.

Similar issues arise with the polynomial-kernel SVM.
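To make the blow-up concrete, here is a minimal NumPy sketch (mine, not from the talk) of the degree-2 polynomial model equation; with p = 1000 the interaction matrix W alone holds p^2 = 1,000,000 entries.

```python
import numpy as np

def poly2_predict(x, w0, w, W):
    """Degree-2 polynomial regression:
    y(x) = w0 + sum_i w_i x_i + sum_{j>=i} W_ij x_i x_j.
    Needs O(p^2) parameters and O(p^2) time per prediction."""
    # np.triu zeroes the j < i entries, matching the j >= i double sum.
    return w0 + w @ x + x @ (np.triu(W) @ x)

p = 1000
rng = np.random.default_rng(0)
x, w, W = rng.random(p), rng.random(p), rng.random((p, p))
print(poly2_predict(x, 0.0, w, W))
print(f"interaction parameters: {p * p:,}")  # 1,000,000
```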

Factorization Machines [Rendle 2010, 2012]

Very successful in predicting your next movie ratings!

FACTORIZATION MACHINES (d = 2):
y(x) = w_0 + \sum_{i=1}^{p} w_i x_i + \sum_{i=1}^{p} \sum_{j > i} \langle v_i, v_j \rangle\, x_i x_j, \quad w_0 \in \mathbb{R},\; w \in \mathbb{R}^{p},\; V \in \mathbb{R}^{p \times k}

Represent each feature via a vector of length k => low-rank matrix V.
The interaction between features i and j is expressed through <v_i, v_j>.
Interaction weights are now dependent: <v_i, v_j> and <v_i, v_l> share v_i.
Thus, estimating one interaction aids in estimating related interactions.

The flexible form allows LINEAR computation of the model equation in both k and p:
y(x) = w_0 + \sum_{i=1}^{p} w_i x_i + \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{p} V_{i,f}\, x_i \right)^{2} - \sum_{i=1}^{p} V_{i,f}^{2}\, x_i^{2} \right]

By construction: #params (p \cdot k) << p^2 of Polynomial Regression!
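The linear-time form comes from the identity \sum_{j>i} \langle v_i, v_j \rangle x_i x_j = \frac{1}{2} \sum_{f} \left[ \left( \sum_i V_{i,f} x_i \right)^2 - \sum_i V_{i,f}^2 x_i^2 \right]. A minimal NumPy sketch of this O(k·p) model equation (my own illustration, not the authors' libFM code):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Degree-2 Factorization Machine via Rendle's O(k*p)
    reformulation. V has shape (p, k)."""
    s = V.T @ x                  # s[f]  = sum_i V[i,f] * x[i]
    s2 = (V ** 2).T @ (x ** 2)   # s2[f] = sum_i V[i,f]^2 * x[i]^2
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s2)

p, k = 1000, 4
rng = np.random.default_rng(0)
x = np.zeros(p)
x[rng.choice(p, size=7, replace=False)] = 1.0  # sparse patient vector
print(fm_predict(x, 0.0, rng.normal(size=p), rng.normal(size=(p, k))))
```

With sparse inputs, only the non-zero entries of x contribute to the inner sums, so a sparse implementation costs O(k · nnz(x)) per patient.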

Experiments
Dataset released by Practice Fusion; task: diabetes classification.
9,948 patients, 1,904 of whom are diabetic.
702 features: diagnoses (first 3 digits of the ICD-9 code), sex, year of birth.
Sparse dataset: 1% non-zero values.
libFM package for Factorization Machines, with MCMC as the learning method (requires the fewest hyperparameters).
Competing methods (Matlab built-in): SVM with Gaussian kernel, Random Forests.
All parameters selected via cross-validation.
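libFM consumes sparse data in the libSVM text format ("label index:value ..."). The sketch below shows how a patient record of this kind could be one-hot encoded into that format; the feature layout and helper names are my own illustrative assumptions, not the paper's exact encoding.

```python
# Hypothetical preprocessing sketch: encode a patient's sparse clinical
# features into a libSVM-style "label idx:val" line for libFM.
DIAG_CODES = sorted({"250", "272", "401"})   # first-3-digit ICD-9 codes seen in training
DIAG_INDEX = {c: i for i, c in enumerate(DIAG_CODES)}

def encode_patient(diagnoses, sex, birth_year, label):
    # One-hot diagnosis indicators, then sex flag and scaled birth year.
    feats = {DIAG_INDEX[c]: 1.0 for c in diagnoses if c in DIAG_INDEX}
    base = len(DIAG_INDEX)
    feats[base] = 1.0 if sex == "M" else 0.0
    feats[base + 1] = (birth_year - 1900) / 100.0
    cells = " ".join(f"{i}:{v}" for i, v in sorted(feats.items()) if v != 0.0)
    return f"{label} {cells}"

print(encode_patient(["250", "401"], "F", 1956, 1))
# -> "1 0:1.0 2:1.0 4:0.56"
```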

Experiments - Prediction Accuracy

Method                  | F1-Score | Accuracy | Precision | Recall | AUC
Factorization Machines  | 0.3302   | 0.8187   | 0.5656    | 0.2334 | 0.8048
SVM                     | 0.2046   | 0.8125   | 0.5438    | 0.1263 | 0.7598
Random Forests          | 0.2069   | 0.8178   | 0.6232    | 0.1242 | 0.8048

Results on detecting Type 2 Diabetes Mellitus, 5-fold CV, low FM rank k = 4.
FMs outperform the state-of-the-art classifiers in terms of F1-Score, and match the best alternative on Accuracy and AUC.
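As a sanity check on the table, F1 is the harmonic mean of precision and recall; plugging in the FM row:

F_1 = \frac{2 P R}{P + R} = \frac{2 \cdot 0.5656 \cdot 0.2334}{0.5656 + 0.2334} \approx 0.3304

consistent with the reported 0.3302 up to rounding.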

Experiments - Selection of model dimensionality
[Figure: Average precision of FM with increasing model complexity (varying k = {1, 2, 4, 8, 16, 32, 64}); average precision plotted in the 0.51 to 0.57 range against model dimensionality.]

Experiments - Scalability to the model dimensionality
[Figure: Effect of increasing model complexity (varying k = {1, 2, 4, 8, 16, 32, 64}) on the FM training time; average time per fold up to about 35 seconds.]

Experiments - Scalability to the sample size
[Figure: Effect of increasing dataset size (1,000 to 8,000 training samples) on the FM training time (fixed k = 4); training time up to about 3 seconds.]

Concluding remarks - Future work
FACTORIZATION MACHINES: an efficient model-learning method for sparse EHR datasets.
HEALTHCARE DATA ARE INHERENTLY SPARSE: a patient must visit a physician for a diagnosis to be recorded.
NEXT: incorporate more data sources; take domain knowledge into account.
Ioakeim (Kimis) Perros and Jimeng Sun (perros@gatech.edu, jsun@cc.gatech.edu)
Computational Science and Engineering, Georgia Institute of Technology
SunLab: Enlighten the World