Modeling Patient Mortality from Clinical Notes


Modeling Patient Mortality from Clinical Notes: Combining Topic Modeling and Ontological Feature Learning with Group Regularization for Text Classification. Jeong Min Lee, Charmgil Hong, and Milos Hauskrecht

Outline Background. Objective. Approach: LDA and Concept Mapping. Data Extraction. Model Building. Experiment. Conclusion.

Background Mortality of patients in the Intensive Care Unit (ICU) is an important index since it reflects a patient's severity. Clinical progress notes, written by physicians and nurses, contain detailed information about a patient's physiology. But the notes are unstructured free text, and building a model with them is challenging since they reside in a high-dimensional vector space.

Objective Make a model that learns to predict (classify) patient mortality from text data. X: Clinical note, represented as a bag-of-words vector. Y: Binary label of whether the patient died or not during a hospital admission.

Words as vectors Words can be represented as vectors. The bag-of-words feature (also called a one-hot vector) represents every word as a d x 1 vector with all 0s and one 1. The 1 is at the index of that word in the corpus vocabulary. d: number of all words in the corpus.

Example of bag-of-words features Let's represent the words in a sentence: The fox jumped over the lazy dog. the: [1,0,0,0,0,0,0] fox: [0,1,0,0,0,0,0] jumped: [0,0,1,0,0,0,0] over: [0,0,0,1,0,0,0] the: [0,0,0,0,1,0,0] lazy: [0,0,0,0,0,1,0] dog: [0,0,0,0,0,0,1]
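A minimal sketch (not from the original slides) of how such one-hot vectors can be built; note that when words are indexed by a vocabulary rather than by sentence position, the two occurrences of "the" share a single vector.

```python
# Build one-hot word vectors and a bag-of-words document vector (illustrative sketch).
import numpy as np

sentence = "The fox jumped over the lazy dog".lower().split()
vocab = sorted(set(sentence))                 # d distinct words in this tiny "corpus"
index = {w: i for i, w in enumerate(vocab)}   # word -> vocabulary index

def one_hot(word):
    """Return a d-dimensional vector with a single 1 at the word's vocabulary index."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

for w in sentence:
    print(w, one_hot(w).astype(int))

# A bag-of-words document vector is the sum of the one-hot vectors of its words.
doc_vector = sum(one_hot(w) for w in sentence)
```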

Lower-dimensional Feature Generation Latent Dirichlet Allocation. Ontological Feature Mapping.

Latent Dirichlet Allocation (LDA)

Generative Model LDA is a generative model. It is best known for finding hidden topics in the documents of a corpus.

Generative (Probabilistic) Model Observed random variables are assumed to be generated from some probability distribution. So knowing the type and the parameters of the probability distribution amounts to knowing the model that generated the observed data.

As a generative model, LDA models: Topics as Dirichlet random variables. Assignment of each word to a topic: multinomial random variable. Documents exhibiting multiple topics: Dirichlet random variable over topics.

In other words, each element in LDA is modeled as follows. Each document = a mixture of corpus-wide topics. Each topic = a distribution over words. Each word = drawn from one of the topics. Figure from D. Blei et al., Latent Dirichlet Allocation (2003).

LDA goal: Infer the underlying topics and their structure from the words and documents of the corpus. Figure from D. Blei et al., Latent Dirichlet Allocation (2003).

LDA on Clinical Notes Number of topics: 50. Model learned with Gibbs sampling.
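A minimal sketch of this step, assuming a document-term count matrix has already been built from the clinical notes. The third-party `lda` Python package implements collapsed Gibbs sampling; the slides do not specify which implementation the authors used, and the file name below is purely illustrative.

```python
# Fit a 50-topic LDA model with collapsed Gibbs sampling (illustrative sketch).
import numpy as np
import lda  # third-party package implementing collapsed Gibbs sampling

# Hypothetical file holding a (num_notes x vocabulary_size) matrix of word counts.
X = np.load("note_term_counts.npy").astype(np.int64)

model = lda.LDA(n_topics=50, n_iter=1500, random_state=0)
model.fit(X)

doc_topics = model.doc_topic_    # per-document topic proportions (the topic features)
topic_words = model.topic_word_  # per-topic word distributions
```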

Concept Feature Generation with Ontology

Concepts in Ontology A concept is a knowledge entity in an ontology. We use a clinical ontology, so concepts are medical entities such as drugs, diseases, lab tests, etc. As the ontology, we used SNOMED CT (you can think of it as a medical dictionary).

Figure from http://data.press.net/

Concept Mapping Process The concept feature consists of UMLS (Unified Medical Language System) concepts matched with the clinical text. The concept-text mapping framework MetaMap is used to match concepts between the clinical text and the SNOMED CT ontology.

Restrict matched concepts to specific semantic types: anatomical structure, clinical drug, diagnostic procedure, disease or syndrome, pharmacologic substance, sign or symptom, etc. Removed concepts that occurred fewer than 20 times. The resulting number of concept features is 118.
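A sketch of the filtering described here, assuming MetaMap output has already been parsed into per-note lists of (concept, semantic type) pairs; the data structure and semantic-type names are illustrative, not the authors' actual pipeline.

```python
# Filter mapped concepts by semantic type and frequency (illustrative sketch).
from collections import Counter

ALLOWED_TYPES = {"Anatomical Structure", "Clinical Drug", "Diagnostic Procedure",
                 "Disease or Syndrome", "Pharmacologic Substance", "Sign or Symptom"}
MIN_COUNT = 20

def build_concept_vocabulary(notes):
    """notes: list of notes, each a list of (concept_id, semantic_type) pairs."""
    counts = Counter(cid for note in notes
                     for cid, stype in note if stype in ALLOWED_TYPES)
    # Keep only concepts that occur at least MIN_COUNT times across the corpus.
    return sorted(cid for cid, c in counts.items() if c >= MIN_COUNT)
```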

Data Extraction MIMIC-III Database, a publicly available, de-identified critical care medical record dataset. It contains medical records of ~46,000 patients who were hospitalized at Beth Israel Deaconess Medical Center between 2001 and 2012.

Data Extraction (cont'd) Labels were created according to whether a patient died during the first hospital admission or not. High class imbalance problem: only 10.3 percent of patients died during the first hospital admission. Subsampled the data by randomly removing negative-class samples (1:2 ratio).
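A sketch of the random negative-class subsampling; the `labels` array and the handling of the 1:2 ratio are illustrative assumptions, not the authors' exact procedure.

```python
# Randomly undersample the negative class to a 1:2 positive:negative ratio (sketch).
import numpy as np

def subsample_indices(labels, ratio=2, seed=0):
    """labels: array of 0/1 mortality labels; returns indices of the kept samples."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    keep_neg = rng.choice(neg, size=min(len(neg), ratio * len(pos)), replace=False)
    return np.sort(np.concatenate([pos, keep_neg]))
```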

Data Extraction (cont'd) We also limited patients to those between the ages of 18 and 99, and notes to the nursing and physician categories. Removed all notes written on the day of death or discharge. The resulting sample consists of 5334 patients.

Experiment Classification on different feature sets

Classification with Regularization We find a compact representation via sparse group regularization on logistic regression. We tried different regularization methods to see their effect on classification. Note that the objective functions are the negative log-likelihood loss with a class encoding of -1 and +1.

L1 Regularization L1 regularization selects features by shrinking the coefficients of certain features to zero while minimizing the loss.
loss = \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + \exp\left(-y_i (x_i^T w + c)\right)\right) + \lambda \|w\|_1, where \|w\|_1 = \sum_j |w_j|
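For illustration, an L1-penalized logistic regression of this form can be fit with scikit-learn; a sketch, assuming train/test splits `X_train`, `y_train`, `X_test`, `y_test` already exist (C plays the role of 1/lambda).

```python
# L1-penalized logistic regression with AUC evaluation (illustrative sketch).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X_train, y_train)  # X_train, y_train assumed to be prepared already

print("nonzero coefficients:", (clf.coef_ != 0).sum())
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```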

Elastic Net Regularization Elastic net is a regularization method using both L1 and L2 penalties.
loss = \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + \exp\left(-y_i (x_i^T w + c)\right)\right) + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2, where \|w\|_2^2 = \sum_j w_j^2 and \|w\|_1 = \sum_j |w_j|
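A corresponding sketch for the elastic-net penalty in scikit-learn; the saga solver supports the combined L1+L2 penalty, and the `l1_ratio` value is an illustrative choice, not the authors' setting.

```python
# Elastic-net penalized logistic regression (illustrative sketch).
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=1.0, max_iter=5000)
clf.fit(X_train, y_train)  # X_train, y_train assumed to be prepared already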

Sparse Group Regularization It penalizes the model parameters w according to a pre-defined structure of features formed as groups, with an L2 penalty on each group in addition to an L1 penalty on each individual feature. So parameters in the same feature set are penalized together with the same amount of penalty.
loss = \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + \exp\left(-y_i (x_i^T w + c)\right)\right) + \lambda_1 \|w\|_1 + \lambda_2 \sum_{j=1}^{G} b_j \|w_{G_j}\|_2
where w_{G_j} denotes the j-th of G non-overlapping groups of features and b_j denotes the weight for group j.
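A sketch of the penalty term itself (not a full solver): the group index lists, the group weights b, and the lambda values below stand in for the word/topic/concept feature blocks and are purely illustrative.

```python
# Sparse group lasso penalty term (illustrative sketch).
import numpy as np

def sparse_group_penalty(w, groups, b, lam1, lam2):
    """w: parameter vector; groups: list of index arrays (one per feature block);
    b: list of per-group weights; lam1, lam2: regularization strengths."""
    l1 = lam1 * np.abs(w).sum()
    group_l2 = lam2 * sum(b_j * np.linalg.norm(w[idx]) for b_j, idx in zip(b, groups))
    return l1 + group_l2
```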

Results

Individual Feature Sets
Feature set | Regularization          | Nonzero feature size | AUC
Word        | (1) No regularization   | 10229                | 0.8304
Word        | (2) L1                  | 353                  | 0.8561
Word        | (3) Elastic Net (L1+L2) | 307                  | 0.8561
Concept     | (1) No regularization   | 118                  | 0.6488
Concept     | (2) L1                  | 52                   | 0.6557
Concept     | (3) Elastic Net (L1+L2) | 50                   | 0.6547
Topic       | (1) No regularization   | 50                   | 0.8720
Topic       | (2) L1                  | 45                   | 0.8725
Topic       | (3) Elastic Net (L1+L2) | 43                   | 0.8727

Mixed Feature Sets

Topic Proportion Changes over Days to Death

Conclusion Features obtained from medical ontologies and topic modeling improve the modeling of patient mortality from clinical notes. This shows that combining semantically enriched concepts from an ontology with probabilistic dimensionality reduction methods has the potential to enhance tasks that were previously done separately. We also observed that regularization improves classification, especially when the feature dimensionality is high, by making the feature representation sparser.

Thanks

BACKUP SLIDES

LDA in more detail

Notation The corpus consists of D documents, and each document is represented as a vector of size N in which the value of the n-th element gives the frequency of the n-th word in the d-th document. D: number of documents, d = 1...D. N: number of words in the corpus vocabulary, n = 1...N. K: number of topics in the corpus, k = 1...K.

We will use the following notation to describe the LDA model. W_{d,n}: observed word. Z_{d,n}: per-word topic assignment. θ_d: per-document topic proportions. β_k: per-corpus topic distributions. α: Dirichlet parameter for θ. η: Dirichlet parameter for β.

Generative Process The generative process of LDA is:
For each document d in the corpus D:
  Choose θ ~ Dir(α)
  For each of the N words w_n:
    Choose a topic z_n ~ Multinomial(θ)
    Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
LDA considers each document as a mixture of corpus-wide topics β via topic proportions θ. Each word w_n in each document is generated according to the topic proportions θ of that document.
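A sketch of this generative process in NumPy; K = 50 matches the clinical-note model, while the vocabulary size, document length, and Dirichlet parameters below are illustrative values only.

```python
# Sample one document from the LDA generative process (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 50, 5000, 200                   # topics, vocabulary size, words per doc
alpha = np.full(K, 0.1)                         # Dirichlet parameter for theta
beta = rng.dirichlet(np.full(V, 0.01), size=K)  # per-topic word distributions beta_k

theta = rng.dirichlet(alpha)                    # per-document topic proportions
words = []
for _ in range(doc_len):
    z = rng.choice(K, p=theta)                  # z_n ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])                # w_n ~ p(w | z_n, beta)
    words.append(w)
```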

Results

Mixed Feature Sets: Topic + Concept
Regularization                                  | Nonzero feature size | AUC
(1) No regularization                           | 168                  | 0.8703
(2) L1                                          | 63                   | 0.8766
(3) Elastic Net (L1+L2)                         | 60                   | 0.8758
(4) Sparse Group Regularization                 | 63                   | 0.8766
(4) Sparse Group Regularization (2nd group 0.5) | 64                   | 0.8776
(4) Sparse Group Regularization (2nd group 2.0) | 63                   | 0.8752

Mixed Feature Sets: Word + Concept
Regularization                                  | Nonzero feature size | AUC
(1) No regularization                           | 10317                | 0.8416
(2) L1                                          | 375                  | 0.8632
(3) Elastic Net (L1+L2)                         | 328                  | 0.8631
(4) Sparse Group Regularization                 | 375                  | 0.8632
(4) Sparse Group Regularization (2nd group 0.5) | 602                  | 0.8704
(4) Sparse Group Regularization (2nd group 2.0) | 1291                 | 0.8669

Mixed Feature Sets: Word + Topic
Regularization                                  | Nonzero feature size | AUC
(1) No regularization                           | 10279                | 0.8378
(2) L1                                          | 255                  | 0.8873
(3) Elastic Net (L1+L2)                         | 206                  | 0.8872
(4) Sparse Group Regularization                 | 255                  | 0.8873
(4) Sparse Group Regularization (2nd group 0.5) | 461                  | 0.8859
(4) Sparse Group Regularization (2nd group 2.0) | 6889                 | 0.8875

Mixed Feature Sets: Word + Topic + Concept
Regularization                                  | Nonzero feature size | AUC
(1) No regularization                           | 10397                | 0.8498
(2) L1                                          | 278                  | 0.8925
(3) Elastic Net (L1+L2)                         | 222                  | 0.8919
(4) Sparse Group Regularization                 | 278                  | 0.8925
(4) Sparse Group Regularization (2nd group 0.5) | 491                  | 0.8916
(4) Sparse Group Regularization (2nd group 2.0) | 285                  | 0.8927