lcda: Local Classification of Discrete Data by Latent Class Models

Michael Bücker (buecker@statistik.tu-dortmund.de)

July 9, 2009

Introduction

- Common global classification methods may be inefficient when groups are heterogeneous
- Need for more flexible, local models
- Continuous models that allow for subclasses:
  - Mixture Discriminant Analysis (MDA): assumes class-conditional mixtures of (multivariate) normals
  - Common Components (Titsias and Likas 2001): implies a mixture of normals with components shared across classes
- In this talk: discrete counterparts based on Latent Class Models (see Lazarsfeld and Henry 1968)
- Implemented in the R package lcda
- Application to SNP data

Local structures

[Figure: illustration of local (subclass) structures within the classes]

Mixture Discriminant Analysis and Common Components

Class-conditional density (MDA):

f(x \mid Z = k) = f_k(x) = \sum_{m=1}^{M_k} w_{mk} \, \phi(x; \mu_{mk}, \Sigma)

Class-conditional density of the Common Components Model (Titsias and Likas 2001):

P(X = x \mid Z = k) = f_k(x) = \sum_{m=1}^{M} w_{mk} \, \phi(x; \mu_m, \Sigma)

Posterior based on Bayes' rule:

P(Z = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}
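The posterior is just Bayes' rule applied to the class priors and class-conditional mixture densities. A minimal R sketch of this step (the helper name and its inputs are illustrative, not part of the lcda package):

## Illustrative helper (not part of lcda): posterior P(Z = k | X = x) from
## class priors pi_k and class-conditional densities f_k(x)
class_posterior <- function(x, priors, class_densities) {
  fx     <- vapply(class_densities, function(f) f(x), numeric(1))  # f_k(x), k = 1, ..., K
  unnorm <- priors * fx                                            # pi_k * f_k(x)
  unnorm / sum(unnorm)                                             # normalise over all K classes
}

An observation x would then be assigned to the class with the largest posterior, e.g. via which.max().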

Latent Class Model

- Latent (unobservable) variable Y with categorical outcomes in \{1, \dots, M\} with probabilities P(Y = m) = w_m
- Manifest (observable) variables X_1, \dots, X_D, where X_d has outcomes in \{1, \dots, R_d\} with probabilities P(X_d = r \mid Y = m) = \theta_{mdr}
- Define X_{dr} = 1 if X_d = r and X_{dr} = 0 otherwise, and assume stochastic independence of the manifest variables conditional on Y; then the conditional probability mass function is

  f(x \mid m) = \prod_{d=1}^{D} \prod_{r=1}^{R_d} \theta_{mdr}^{x_{dr}}

- The unconditional probability mass function of the manifest variables is

  f(x) = \sum_{m=1}^{M} w_m \prod_{d=1}^{D} \prod_{r=1}^{R_d} \theta_{mdr}^{x_{dr}}
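To make the product formulas concrete, here is a toy R sketch that evaluates f(x | m) and f(x) for an LCM with M = 2 latent classes and D = 2 binary manifest variables; all parameter values are purely illustrative.

## Toy LCM: M = 2 latent classes, D = 2 manifest variables with R_d = 2 outcomes each
w     <- c(0.4, 0.6)                                      # w_m = P(Y = m)
theta <- list(                                            # theta[[m]][d, r] = P(X_d = r | Y = m)
  matrix(c(0.8, 0.2, 0.3, 0.7), nrow = 2, byrow = TRUE),
  matrix(c(0.1, 0.9, 0.6, 0.4), nrow = 2, byrow = TRUE)
)

## f(x | m): product over the variables under local independence; x holds the observed outcome indices
f_cond <- function(x, m) prod(theta[[m]][cbind(seq_along(x), x)])

## f(x) = sum_m w_m * f(x | m)
f_marg <- function(x) sum(vapply(seq_along(w), function(m) w[m] * f_cond(x, m), numeric(1)))

f_marg(c(1, 2))   # probability of observing X_1 = 1 and X_2 = 2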

Identifiability

Proposition 1. The LCM

f(x) = \sum_{m=1}^{M} w_m \prod_{d=1}^{D} \prod_{r=1}^{R_d} \theta_{mdr}^{x_{dr}}

is not identifiable.

Proof.
- The LCM is a finite mixture of products of multinomial distributions.
- Each mixture component f(x \mid m) is the product of M(1, \theta_{md1}, \dots, \theta_{mdR_d})-distributed random variables.
- Mixtures of M multinomials M(N, \theta_1, \dots, \theta_p) are identifiable iff N \ge 2M - 1 (Elmore and Wang 2003).
- Mixtures of products of marginal distributions are identifiable if mixtures of the marginal distributions are identifiable (Teicher 1967).
- Here each marginal is a multinomial with N = 1 trial, so the condition N \ge 2M - 1 fails for M \ge 2; hence the LCM is not identifiable.

Estimation of the LCM

Estimation by the EM algorithm:

E step: determination of the conditional expectation of Y given X = x_n,

\tau_{mn} = \frac{w_m f(x_n \mid m)}{f(x_n)}

M step: maximization of the log-likelihood and estimation of

w_m = \frac{1}{N} \sum_{n=1}^{N} \tau_{mn} \quad \text{and} \quad \theta_{mdr} = \frac{1}{N w_m} \sum_{n=1}^{N} \tau_{mn} x_{ndr}
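A compact R sketch of these EM updates (an illustrative re-implementation, not the code of the lcda package); it assumes X is an N x D matrix of outcome indices and that all variables share the same number of outcomes R.

## Illustrative EM for the LCM; X: N x D matrix of outcome indices, M: number of latent classes
lcm_em <- function(X, M, R = max(X), maxiter = 100, tol = 1e-8) {
  N <- nrow(X); D <- ncol(X)
  w     <- rep(1 / M, M)
  theta <- lapply(seq_len(M), function(m)          # random start; every row sums to 1
    t(apply(matrix(runif(D * R), D, R), 1, function(p) p / sum(p))))
  f_cond <- function(n, m) prod(theta[[m]][cbind(seq_len(D), X[n, ])])   # f(x_n | m)
  loglik_old <- -Inf
  for (iter in seq_len(maxiter)) {
    ## E step: tau_mn = w_m f(x_n | m) / f(x_n)
    fc  <- sapply(seq_len(M), function(m) sapply(seq_len(N), function(n) f_cond(n, m)))
    num <- sweep(fc, 2, w, "*")                    # w_m * f(x_n | m), an N x M matrix
    tau <- num / rowSums(num)
    ## M step: w_m = (1/N) sum_n tau_mn ;  theta_mdr = (1/(N w_m)) sum_n tau_mn x_ndr
    w <- colMeans(tau)
    for (m in seq_len(M)) for (d in seq_len(D)) for (r in seq_len(R))
      theta[[m]][d, r] <- sum(tau[, m] * (X[, d] == r)) / (N * w[m])
    loglik <- sum(log(rowSums(num)))               # log-likelihood at the parameters used in the E step
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(w = w, theta = theta, posterior = tau, loglik = loglik)
}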

Model selection criteria

Information criteria:

AIC = -2 \log L(w, \theta \mid x) + 2\eta
BIC = -2 \log L(w, \theta \mid x) + \eta \log N

where \eta = M \left( \sum_{d=1}^{D} R_d - D + 1 \right) - 1 (number of free parameters)

Goodness-of-fit test statistics (predicted vs. observed frequencies):
- Pearson's \chi^2
- likelihood-ratio \chi^2
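A small R sketch of the parameter count \eta and both criteria, assuming the maximised log-likelihood logL is available (for example from an EM fit as sketched above):

## eta = M * (sum(R_d) - D + 1) - 1 free parameters; R is the vector of outcome counts R_d
lcm_ic <- function(logL, M, R, N) {
  eta <- M * (sum(R) - length(R) + 1) - 1
  c(n_par = eta,
    AIC   = -2 * logL + 2 * eta,
    BIC   = -2 * logL + eta * log(N))
}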

Local Classification of Discrete Data

Two ways to use the LCM for local classification:
- class-conditional mixtures (as in MDA)
- common components

Class-conditional mixtures:

P(X = x \mid Z = k) = f_k(x) = \sum_{m=1}^{M_k} w_{mk} \prod_{d=1}^{D} \prod_{r=1}^{R_d} \theta_{mkdr}^{x_{kdr}}

Common components:

P(X = x \mid Z = k) = f_k(x) = \sum_{m=1}^{M} w_{mk} \prod_{d=1}^{D} \prod_{r=1}^{R_d} \theta_{mdr}^{x_{dr}}

Estimation of a common components model (option 1)

Let \pi_k be the class prior; then

P(X = x) = \sum_{k=1}^{K} \pi_k \sum_{m=1}^{M} w_{mk} \prod_{d=1}^{D} \prod_{r=1}^{R_d} \theta_{mdr}^{x_{dr}}
         = \sum_{m=1}^{M} w_m \prod_{d=1}^{D} \prod_{r=1}^{R_d} \theta_{mdr}^{x_{dr}}

since w_m := P(m) = \sum_{k=1}^{K} P(k) P(m \mid k) = \sum_{k=1}^{K} \pi_k w_{mk}.

This is an ordinary (global) Latent Class Model. Hence, estimate a global Latent Class Model and determine the parameters w_{mk} of the common components model by

\hat{w}_{mk} = \frac{1}{N_k} \sum_{i=1}^{N_k} \hat{P}(Y = m \mid Z = k, X = x_i)
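The last formula simply averages the estimated subclass posteriors within each class. A short R sketch with hypothetical inputs: post is the N x M matrix of fitted posteriors P(Y = m | X = x_i) from the global LCM, and grouping is the factor of class labels.

## w_mk estimates: average subclass posterior within class k (returns a K x M matrix)
w_mk_hat <- function(post, grouping) {
  t(sapply(levels(grouping), function(k) colMeans(post[grouping == k, , drop = FALSE])))
}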

Estimation of a common components model (option 2)

E step: determination of the conditional expectation

\tau_{mkn} = \frac{w_{mk} f(x_n \mid m)}{f(x_n)}

M step: maximization of the log-likelihood and estimation of

w_{mk} = \frac{1}{N_k} \sum_{n=1}^{N_k} \tau_{mkn} \quad \text{and} \quad \theta_{mdr} = \sum_{k=1}^{K} \frac{1}{N_k w_{mk}} \sum_{n=1}^{N_k} \tau_{mkn} x_{ndr}

Classification capability in Common Components Models

- Measure of the ability to separate the classes adequately
- Impurity measures, handling the subgroups like nodes of a decision tree

Standardized mean entropy:

H = -\sum_{m=1}^{M} w_m \sum_{k=1}^{K} P(k \mid m) \log_K P(k \mid m)

Mean Gini impurity:

G = \sum_{m=1}^{M} w_m \left[ 1 - \sum_{k=1}^{K} P(k \mid m)^2 \right]
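Both measures are straightforward to compute from the fitted weights. A minimal R sketch, assuming w holds the subclass weights w_m and Pkm is an M x K matrix of P(k | m) with strictly positive entries:

## Standardised mean entropy H and mean Gini impurity G of a common components model
capability <- function(w, Pkm) {
  K <- ncol(Pkm)
  H <- -sum(w * rowSums(Pkm * log(Pkm, base = K)))   # 0 for pure subgroups, 1 for uniform P(k | m)
  G <-  sum(w * (1 - rowSums(Pkm^2)))                # 0 for pure subgroups
  c(entropy = H, gini = G)
}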

Implementation in R

Package: lcda (requires poLCA, scatterplot3d and MASS)

Main functions: lcda, cclcda, cclcda2

Syntax like lda (MASS), including a predict method.

Example:

lcda(x, ...)

## Default S3 method:
lcda(x, grouping = NULL, prior = NULL, probs.start = NULL, nrep = 1,
     m = 3, maxiter = 1000, tol = 1e-10, subset, na.rm = FALSE, ...)
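A usage sketch following the lda()-like interface described above; the data objects (train_x, train_classes, test_x) are hypothetical placeholders, and the exact components of the returned objects may differ.

library(lcda)

## fit a latent class discriminant model with m = 3 subclasses per class
fit  <- lcda(x = train_x, grouping = train_classes, m = 3)

## prediction is assumed to work analogously to predict.lda() from MASS
pred <- predict(fit, newdata = test_x)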

Application: simulation study

- Intention: discrete MDA can be seen as a localized Naive Bayes; it assumes local independence instead of global independence
- Data are simulated from the discrete MDA model, with and without existing subgroups; in the latter case the probabilities \theta_{mkdr} are defined such that no subgroups are present
- In the case of existing subgroups, discrete MDA classifies more adequately than Naive Bayes; otherwise discrete MDA and Naive Bayes lead to the same decision

Application: SNP data

- GENICA study: aims at identifying genetic and gene-environment associated breast cancer risks
- 1166 observations (605 controls and 561 cases) of 68 SNP variables and 6 categorical epidemiological variables
- Application of the presented local classification methods
- Comparison to the classification results of Schiffner et al. (2009) on the same data set, obtained with localized logistic regression, CART, random forests, logic regression, and logistic regression

Results: SNP data

Table 1: Tenfold cross-validated error rates of the presented methods (number of subclasses in parentheses)

method         10-fold CV error (sd)
lcda (10/10)   0.220 (0.030)
cclcda (4)     0.345 (0.056)
cclcda2 (10)   0.471 (0.049)

Table 2: Tenfold cross-validated error rates as reported in Schiffner et al. (2009)

method                          10-fold CV error
localized logistic regression   0.367
CART                            0.379
random forests                  0.382
logic regression                0.385
logistic regression             0.366

Conclusion

- Three models based on Latent Class Analysis that provide a flexible approach to local classification
- The models can handle missing values without imputation
- Discrete MDA can be seen as a localized version of the Naive Bayes method
- Further research: extend the methods to data of mixed type by assuming normality of the continuous variables

References

R. Elmore and S. Wang. Identifiability and estimation in finite mixture models with multinomial components. Technical Report 03-04, Department of Statistics, Pennsylvania State University, 2003.

P. F. Lazarsfeld and N. W. Henry. Latent Structure Analysis. Houghton Mifflin, Boston, 1968.

J. Schiffner, G. Szepannek, Th. Monthé, and C. Weihs. Localized logistic regression for categorical influential factors. To appear in A. Fink, B. Lausen, W. Seidel and A. Ultsch, editors, Advances in Data Analysis, Data Handling and Business Intelligence. Springer-Verlag, Heidelberg-Berlin, 2009.

H. Teicher. Identifiability of mixtures of product measures. The Annals of Mathematical Statistics, 38:1300-1302, 1967.

M. K. Titsias and A. C. Likas. Shared kernel models for class conditional density estimation. IEEE Transactions on Neural Networks, 12:987-997, 2001.