Contingency Table Analysis via Matrix Factorization

Kumer Pial Das (1), Jay Powell (2), Myron Katzoff (3), S. Stanley Young (4)
(1) Department of Mathematics, Lamar University, TX
(2) Better Schooling Systems, Pittsburgh, PA
(3) Center for Disease Control (retired)
(4) National Institute of Statistical Sciences
May 10, 2013
SAMSI Transitional Workshop

One of the goals of the OCER working group
"Singular value decompositions can be used to simultaneously cluster the rows (patients) and columns (variables) of a two-way table. New research indicates that non-negative matrix factorization, NMF, has distinct advantages in the identification of subgroups with distinct etiologies (Lee and Seung, 1999). Scores computed via NMF might be better clustering variables for Local Control analysis."

Introduction
Contingency tables of numeric data are often analyzed with dimension reduction methods such as the singular value decomposition (SVD) and principal component analysis (PCA). These analyses produce score and loading matrices that represent the rows and the columns of the original table. The matrices can be used both for prediction and to gain a structural understanding of the data.
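
As a minimal sketch of what "score and loading matrices" mean here, the snippet below applies NumPy's SVD to a small table. The 4 x 3 table is made up for illustration and is not data from the presentation; the names `scores` and `loadings` are one common convention, not necessarily the authors'.

```python
import numpy as np

# Hypothetical 4 x 3 table of counts, for illustration only.
table = np.array([
    [12.0, 5.0, 3.0],
    [10.0, 6.0, 4.0],
    [2.0, 9.0, 11.0],
    [1.0, 8.0, 13.0],
])

# Thin SVD: table = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(table, full_matrices=False)

k = 2                        # number of components kept
scores = U[:, :k] * s[:k]    # one row of scores per table row
loadings = Vt[:k, :].T       # one row of loadings per table column

# Rank-k reconstruction from scores and loadings.
approx = scores @ loadings.T
rel_error = np.linalg.norm(table - approx) / np.linalg.norm(table)
print(scores.shape, loadings.shape, rel_error)
```

Note that SVD scores and loadings can contain negative entries even when every entry of the table is non-negative, which is part of the motivation for NMF.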

Extracting factors from a contingency table
I. J. Good (1969) proposed using the SVD to extract factors from a contingency table. We think non-negative matrix factorization (NMF) will likely produce better results and interpretation.

Non-negative Matrix Factorization
In some tables the data entries are necessarily non-negative, so the matrix factors meant to represent them should arguably also contain only non-negative elements. We describe here the use of NMF, a parts-based decomposition algorithm that can reduce the dimension of the data.

Non-negative Matrix Factorization
Let A be an n × p non-negative matrix and k > 0 an integer. NMF consists of finding an approximation A ≈ WH, where W and H are n × k and k × p non-negative matrices, respectively. The columns of W are the underlying basis vectors. Column j of H gives the weights with which the basis vectors are combined to approximate column j of A.
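
The approximation A ≈ WH can be sketched with the multiplicative update rules of Lee and Seung for the squared-error objective. This is an illustrative assumption: the presentation does not say which NMF algorithm was used, and the function name, iteration count, and test matrix below are hypothetical.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate a non-negative A (n x p) by W (n x k) @ H (k x p)."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, p)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio,
        # so W and H stay non-negative throughout.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical non-negative matrix of exact rank 2 (not data from the talk).
rng = np.random.default_rng(1)
A = rng.random((6, 2)) @ rng.random((2, 5))

W, H = nmf_multiplicative(A, k=2)
rel_error = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
print(W.shape, H.shape, rel_error)
```

The factorization is not unique, and different random starts can give different W and H, which is why stability across runs matters when interpreting the factors.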

In practice, the factorization rank k is often chosen such that k << min(n, p). In general, k can be bounded by (n + p)k < np, so that W and H together contain fewer entries than A. The objective behind this choice is to summarize and split the information contained in A into k factors: the columns of W. Depending on the application field, these factors are given different names: basis images, metagenes, source signals. We'll use the term source signals in this presentation.
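
As a small worked example of the bound (n + p)k < np, the sketch below computes the largest rank that still compresses the data. The helper name is hypothetical; the dimensions 2293 × 160 are those of the education data described later in the presentation.

```python
def max_compressing_rank(n, p):
    """Largest integer k with (n + p) * k < n * p."""
    k = (n * p) // (n + p)
    if k * (n + p) == n * p:   # the inequality is strict
        k -= 1
    return k

n, p = 2293, 160
k_max = max_compressing_rank(n, p)
print(k_max)                    # 149
print((n + p) * k_max, n * p)   # 365497 366880
```

The ranks actually explored in the presentation (k = 2 to 10) are far below this upper limit, in line with k << min(n, p).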

Objective
We study the use of PCA, SVD, and NMF to reduce the dimensionality of count data presented in a contingency table. Our primary goal is to remove noise and uncertainty. In theory, NMF can also make the factor matrices easier to interpret. As the rank k increases, the method uncovers substructures whose robustness can be evaluated by a cophenetic correlation coefficient. These substructures may also give evidence of nested subtypes. Thus, NMF can reveal hierarchical structure when it exists but does not force such structure on the data.

Cophenetic Correlation Coefficient
The cophenetic correlation coefficient is based on the consensus matrix (i.e. the average of the connectivity matrices) and was proposed by Brunet et al. (2004) to measure the stability of the clusters obtained from NMF. It is defined as the Pearson correlation between two sets of distances: the sample distances induced by the consensus matrix, seen as a similarity matrix, and the cophenetic distances from a hierarchical clustering based on those same distances. Average linkage is used by default.
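
The definition above can be sketched with SciPy's hierarchical clustering routines. Taking the distance as 1 minus the consensus value is an assumption consistent with treating the consensus matrix as a similarity matrix; the function name and both example matrices are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import squareform

def cophenetic_coefficient(consensus):
    """Pearson correlation between consensus-induced distances and
    cophenetic distances from average-linkage clustering."""
    distance = 1.0 - np.asarray(consensus, dtype=float)
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)  # upper triangle as a vector
    tree = linkage(condensed, method="average")
    coefficient, _ = cophenet(tree, condensed)
    return float(coefficient)

# Hypothetical consensus matrices for 4 samples.
stable = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
noisy = np.array([
    [1.0, 0.8, 0.3, 0.2],
    [0.8, 1.0, 0.4, 0.3],
    [0.3, 0.4, 1.0, 0.7],
    [0.2, 0.3, 0.7, 1.0],
])
c_stable = cophenetic_coefficient(stable)
c_noisy = cophenetic_coefficient(noisy)
print(c_stable, c_noisy)
```

When every run produces the same two clusters, as in `stable`, the two sets of distances coincide and the coefficient equals 1; disagreement between runs pulls it below 1.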

We may consider a data set consisting of the expression levels of M genes in N samples. For gene expression studies, M is typically in the thousands and N is typically less than 100. The data can be represented by an expression matrix A of size M × N. One possible goal is to find a small number of metagenes, each defined as a positive linear combination of the M genes. We can then approximate the gene expression pattern of each sample as a positive linear combination of these metagenes.

The National Institute on Aging (NIA) and NIH are conducting research by collecting data from 29 NIA-funded Alzheimer's Disease Centers. There are approximately 25,000 subjects. The data set includes more than 700 variables, representing demographics, behavioral status, cognitive testing, medical history, family history, clinical impressions, and diagnoses.

2013 Geoffrey Beene Global NeuroDiscovery Challenge
The Foundation for the National Institutes of Health (FNIH), in association with the Geoffrey Beene Foundation Alzheimer's Initiative, is conducting a study to identify male/female differences in the early cognitive decline that precedes Alzheimer's disease. The 2013 Geoffrey Beene Global NeuroDiscovery Challenge calls for original and innovative hypotheses on gender differences in pathology and neurodegenerative decline leading to Alzheimer's disease.

2013 Geoffrey Beene Global NeuroDiscovery Challenge
The winning submission(s) will receive $100,000 in prize awards. For more information, visit www.geoffreybeenechallenge.org.

Description of the data set
J. Powell and his colleagues designed various education studies. Most of these studies originated with an observation made in the early 1960s that some "wrong" answers given on multiple tests appeared to reflect systematic selection by test subjects. One of these studies used a multiple-choice test (known as "the Proverbs Test") that contains 40 items, each with 4 alternatives. The test was given twice in each year. We have added another alternative for missing data.

[Figure: Netherlandish Proverbs by P. Bruegel the Elder, 1559]

Q32: Don't cast pearls before swine
1. Put your efforts where they are appreciated
2. Don't give pearls to fools
3. Don't be wasteful
4. Don't always put yourself before everybody

Age Group      1     2     3     4
    1         39    42    39    51
    2         33    37    50    54
    3         56    30    43    44
    4         57    24    38    56
    5         50    15    56    56
    6         66    10    35    62
    7         74    12    40    48
    8         89     4    24    56
    9         99     3    27    46
   10         85     9    29    51
   11         89     4    28    52
   12        101     2    22    50
   13        121     1    23    31
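
The table can be explored numerically with a short sketch. It assumes, since the slide carries no caption, that rows are the 13 age groups and columns are counts of students choosing alternatives 1 to 4 of Q32. The Pearson chi-square statistic shown is the standard test of independence, not necessarily the partitioning procedure the authors applied.

```python
import numpy as np

# Counts copied from the table above: rows are age groups 1 to 13,
# columns are alternatives 1 to 4 (assumed to refer to Q32).
counts = np.array([
    [39, 42, 39, 51],
    [33, 37, 50, 54],
    [56, 30, 43, 44],
    [57, 24, 38, 56],
    [50, 15, 56, 56],
    [66, 10, 35, 62],
    [74, 12, 40, 48],
    [89, 4, 24, 56],
    [99, 3, 27, 46],
    [85, 9, 29, 51],
    [89, 4, 28, 52],
    [101, 2, 22, 50],
    [121, 1, 23, 31],
], dtype=float)

row_totals = counts.sum(axis=1, keepdims=True)
col_totals = counts.sum(axis=0, keepdims=True)

# Share of each alternative within each age group (rows sum to 1).
proportions = counts / row_totals

# Pearson chi-square statistic for independence of age group and answer.
expected = row_totals @ col_totals / counts.sum()
chi_square = float(((counts - expected) ** 2 / expected).sum())
dof = (counts.shape[0] - 1) * (counts.shape[1] - 1)

print(np.round(proportions, 3))
print(chi_square, dof)
```

Under this reading, the share of alternative 1 rises from 39/171 (about 0.23) in age group 1 to 121/176 (about 0.69) in age group 13, while alternative 2 falls from 42 responses to 1. This is the kind of age response curve discussed later in the presentation.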

All 40 questions
The Windsor education data are represented by an expression matrix A of size 2293 × 160 whose rows correspond to 2,293 students and whose columns correspond to 160 variables (40 questions with four alternatives each). Matrix W has size 2293 × k, with each of the k columns defining a source signal. Matrix H has size k × 160, with each of the 160 columns giving the source signal expression pattern of the corresponding variable.

Estimating the Factorization Rank
A critical parameter in NMF is the factorization rank k. It defines the number of source signals used to approximate the target matrix. Given an NMF method and the target matrix, a common approach to deciding on k is to try different values, compute some quality measure of the results, and choose the best value according to this quality criterion. We use both R and JMP to estimate the value of k.
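
The try-several-ranks procedure can be sketched as a simple survey loop. This is a hedged illustration only: the authors used R and JMP, whereas the sketch below uses a hypothetical NumPy helper based on multiplicative updates, a synthetic target matrix, and the residual sum of squares (RSS) as the quality measure.

```python
import numpy as np

def nmf_rss(A, k, n_iter=300, seed=0, eps=1e-9):
    """Fit a rank-k NMF by multiplicative updates and return the RSS."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k)) + eps
    H = rng.random((k, A.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return float(((A - W @ H) ** 2).sum())

# Hypothetical target: a rank-3 non-negative signal plus small noise.
rng = np.random.default_rng(42)
A = rng.random((30, 3)) @ rng.random((3, 12)) + 0.01 * rng.random((30, 12))

# Survey a range of ranks and record the quality measure for each.
survey = {k: nmf_rss(A, k) for k in range(1, 7)}
for k, rss in survey.items():
    print(k, rss)
```

On data with a clear low-rank structure, the RSS typically drops sharply up to the true rank and flattens afterwards. RSS alone tends to keep decreasing with k, which is one reason to combine it with a stability measure.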

Interpreting the consensus matrix
The consensus matrices are computed at k = 2 to 10 for the Windsor education data. Samples are hierarchically clustered using distances derived from the consensus matrix entries, which are colored from 0 (deep blue, samples are never in the same cluster) to 1 (dark red, samples are always in the same cluster).
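
A consensus matrix of this kind can be sketched as follows. The sketch assumes the convention of Brunet et al. (2004) that each sample (a column of H) is assigned to the source signal with the largest weight; the three H matrices are hypothetical stand-ins for the results of repeated NMF runs.

```python
import numpy as np

def connectivity(H):
    """Entry (i, j) is 1 if columns i and j of H share the same
    dominant source signal, and 0 otherwise."""
    labels = np.argmax(H, axis=0)
    return (labels[:, None] == labels[None, :]).astype(float)

def consensus(H_runs):
    """Average the connectivity matrices over several NMF runs."""
    return np.mean([connectivity(H) for H in H_runs], axis=0)

# Hypothetical H matrices (k = 2, 4 samples) from three runs.
H_runs = [
    np.array([[0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.9, 0.8]]),
    np.array([[0.2, 0.1, 0.7, 0.9], [0.8, 0.9, 0.3, 0.1]]),  # labels swapped
    np.array([[0.6, 0.4, 0.1, 0.3], [0.4, 0.6, 0.9, 0.7]]),
]
C = consensus(H_runs)
print(np.round(C, 2))
```

Because the connectivity matrix only records whether two samples share a cluster, it is unaffected by the arbitrary ordering of the source signals from run to run. Here samples 3 and 4 are always together (entry 1), while samples 1 and 2 are together in two of three runs (entry 2/3).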

[Figure: Consensus map of all questions for rank 2]

[Figure: Consensus map of all questions for rank 3]

[Figure: Consensus map of all questions for rank 4]

[Figure: Consensus maps of all questions for ranks 2 to 10]

Cophenetic Correlation Coefficient
Although visual inspection is important, it is also important to have a quantitative measure of stability for each value of k. One measure, proposed by Brunet et al. (2004), is the cophenetic coefficient, which indicates the dispersion of the consensus matrix, defined as the average connectivity matrix over many clustering runs. We observe how the cophenetic coefficient changes as k increases and select values of k where its magnitude begins to fall. Hutchins et al. (2008) suggested choosing the first value of k at which the residual sum of squares (RSS) curve presents an inflection point.
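
One possible reading of "select values of k where the coefficient begins to fall" is sketched below: keep each rank whose coefficient is followed by a noticeably lower value at the next rank surveyed. The function name, the tolerance, and the coefficient values are all hypothetical; they are not results from the presentation.

```python
def ranks_before_drop(coefficients, tol=0.0):
    """Return the ranks k whose coefficient exceeds the coefficient
    at the next surveyed rank by more than tol."""
    ranks = sorted(coefficients)
    return [
        k for k, k_next in zip(ranks, ranks[1:])
        if coefficients[k_next] < coefficients[k] - tol
    ]

# Made-up cophenetic coefficients indexed by rank, for illustration only.
example = {2: 0.99, 3: 0.98, 4: 0.90, 5: 0.92, 6: 0.85}

print(ranks_before_drop(example))            # [2, 3, 5]
print(ranks_before_drop(example, tol=0.05))  # [3, 5]
```

The tolerance filters out small fluctuations, such as the step from rank 2 to rank 3 in the example. In practice the candidates would still be checked against the consensus maps.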

[Figures: Estimating k (two slides)]

NMF looks at all answer columns without paying attention to the natural grouping of 4 answers for each question. NMF does not "know" the correct answer; it is "unsupervised". NMF looks at how students of different ages answer proverb questions, that is, at the shape of the age response curve.

χ² partitioning
We group these data by the usual χ² partitioning approach, but we have found that the problem of dimension reduction remains. We believe this is where NMF should be useful. Ideally, the factors that distinguish the members of the ultimate groups are the same, but why must that be the case? Some factors may be active in one group but not in another; which ones are they? We think NMF answers that question.

What's after PCA/SVD/NMF?
A possible way forward is to use NMF to obtain response variables for recursive partitioning or a local control approach.

References
Brunet, J.P., Tamayo, P., Golub, T.R., and Mesirov, J.P. (2004) Metagenes and Molecular Pattern Discovery using Matrix Factorization, Proceedings of the National Academy of Sciences. 101, 4164-4169.
Good, I.J. (1969) Some Applications of the Singular Decomposition of a Matrix, Technometrics. 11(4), 823-831.
Lee, D.D., and Seung, H.S. (1999) Learning the Parts of Objects by Non-negative Matrix Factorization, Nature. 401, 788-791.

Thank You
Questions