Revision: Chapter 1-6. Applied Multivariate Statistics Spring 2012


Overview
- Cov, Cor, Mahalanobis distance, multivariate normal distribution
- Visualization: stars plot, mosaic plot with shading
- Outlier: chisq.plot
- Missing values: md.pattern, mice
- MDS: metric / non-metric
- Dissimilarities: daisy
- PCA
- LDA

Two variables: Covariance and Correlation
Covariance: $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$
Correlation: $\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1]$
Sample covariance: $\widehat{\mathrm{Cov}}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
Sample correlation: $r_{xy} = \widehat{\mathrm{Cor}}(x, y) = \frac{\widehat{\mathrm{Cov}}(x, y)}{\hat{\sigma}_x \hat{\sigma}_y}$
Correlation is invariant to changes in units, covariance is not (e.g. kilo/gram, meter/kilometer, etc.)
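A minimal R sketch of these sample quantities on simulated data (the variable names and numbers are made up for illustration):

  set.seed(1)
  height_m  <- rnorm(50, mean = 1.75, sd = 0.10)        # heights in meters
  weight_kg <- 60 + 30 * height_m + rnorm(50, sd = 5)   # weights in kilograms

  cov(height_m, weight_kg)    # covariance, carries units (m * kg)
  cor(height_m, weight_kg)    # unitless, always in [-1, 1]

  # changing units changes the covariance but not the correlation
  height_cm <- 100 * height_m
  cov(height_cm, weight_kg)   # 100 times larger than before
  cor(height_cm, weight_kg)   # identical to before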

Scatterplot: Correlation is scale invariant

Intuition and pitfalls for correlation
Correlation measures only the LINEAR relation between two variables.

Covariance matrix / correlation matrix: Table of pairwise values
True covariance matrix: $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$
True correlation matrix: $C_{ij} = \mathrm{Cor}(X_i, X_j)$
Sample covariance matrix: $S_{ij} = \widehat{\mathrm{Cov}}(x_i, x_j)$; diagonal: variances
Sample correlation matrix: $R_{ij} = \widehat{\mathrm{Cor}}(x_i, x_j)$; diagonal: 1
R: Functions cov, cor in package stats
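A short sketch of the matrix versions, using a built-in data set (iris stands in for the lecture's data):

  X <- iris[, 1:4]     # four numeric variables

  S <- cov(X)          # sample covariance matrix
  R <- cor(X)          # sample correlation matrix
  diag(S)              # variances on the diagonal
  diag(R)              # all equal to 1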

Multivariate Normal Distribution: Most common model choice
Density: $f(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$
Squared Mahalanobis distance: $MD^2(x) = (x - \mu)^T \Sigma^{-1} (x - \mu)$ = squared distance from the mean in standard deviations IN THE DIRECTION OF x

Mahalanobis distance: Example
$\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}$
Point (0, 10): MD = 10

Mahalanobis distance: Example (continued)
Same $\mu$ and $\Sigma$ as above.
Point (10, 7): MD ≈ 7.3
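Both examples can be checked with the base-R function mahalanobis(), which returns the squared distance:

  mu    <- c(0, 0)
  Sigma <- matrix(c(25, 0,
                     0, 1), nrow = 2, byrow = TRUE)

  sqrt(mahalanobis(c(0, 10), center = mu, cov = Sigma))   # 10
  sqrt(mahalanobis(c(10, 7), center = mu, cov = Sigma))   # sqrt(53), about 7.3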

Glyphplots: Stars
- Which cities are special?
- Which cities are like New Orleans?
- Seattle and Miami are quite far apart; how do they compare?
R: Function stars in package stats
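A minimal sketch of a star plot; mtcars stands in here for the lecture's city data, which is not part of this transcript:

  stars(mtcars[, 1:7], key.loc = c(14, 2),
        main = "Star plot: one glyph per observation")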

Mosaic plot with shading
- Shading marks cells with a surprisingly small or surprisingly large observed count
- p-value of the independence test: highly significant
R: Function mosaic in package vcd
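A sketch with a built-in contingency table (HairEyeColor, not the lecture's data); the shading legend shows the residuals and the p-value of the independence test:

  library(vcd)
  mosaic(HairEyeColor, shade = TRUE)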

Outliers: Theory of Mahalanobis Distance
- Assume the data is multivariate normally distributed (d dimensions)
- The squared Mahalanobis distances of the samples then follow a Chi-Square distribution with d degrees of freedom
- Expected value: d (by definition, the sum of d squared standard normal random variables has a Chi-Square distribution with d degrees of freedom)
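A quick simulation as a sanity check of this theory (sample size and dimension chosen arbitrarily):

  set.seed(1)
  d   <- 3
  X   <- matrix(rnorm(10000 * d), ncol = d)                   # multivariate normal data
  md2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))   # squared distances
  mean(md2)                # close to the expected value d = 3
  qchisq(0.95, df = d)     # chi-square quantile often used as an outlier cutoff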

Outliers: Check for multivariate outliers
Are there samples whose estimated Mahalanobis distance doesn't fit a Chi-Square distribution at all? Check with a QQ-plot.
Technical details:
- the Chi-Square distribution is still reasonably good for estimated Mahalanobis distances
- use robust estimates for $\mu$, $\Sigma$
R: Function chisq.plot in package mvoutlier
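A sketch of the QQ-plot check done by hand; iris stands in for the lecture's data, and the commented-out line shows the interactive chisq.plot() from mvoutlier (assumed to be installed):

  library(mvoutlier)
  X   <- iris[, 1:4]
  md2 <- mahalanobis(X, colMeans(X), cov(X))
  qqplot(qchisq(ppoints(nrow(X)), df = ncol(X)), md2,
         xlab = "Chi-square quantiles", ylab = "Squared Mahalanobis distance")
  abline(0, 1)
  # chisq.plot(X)   # interactive version; uses robust estimates of mu and Sigma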

Outliers: chisq.plot
The outlier is easily detected!

Missing values: Problem of Single Imputation
- Too optimistic: the imputation model (e.g. Y = a + bX) is only estimated, it is not the true model
- Thus, imputed values have some uncertainty
- Single imputation ignores this uncertainty, so the coverage probability of confidence intervals is wrong
Solution: Multiple Imputation, which incorporates both
- residual error
- model uncertainty (excluding model mis-specification)
R: Package mice for Multiple Imputation using Chained Equations

Multiple Imputation: MICE
1. Impute several times
2. Do the standard analysis for each imputed data set; get an estimate and std. error from each
3. Aggregate (pool) the results
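A minimal sketch of this impute / analyse / pool workflow with the mice package, on the small nhanes example data set that ships with it:

  library(mice)
  imp  <- mice(nhanes, m = 5, seed = 1)      # 1) impute several times
  fits <- with(imp, lm(chl ~ bmi + age))     # 2) standard analysis per imputed data set
  summary(pool(fits))                        # 3) aggregate (pool) the results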

Idea of MDS
Represent a high-dimensional point cloud in few (usually 2) dimensions while keeping the distances between points similar.
Classical / metric MDS: use a clever projection
- guaranteed to find the optimal solution only for Euclidean distances
- fast
R: Function cmdscale in the base distribution
Non-metric MDS:
- squeeze the data onto the table = minimize STRESS
- only conserve ranks = allow monotonic transformations before reducing dimensions
- slow(er)
R: Function isoMDS in package MASS
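A sketch of both variants on the built-in eurodist road distances (standing in for the lecture's example):

  library(MASS)
  d <- eurodist                       # distances between European cities

  fit_metric <- cmdscale(d, k = 2)    # classical / metric MDS
  fit_nonmet <- isoMDS(d, k = 2)      # non-metric MDS, minimizes STRESS

  plot(fit_metric, type = "n")
  text(fit_metric, labels = rownames(fit_metric), cex = 0.7)
  fit_nonmet$stress                   # STRESS of the non-metric solution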

Distance: To scale or not to scale?
If variables are not scaled:
- the variable with the largest range has the most weight
- the distance depends on the scale
Scaling gives every variable equal weight. A similar alternative is re-weighting:
$d(i, j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \dots + w_p (x_{ip} - x_{jp})^2}$
Scale if:
- variables measure different units (kg, meter, sec, ...)
- you explicitly want every variable to have equal weight
Don't scale if units are the same for all variables.
Most often: better to scale.
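A small illustration of the effect (mtcars stands in for the lecture's data; disp has a much larger range than mpg):

  X <- mtcars[, c("mpg", "disp")]
  round(as.matrix(dist(X))[1:3, 1:3], 1)          # unscaled: disp dominates
  round(as.matrix(dist(scale(X)))[1:3, 1:3], 2)   # scaled: equal weight per variable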

Dissimilarity for mixed data: Gower's Dissimilarity
Idea: use a dissimilarity $d^{(f)}_{ij}$ between 0 and 1 for each variable f, then aggregate:
$d(i, j) = \frac{1}{p} \sum_{f=1}^{p} d^{(f)}_{ij}$
- Binary (asymmetric/symmetric), nominal: use the methods discussed before (asymmetric: one group is much larger than the other)
- Interval-scaled: $d^{(f)}_{ij} = \frac{|x_{if} - x_{jf}|}{R_f}$, where $x_{if}$ is the value of object i in variable f and $R_f$ is the range of variable f over all objects
- Ordinal: use normalized ranks, then treat like interval-scaled based on the range
R: Function daisy in package cluster
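A sketch with a tiny, made-up mixed data set:

  library(cluster)
  df <- data.frame(height = c(1.60, 1.90, 1.75),             # interval-scaled
                   smoker = factor(c("yes", "no", "no")),    # nominal / binary
                   grade  = ordered(c("B", "A", "C")))       # ordinal
  daisy(df, metric = "gower")   # pairwise dissimilarities in [0, 1]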

PCA: Goals
- Goal 1: dimension reduction to a few dimensions while explaining most of the variance (use the first few PCs)
- Goal 2: find a one-dimensional index that separates objects best (use the first PC)

PCA (Version 1): Orthogonal directions
- PC 1 is the direction of largest variance
- PC 2 is perpendicular to PC 1 and again has the largest variance
- PC 3 is perpendicular to PC 1 and PC 2 and again has the largest variance
- etc.

How many PCs: Blood Example
- Rule 1: 5 PCs
- Rule 2: 3 PCs
- Rule 3: elbow after PC 1 (?)
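A sketch of how these rules are usually read off in R (iris stands in for the blood data, which is not part of this transcript):

  pc <- prcomp(iris[, 1:4], scale. = TRUE)
  summary(pc)                      # proportion of variance explained per PC
  screeplot(pc, type = "lines")    # look for the elbow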

Biplot: Show info on samples AND variables
Approximately true:
- data points: projection onto the first two PCs
- distance in the biplot ~ true distance
- projection of a sample onto an arrow gives the original (scaled) value of that variable
- arrow length: variance of the variable
- angle between arrows: correlation
The approximation is often crude, but good for a quick overview.
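A one-line sketch on the same stand-in data as above:

  pc <- prcomp(iris[, 1:4], scale. = TRUE)
  biplot(pc)   # points = samples, arrows = variables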

Supervised Learning: LDA
Bayes' theorem: $P(C|X) = \frac{P(C) P(X|C)}{P(X)} \propto P(C) P(X|C)$
- Prior / prevalence $P(C)$: fraction of samples in that class
- Assume: $X|C \sim N(\mu_C, \Sigma)$, i.e. the same covariance matrix in every class
- Bayes rule: choose the class for which $P(C|X)$ is maximal (this rule is optimal if all types of error are equally costly)
- Special case, two classes (0/1): choose c = 1 if $P(C=1|X) > 0.5$, or equivalently if the posterior odds $P(C=1|X) / P(C=0|X) > 1$
- In practice: estimate $P(C)$, $\mu_C$, $\Sigma$
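A minimal sketch with MASS::lda on a built-in data set (standing in for the lecture's example); the priors default to the class fractions in the training data:

  library(MASS)
  fit  <- lda(Species ~ ., data = iris)
  pred <- predict(fit, iris)
  head(pred$posterior)                                  # estimated P(C | X)
  table(predicted = pred$class, true = iris$Species)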

LDA
- Linear decision boundary
- Orthogonal directions of best separation: 1st Linear Discriminant = 1st Canonical Variable (in contrast to the 1st Principal Component)
- Classify to which class? Balance the prior and the Mahalanobis distance to each class center.

LDA: Quality of classification
- Using the training data also as test data leads to overfitting: the estimated error is too optimistic for new data
- Better: a separate test set (training / test split)
- Cross-validation (CV; e.g. leave-one-out cross-validation): every row is the test case once, with the rest as training data
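A sketch of leave-one-out CV; lda() can do it directly via CV = TRUE (same stand-in data as above):

  library(MASS)
  cv_fit <- lda(Species ~ ., data = iris, CV = TRUE)   # leave-one-out CV
  mean(cv_fit$class != iris$Species)                   # estimated error rate on new data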