Discovering Correlation in Data Vinh Nguyen (vinh.nguyen@unimelb.edu.au) Research Fellow in Data Science Computing and Information Systems DMD 7.14
Discovering Correlation
Why is correlation important? Discover relationships and causality. Feature ranking: select the best features for building better machine learning models.
Measures of correlation: Euclidean distance, Pearson coefficient, Mutual Information.
Case study: microarray data. What is it? How is it collected? How to build genetic networks from correlation. Which genes cause skin cancer?
What is Correlation? Correlation is used to detect pairs of variables that might have some relationship https://www.mathsisfun.com/data/correlation.html
What is Correlation? Visually identified via inspecting scatter plots https://www.mathsisfun.com/data/correlation.html
What is Correlation? Linear relations https://www.mathsisfun.com/data/correlation.html
Example of non-linear correlation It gets so hot that people aren't going near the shop, and sales start dropping https://www.mathsisfun.com/data/correlation.html
Example of Correlated Variables Can hint at potential causal relationships. Business decision: increase electricity production when temperature increases.
Example of Correlated Variables Correlation does not necessarily imply causality!
Example: Predicting Sales https://www.kaggle.com/c/rossmann-store-sales/data
Example: Predicting Sales Other correlations Sales vs. holiday Sales vs. day of the week Sales vs. distance to competitors Sales vs. average income in area
Why is correlation important? Discover relationships: one step towards discovering causality (A causes B; example: smoking causes lung cancer). Feature ranking: select the best features for building better machine learning models.
Case study: Microarray data DNA Microarrays (Gene Chips) measure genes' activity levels https://en.wikipedia.org/wiki/Bio-MEMS
The Central Dogma of Molecular Biology DNA makes RNA makes proteins. DNA contains multiple genes carrying the information to produce different types of proteins. Too much or too little protein of a certain type can cause disease. Gene chips measure the amount of mRNA (a proxy for protein level), i.e. the gene's activity level (expression level). http://www.atdbio.com/content/14/transcription-translation-and-replication
Microarray data
Each chip contains thousands of tiny probes corresponding to the genes (20k-30k genes in humans).
Activity level:
  Gene 1  Gene 2  ...  Gene 20K
  0.3     1.2          3.1
Microarray data from Multiple Conditions
              Gene 1  Gene 2  Gene 3  ...  Gene n
Condition 1    2.3     1.1     0.3         2.1
Condition 2    3.2     0.2     1.2         1.1
Condition 3    1.9     3.8     2.7         0.2
...
Condition m    2.8     3.1     2.5         3.4
Conditions: different time points (same person), or different people.
How can correlation help?
Correlation analysis on Microarray data Can reveal genes that exhibit similar patterns, and hence similar or related functions. Discover functions of unknown genes.
Build genetic networks Genes do not act in isolation: they control each other or work together. Regulator gene: Gene A controls the activity level of Gene B (a causal relationship).
Genetic network Connect genes with high correlation http://www.john.ranola.org/?page_id=116
Discover genes that are relevant to a disease http://compbio.pbworks.com/w/page/16252903/microarray
Correlation and Feature Ranking
Why is correlation important? Discover relationships and causality. Select the best features for building better machine learning models.
Measures of correlation: Euclidean distance, Pearson coefficient, Mutual Information.
Case study: microarray data. What is it? How is it collected? How to build genetic networks from correlation. Which genes cause skin cancer?
Measures of correlations Euclidean distance Pearson coefficient Mutual Information
Notation
           Gene 1  Gene 2  Gene 3  ...  Gene n
Person 1    2.3     1.1     0.3         2.1
Person 2    3.2     0.2     1.2         1.1
Person 3    1.9     3.8     2.7         0.2
...
Person m    2.8     3.1     2.5         3.4
Euclidean distance
d(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xm - ym)^2 )
where x and y are two variables measured over the same m observations.
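As a sketch, the distance can be computed directly from two equal-length sequences (the `euclidean` helper below is illustrative, not from the slides):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# classic 3-4-5 right triangle: distance between (0, 0) and (3, 4)
print(euclidean([0, 0], [3, 4]))  # 5.0
```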
Drawbacks of Euclidean distance
Objects can be represented on different measurement scales:
               Day 1   Day 2   Day 3  ...  Day m
Temperature     20      22      16         33
#Ice-creams     6022    5523    4508       7808
#Electricity    12034   10532   8890       15408
d(temp, ice-cream) = 540324    d(temp, electricity) = 12309388
Euclidean distance does not give a clear intuition about how well variables are correlated.
Drawbacks of Euclidean distance Cannot discover variables with similar behaviours/dynamics but at different scale
Drawbacks of Euclidean distance Cannot discover variables with similar behaviours/dynamics but in the opposite direction (negative correlation)
Pearson's correlation coefficient
r(x, y) = sum_i (x_i - xbar)(y_i - ybar) / sqrt( sum_i (x_i - xbar)^2 * sum_i (y_i - ybar)^2 )
where xbar = (1/m) sum_i x_i is the sample mean (and similarly for ybar).
Range within [-1, 1]: 1 for perfect positive linear correlation, -1 for perfect negative linear correlation, 0 means no linear correlation.
The absolute value |r| indicates the strength of the linear correlation.
Pearson's correlation coefficient
Ice Cream Sales vs Temperature
Temperature (C)   Ice Cream Sales
14                $215
16                $325
11                $185
Compute the Pearson correlation between Temperature and Ice-cream sales: r = 0.9074
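The worked example above can be checked with NumPy (assuming NumPy is available; `np.corrcoef` returns the full correlation matrix, so we take an off-diagonal entry):

```python
import numpy as np

temp  = np.array([14, 16, 11])
sales = np.array([215, 325, 185])

# corrcoef returns a 2x2 matrix; entry [0, 1] is r(temp, sales)
r = np.corrcoef(temp, sales)[0, 1]
print(round(r, 4))  # 0.9074
```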
Pearson's correlation coefficient
What is the Pearson coefficient in this case (without computation)?
Temperature (C)   Ice Cream Sales
14                $215
16                $325
Examples https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
Properties of Pearson's correlation
Range within [-1, 1].
Scale invariant: r(x, y) = r(x, K·y), where K is a positive real constant.
Location invariant: r(x, y) = r(x, C + y), where C is a real constant.
Can only detect linear relationships: y = a·x + b + noise.
Cannot detect non-linear relationships, e.g. y = sin(x) + noise.
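These properties are easy to check empirically; the synthetic data below is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(scale=0.1, size=1000)  # linear relation plus noise

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

print(r(x, y))        # close to 1 (strong linear relation)
print(r(x, 5 * y))    # scale invariance: same value
print(r(x, y + 100))  # location invariance: same value
print(r(x, x ** 2))   # typically near 0 despite a perfect non-linear relation
```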
Introduction to Mutual Information
Can detect non-linear relationships. Works with discrete variables; continuous variables are first discretized into bins.
                    Gene 1  Gene 2  ...  Gene 20K
Activity level       1.2     0.2          3.1
Discretized level    Mid     Low          High
Thresholds: < 1: low; [1, 3]: mid; > 3: high
Variable discretization Domain knowledge: assign thresholds manually Equal-width bin Equal frequency bin
Variable discretization
Domain knowledge: assign thresholds manually. Example (speed): 0-40: slow; 40-70: mid; > 70: high.
Equal-width bins: all bins have the same length (the range from min to max is split evenly).
Equal-frequency bins: all bins contain the same number of points.
Equal-frequency bins
Sort the values in increasing order and take the cut points at percentiles.
2 bins: the cut point is the median.
3 bins: the 33.3% and 66.7% percentiles.
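A minimal sketch of equal-frequency binning with NumPy (the `equal_frequency_bins` helper and the sample data are illustrative):

```python
import numpy as np

def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using percentile cut points,
    so the bins hold (roughly) equal numbers of points."""
    values = np.asarray(values, dtype=float)
    # interior cut points, e.g. the 33.3% and 66.7% percentiles for 3 bins
    cuts = np.percentile(values, np.linspace(0, 100, n_bins + 1)[1:-1])
    return np.searchsorted(cuts, values, side="right")

data = [0.2, 1.2, 3.1, 0.4, 2.0, 5.5]
bins = equal_frequency_bins(data, 3)
print(bins)               # [0 1 2 0 1 2]
print(np.bincount(bins))  # [2 2 2]: two of the six points per bin
```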
Variable discretization
                    Gene 1  Gene 2  ...  Gene 20K
Activity level       1.2     0.2          3.1      (original variable)
Discretized level    Mid     Low          High     (discretized variable)
Value distribution of the discretized variable:
            Low   Mid   High
Proportion  0.33  0.33  0.33
Entropy of a Random Variable
A measure of information content.
            Low   Mid   High
Proportion  0.33  0.33  0.33
H(X) = ?
Answer: H(X) = -0.33 × log2(0.33) × 3 ≈ 1.58 bits
Entropy of a Random Variable
Measures the amount of surprise when observing the outcome of a random variable.
High surprise:
            Low   Mid   High
Proportion  0.33  0.33  0.33
Low surprise:
            Low   Mid   High
Proportion  1     0     0
Properties of the entropy
H(X) ≥ 0.
Entropy is the smallest number of bits needed, on average, to represent a symbol from the sequence (low, high, low, low, high, mid, mid, mid, ...).
Maximized for the uniform distribution.
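A sketch of the entropy computation (log base 2, so the result is in bits):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/3, 1/3, 1/3]))  # log2(3) ≈ 1.585: maximal for 3 outcomes
print(entropy([1, 0, 0]))        # 0.0: a certain outcome carries no surprise
```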
Joint Entropy
Two variables X, Y:
X = (high, low, low, high, mid, ...)
Y = (low, mid, high, mid, low, ...)
Contingency table / joint distribution:
X\Y    Low   Mid   High
Low    0.1   0.2   0.2
Mid    0.2   0.1   0.05
High   0.03  0.03  0.08
Scatter plot to Contingency Table
X\Y    Low   High
Low    0     0.5
High   0.5   0
Scatter plot to Contingency Table
X\Y    Low   High
Low    0     0.5
High   0.5   0
H(X,Y) = ?
Answer: H(X,Y) = -0.5 × log2(0.5) × 2 = 1 bit
Mutual Information
0 ≤ MI(X,Y) ≤ min(H(X), H(Y))
Normalized Mutual Information: NMI(X,Y) = MI(X,Y) / min(H(X), H(Y))
Range within [0, 1].
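Using the identity MI(X,Y) = H(X) + H(Y) - H(X,Y), mutual information can be computed from a joint probability table. The sketch below (helper names are mine) uses the 2×2 contingency table from the scatter-plot example, for which H(X) = H(Y) = H(X,Y) = 1 bit:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """MI(X,Y) = H(X) + H(Y) - H(X,Y), from the joint distribution table."""
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

# perfectly (negatively) dependent variables: X low <-> Y high
joint = np.array([[0.0, 0.5],
                  [0.5, 0.0]])
mi = mutual_information(joint)
nmi = mi / min(entropy(joint.sum(axis=1)), entropy(joint.sum(axis=0)))
print(mi, nmi)  # 1.0 1.0: MI equals min(H(X), H(Y)), so NMI is 1
```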
Properties of Mutual Information The amount of information shared between two variables X and Y. MI(X,Y) large: X and Y are highly correlated (dependent).
Examples Pearson: -0.0864 NMI: 0.43 (3-bin equal frequency discretization)
Examples Pearson? NMI?
Examples Pearson: 0.08 NMI: 0.009
Other applications
Identifying variables that are highly correlated with a class variable:
           Gene 1  Gene 2  Gene 3  ...  Gene n   Cancer
Person 1    2.3     1.1     0.3         2.1      1
Person 2    3.2     0.2     1.2         1.1      1
Person 3    1.9     3.8     2.7         0.2      0
...
Person m    2.8     3.1     2.5         3.4      0
Example: measure the correlation of each gene with the class labels, then rank genes according to correlation.
Which genes correlate with cancer? Genes Cancer Non cancer
Ranking Features for Building Machine Learning Models
           Gene 1  Gene 2  Gene 3  ...  Gene n   Cancer
Person 1    2.3     1.1     0.3         2.1      1
Person 2    3.2     0.2     1.2         1.1      1
Person 3    1.9     3.8     2.7         0.2      0
...
Person m    2.8     3.1     2.5         3.4      0
Cancer = f(gene 1, gene 2, ..., gene n)
Use correlation to reduce the number of variables. Use the relevant genes only: improving accuracy & performance.
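A sketch of correlation-based feature ranking on synthetic data (the data and the choice of gene 2 as the label-driving gene are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n_people, n_genes = 100, 6
X = rng.normal(size=(n_people, n_genes))  # expression levels
# class label driven almost entirely by gene 2
y = (X[:, 2] + 0.1 * rng.normal(size=n_people) > 0).astype(float)

# score each gene by |Pearson correlation| with the class label
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_genes)])
ranking = np.argsort(scores)[::-1]
print(ranking)  # gene 2 ranks first; keep only the top-ranked genes
```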
Notes Correlation ≠ Causality http://tylervigen.com/spurious-correlations Google Trends correlations