Discovering Correlation in Data Vinh Nguyen (vinh.nguyen@unimelb.edu.au) Research Fellow in Data Science Computing and Information Systems DMD 7.14
Discovering Correlation
Why is correlation important? Discover relationships and causality. Feature ranking: select the best features for building better machine learning models.
Measures of correlation: Euclidean distance, Pearson coefficient, Mutual Information.
Case study: microarray data. What is it? How is it collected? How to build genetic networks from correlation. Which genes cause skin cancer?
What is Correlation? Correlation is used to detect pairs of variables that might have some relationship https://www.mathsisfun.com/data/correlation.html
What is Correlation? Visually identified via inspecting scatter plots https://www.mathsisfun.com/data/correlation.html
What is Correlation? Linear relations https://www.mathsisfun.com/data/correlation.html
Example of non-linear correlation It gets so hot that people aren't going near the shop, and sales start dropping https://www.mathsisfun.com/data/correlation.html
Example of Correlated Variables Can hint at potential causal relationships. Business decision: increase electricity production when temperature increases.
Example of Correlated Variables Correlation does not necessarily imply causality!
Example: Predicting Sales https://www.kaggle.com/c/rossmann-store-sales/data
Example: Predicting Sales Other correlations Sales vs. holiday Sales vs. day of the week Sales vs. distance to competitors Sales vs. average income in area
Why is correlation important? Discover relationships: one step towards discovering causality (A causes B; example: smoking causes lung cancer). Feature ranking: select the best features for building better machine learning models.
Case study: Microarray data DNA Microarrays (Gene Chips) measure genes' activity levels https://en.wikipedia.org/wiki/Bio-MEMS
The Central Dogma of Molecular Biology DNA makes RNA makes proteins. DNA contains multiple genes carrying the information to produce different types of proteins. Too much or too little protein of a certain type can cause disease. Gene chips measure the amount of mRNA (a proxy for protein level), i.e. the gene's activity level (expression level). http://www.atdbio.com/content/14/transcription-translation-and-replication
Microarray data
Each chip contains thousands of tiny probes corresponding to the genes (20k-30k genes in humans).
Activity level:
  Gene 1  Gene 2  ...  Gene 20K
  0.3     1.2          3.1
Microarray data from Multiple Conditions
              Gene 1  Gene 2  Gene 3  ...  Gene n
Condition 1    2.3     1.1     0.3         2.1
Condition 2    3.2     0.2     1.2         1.1
Condition 3    1.9     3.8     2.7         0.2
...
Condition m    2.8     3.1     2.5         3.4
Conditions: different time points (same person), or different people.
How can correlation help?
Correlation analysis on Microarray data Can reveal genes that exhibit similar patterns, and hence similar or related functions. Discover functions of unknown genes.
Build genetic networks Genes do not act in isolation: they control each other or work together. Regulator gene: Gene A controls the activity level of Gene B (a causal relationship).
Genetic network Connect genes with high correlation http://www.john.ranola.org/?page_id=116
Discover genes that are relevant to a disease http://compbio.pbworks.com/w/page/16252903/microarray
Correlation and Feature Ranking
Why is correlation important? Discover relationships and causality. Select the best features for building better machine learning models.
Measures of correlation: Euclidean distance, Pearson coefficient, Mutual Information.
Case study: microarray data. What is it? How is it collected? How to build genetic networks from correlation. Which genes cause skin cancer?
Measures of correlations Euclidean distance Pearson coefficient Mutual Information
Notation
           Gene 1  Gene 2  Gene 3  ...  Gene n
Person 1    2.3     1.1     0.3         2.1
Person 2    3.2     0.2     1.2         1.1
Person 3    1.9     3.8     2.7         0.2
...
Person m    2.8     3.1     2.5         3.4
Euclidean distance
d(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xm - ym)^2 )
where x and y are two variables measured over the same m observations.
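As a sketch, the distance can be computed directly from two equal-length sequences (the `euclidean` helper below is illustrative, not from the slides):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# classic 3-4-5 right triangle: distance between (0, 0) and (3, 4)
print(euclidean([0, 0], [3, 4]))  # 5.0
```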
Drawbacks of Euclidean distance
Objects can be represented on different measurement scales:
               Day 1   Day 2   Day 3  ...  Day m
Temperature     20      22      16         33
#Ice-creams     6022    5523    4508       7808
#Electricity    12034   10532   8890       15408
d(temp, ice-cream) = 540324    d(temp, electricity) = 12309388
Euclidean distance does not give a clear intuition about how well variables are correlated.
Drawbacks of Euclidean distance Cannot discover variables with similar behaviours/dynamics but at different scale
Drawbacks of Euclidean distance Cannot discover variables with similar behaviours/dynamics but in the opposite direction (negative correlation)
Pearson's correlation coefficient
r(x, y) = sum_i (x_i - xbar)(y_i - ybar) / sqrt( sum_i (x_i - xbar)^2 * sum_i (y_i - ybar)^2 )
where xbar = (1/m) sum_i x_i is the sample mean (and similarly for ybar).
Range within [-1, 1]: 1 for perfect positive linear correlation, -1 for perfect negative linear correlation, 0 means no linear correlation.
The absolute value |r| indicates the strength of the linear correlation.
Pearson's correlation coefficient
Ice Cream Sales vs Temperature
Temperature (C)   Ice Cream Sales
14                $215
16                $325
11                $185
Compute the Pearson correlation between Temperature and Ice-cream sales: r = 0.9074
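The worked example above can be checked with NumPy (assuming NumPy is available; `np.corrcoef` returns the full correlation matrix, so we take an off-diagonal entry):

```python
import numpy as np

temp  = np.array([14, 16, 11])
sales = np.array([215, 325, 185])

# corrcoef returns a 2x2 matrix; entry [0, 1] is r(temp, sales)
r = np.corrcoef(temp, sales)[0, 1]
print(round(r, 4))  # 0.9074
```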
Pearson's correlation coefficient
What is the Pearson coefficient in this case (without computation)?
Temperature (C)   Ice Cream Sales
14                $215
16                $325
Examples https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
Properties of Pearson's correlation
Range within [-1, 1].
Scale invariant: r(x, y) = r(x, K·y), where K is a positive real constant.
Location invariant: r(x, y) = r(x, C + y), where C is a real constant.
Can only detect linear relationships: y = a·x + b + noise.
Cannot detect non-linear relationships, e.g. y = sin(x) + noise.
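These properties are easy to check empirically; the synthetic data below is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(scale=0.1, size=1000)  # linear relation plus noise

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

print(r(x, y))        # close to 1 (strong linear relation)
print(r(x, 5 * y))    # scale invariance: same value
print(r(x, y + 100))  # location invariance: same value
print(r(x, x ** 2))   # typically near 0 despite a perfect non-linear relation
```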
Introduction to Mutual Information
Can detect non-linear relationships. Works with discrete variables; continuous variables are first discretized into bins.
                    Gene 1  Gene 2  ...  Gene 20K
Activity level       1.2     0.2          3.1
Discretized level    Mid     Low          High
Thresholds: < 1: low; [1, 3]: mid; > 3: high
Variable discretization Domain knowledge: assign thresholds manually Equal-width bin Equal frequency bin
Variable discretization
Domain knowledge: assign thresholds manually. Example (speed): 0-40: slow; 40-70: mid; > 70: high.
Equal-width bins: all bins have the same length (the range from min to max is split evenly).
Equal-frequency bins: all bins contain the same number of points.
Equal-frequency bins
Sort the values in increasing order and take the cut points at percentiles.
2 bins: the cut point is the median.
3 bins: the 33.3% and 66.7% percentiles.
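A minimal sketch of equal-frequency binning with NumPy (the `equal_frequency_bins` helper and the sample data are illustrative):

```python
import numpy as np

def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using percentile cut points,
    so the bins hold (roughly) equal numbers of points."""
    values = np.asarray(values, dtype=float)
    # interior cut points, e.g. the 33.3% and 66.7% percentiles for 3 bins
    cuts = np.percentile(values, np.linspace(0, 100, n_bins + 1)[1:-1])
    return np.searchsorted(cuts, values, side="right")

data = [0.2, 1.2, 3.1, 0.4, 2.0, 5.5]
bins = equal_frequency_bins(data, 3)
print(bins)               # [0 1 2 0 1 2]
print(np.bincount(bins))  # [2 2 2]: two of the six points per bin
```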
Variable discretization
                    Gene 1  Gene 2  ...  Gene 20K
Activity level       1.2     0.2          3.1      (original variable)
Discretized level    Mid     Low          High     (discretized variable)
Value distribution of the discretized variable:
            Low   Mid   High
Proportion  0.33  0.33  0.33
Entropy of a Random Variable
A measure of information content.
            Low   Mid   High
Proportion  0.33  0.33  0.33
H(X) = ?
Answer: H(X) = -0.33 × log2(0.33) × 3 ≈ 1.58 bits
Entropy of a Random Variable
Measures the amount of surprise when observing the outcome of a random variable.
High surprise:
            Low   Mid   High
Proportion  0.33  0.33  0.33
Low surprise:
            Low   Mid   High
Proportion  1     0     0
Properties of the entropy
H(X) ≥ 0.
Entropy is the smallest number of bits needed, on average, to represent a symbol from the sequence (low, high, low, low, high, mid, mid, mid, ...).
Maximized for the uniform distribution.
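A sketch of the entropy computation (log base 2, so the result is in bits):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/3, 1/3, 1/3]))  # log2(3) ≈ 1.585: maximal for 3 outcomes
print(entropy([1, 0, 0]))        # 0.0: a certain outcome carries no surprise
```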
Joint Entropy
Two variables X, Y:
X = (high, low, low, high, mid, ...)
Y = (low, mid, high, mid, low, ...)
Contingency table / joint distribution:
X\Y    Low   Mid   High
Low    0.1   0.2   0.2
Mid    0.2   0.1   0.05
High   0.03  0.03  0.08
Scatter plot to Contingency Table
X\Y    Low   High
Low    0     0.5
High   0.5   0
Scatter plot to Contingency Table
X\Y    Low   High
Low    0     0.5
High   0.5   0
H(X,Y) = ?
Answer: H(X,Y) = -0.5 × log2(0.5) × 2 = 1 bit
Mutual Information
0 ≤ MI(X,Y) ≤ min(H(X), H(Y))
Normalized Mutual Information: NMI(X,Y) = MI(X,Y) / min(H(X), H(Y))
Range within [0, 1].
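Using the identity MI(X,Y) = H(X) + H(Y) - H(X,Y), mutual information can be computed from a joint probability table. The sketch below (helper names are mine) uses the 2×2 contingency table from the scatter-plot example, for which H(X) = H(Y) = H(X,Y) = 1 bit:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """MI(X,Y) = H(X) + H(Y) - H(X,Y), from the joint distribution table."""
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

# perfectly (negatively) dependent variables: X low <-> Y high
joint = np.array([[0.0, 0.5],
                  [0.5, 0.0]])
mi = mutual_information(joint)
nmi = mi / min(entropy(joint.sum(axis=1)), entropy(joint.sum(axis=0)))
print(mi, nmi)  # 1.0 1.0: MI equals min(H(X), H(Y)), so NMI is 1
```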
Properties of Mutual Information The amount of information shared between two variables X and Y. MI(X,Y) large: X and Y are highly correlated (dependent).
Examples Pearson: -0.0864 NMI: 0.43 (3-bin equal frequency discretization)
Examples Pearson? NMI?
Examples Pearson: 0.08 NMI: 0.009
Other applications
Identifying variables that are highly correlated with a class variable:
           Gene 1  Gene 2  Gene 3  ...  Gene n   Cancer
Person 1    2.3     1.1     0.3         2.1      1
Person 2    3.2     0.2     1.2         1.1      1
Person 3    1.9     3.8     2.7         0.2      0
...
Person m    2.8     3.1     2.5         3.4      0
Example: measure the correlation of each gene with the class labels, then rank genes according to correlation.
Which genes correlate with cancer? Genes Cancer Non cancer
Ranking Features for Building Machine Learning Models
           Gene 1  Gene 2  Gene 3  ...  Gene n   Cancer
Person 1    2.3     1.1     0.3         2.1      1
Person 2    3.2     0.2     1.2         1.1      1
Person 3    1.9     3.8     2.7         0.2      0
...
Person m    2.8     3.1     2.5         3.4      0
Cancer = f(gene 1, gene 2, ..., gene n)
Use correlation to reduce the number of variables. Use the relevant genes only: improving accuracy & performance.
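A sketch of correlation-based feature ranking on synthetic data (the data and the choice of gene 2 as the label-driving gene are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n_people, n_genes = 100, 6
X = rng.normal(size=(n_people, n_genes))  # expression levels
# class label driven almost entirely by gene 2
y = (X[:, 2] + 0.1 * rng.normal(size=n_people) > 0).astype(float)

# score each gene by |Pearson correlation| with the class label
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_genes)])
ranking = np.argsort(scores)[::-1]
print(ranking)  # gene 2 ranks first; keep only the top-ranked genes
```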
Notes Correlation ≠ Causality http://tylervigen.com/spurious-correlations Google Trends correlations