Discovering Correlation in Data. Vinh Nguyen Research Fellow in Data Science Computing and Information Systems DMD 7.


1 Discovering Correlation in Data Vinh Nguyen Research Fellow in Data Science Computing and Information Systems DMD 7.14

2 Discovering Correlation. Why is correlation important? Discover relationships, causality. Feature ranking: select the best features for building better machine learning models. Measures of correlation: Euclidean distance, Pearson coefficient, Mutual Information. Case study: microarray data. What is it? How to collect it? How to build genetic networks from correlation. Which genes cause skin cancer?

3 What is Correlation? Correlation is used to detect pairs of variables that might have some relationship

4 What is Correlation? Visually identified via inspecting scatter plots

5 What is Correlation? Linear relations

6 Example of non-linear correlation It gets so hot that people aren't going near the shop, and sales start dropping

7 Example of non-linear correlation It gets so hot that people aren't going near the shop, and sales start dropping

8 Example of Correlated Variables. Can hint at potential causal relationships. Business decision: increase electricity production when temperature increases.

9 Example of Correlated Variables Correlation does not necessarily imply causality!

10 Example of Correlated Variables Correlation does not necessarily imply causality!

11 Example: Predicting Sales

12 Example: Predicting Sales

13 Example: Predicting Sales

14 Example: Predicting Sales. Other correlations: sales vs. holiday; sales vs. day of the week; sales vs. distance to competitors; sales vs. average income in area.

15 Why is correlation important? Discover relationships. One step towards discovering causality: A causes B. Example: smoking causes lung cancer. Feature ranking: select the best features for building better machine learning models.

16 Case study: Microarray data. DNA microarrays (gene chips) measure genes' level of activity.

17 The Central Dogma of Molecular Biology. DNA makes RNA makes proteins. DNA contains multiple genes, each carrying the information to produce a different type of protein. Too much or too little protein of a certain type can cause disease. Gene chips can measure the amount of mRNA (a proxy for protein level), i.e. the activity level (expression level).

18 Microarray data. Each chip contains thousands of tiny probes corresponding to the genes (20k - 30k genes in humans). [Figure: activity level for Gene 1, Gene 2, ..., Gene 20K]

19 Microarray data from Multiple Conditions. [Table: rows Condition 1 ... Condition m, columns Gene 1 ... Gene n] Conditions: different time points for the same person, or different people. How can correlation help?

20 Correlation analysis on Microarray data. Can reveal genes that exhibit similar patterns, which suggests similar or related functions. Discover functions of unknown genes.

21 Build genetic networks. Genes do not act in isolation: they control each other or work together. Regulator gene: Gene A controls the activity level of Gene B (a causal relationship).

22 Genetic network Connect genes with high correlation
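One way to sketch the idea of connecting highly correlated genes is shown below. It assumes the Pearson coefficient (via NumPy's `np.corrcoef`) as the correlation measure and an arbitrary cut-off of 0.8; the function name `correlation_network`, the threshold and the toy expression values are all illustrative assumptions, not taken from the slides.

```python
import numpy as np

def correlation_network(expr, threshold=0.8):
    """Return gene pairs (i, j) whose |Pearson r| is at least `threshold`.

    expr: array of shape (m conditions, n genes).
    The threshold is an assumed, illustrative value.
    """
    r = np.corrcoef(expr, rowvar=False)  # n x n matrix of gene-gene correlations
    n = r.shape[0]
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if abs(r[i, j]) >= threshold]

# Toy data, made up for illustration: 5 conditions x 3 genes
expr = np.array([
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 1.0],
    [3.0, 6.2, 4.0],
    [4.0, 8.1, 2.0],
    [5.0, 9.8, 3.0],
])

edges = correlation_network(expr)
print(edges)  # each pair is an edge in the network
```

Taking the absolute value keeps strongly negatively correlated genes connected as well; dropping `abs` would restrict the network to positive relationships.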

23 Discover genes that are relevant to a disease

24 Correlation and Feature Ranking. Why is correlation important? Discover relationships, causality. Select the best features for building better machine learning models. Measures of correlation: Euclidean distance, Pearson coefficient, Mutual Information. Case study: microarray data. What is it? How to collect it? How to build genetic networks from correlation. Which genes cause skin cancer?

25 Measures of correlation: Euclidean distance, Pearson coefficient, Mutual Information.

26 Notation. [Table: rows Person 1 ... Person m, columns Gene 1 ... Gene n]

27 Euclidean distance
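For reference, the standard definition of the Euclidean distance can be written as follows. The symbols are assumptions introduced here: x and y are two variables (for example two genes), each observed m times, with x_i and y_i the i-th observations.

```latex
d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}
```

A small distance means the two variables take similar values across the m observations.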

28 Drawbacks of Euclidean distance. Objects can be represented on different measurement scales. [Table: rows Temperature, #Ice-creams, #Electricity; columns Day 1 ... Day m] d(temp,ice-cr)= d(temp,elect)= Euclidean distance does not give a clear intuition about how well variables are correlated.

29 Drawbacks of Euclidean distance. Cannot discover variables with similar behaviours/dynamics but at a different scale.

30 Drawbacks of Euclidean distance Cannot discover variables with similar behaviours/dynamics but in the opposite direction (negative correlation)
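The drawbacks listed on the last three slides can be seen on a small example. This is a minimal sketch with made-up readings: `ice_creams` follows the temperature exactly but on a larger scale, `opposite` moves in the opposite direction, and `flat` is a hypothetical series with no clear relation to temperature.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two equally long series."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

# Hypothetical daily readings, made up for illustration
temperature = np.array([20.0, 25.0, 30.0, 35.0])
ice_creams = 10.0 * temperature                # same dynamics, different scale
opposite = -temperature                        # same dynamics, opposite direction
flat = np.array([27.0, 28.0, 27.0, 28.0])      # no clear relation to temperature

d_scale = euclidean(temperature, ice_creams)
d_opposite = euclidean(temperature, opposite)
d_flat = euclidean(temperature, flat)

# The unrelated series ends up "closest" to temperature.
print(d_scale, d_opposite, d_flat)
```

The two series that are perfectly linearly related to temperature receive the largest distances, which is why a raw distance says little about correlation.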

31 Pearson's correlation coefficient. Defined in terms of deviations from the sample mean. Range within [-1,1]: 1 for perfect positive linear correlation, -1 for perfect negative linear correlation, 0 means no linear correlation. The absolute value |r| indicates the strength of the linear correlation.
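The usual textbook form of the coefficient is given below for reference. The notation is an assumption introduced here: x and y are two variables observed m times, and the barred symbols are their sample means.

```latex
r(x, y) = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}
               {\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{m} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i,
\quad
\bar{y} = \frac{1}{m}\sum_{i=1}^{m} y_i
```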

32 Pearson's correlation coefficient. Compute the Pearson correlation between temperature and ice-cream sales. [Table: Ice Cream Sales vs Temperature; columns Temperature (°C) and Ice Cream Sales ($)]

33 Pearson's correlation coefficient. Compute the Pearson correlation between temperature and ice-cream sales. [Table: Ice Cream Sales vs Temperature; columns Temperature (°C) and Ice Cream Sales ($)]

34 Pearson's correlation coefficient. What is the Pearson coefficient in this case (without computation)? [Table: Ice Cream Sales vs Temperature; columns Temperature (°C) and Ice Cream Sales ($)]

35 Examples

36 Properties of Pearson's correlation. Range within [-1,1]. Scale invariant: r(x,y) = r(x, Ky), where K is a positive real constant. Location invariant: r(x,y) = r(x, C+y), where C is a real constant. Can only detect linear relationships: y = a·x + b + noise. Cannot detect non-linear relationships: y = sin(x) + noise.
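These properties can be checked numerically. The sketch below is illustrative only: the data are synthetic, the constants K = 3 and C = 10 are arbitrary, and cos(x) on an interval symmetric around zero stands in for the slide's sin(x) example because its linear correlation with x is exactly zero by symmetry (sin(x) over a single period would still show a sizeable linear trend).

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient computed from deviations from the means."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

rng = np.random.default_rng(0)                 # fixed seed for reproducibility
x = np.linspace(-np.pi, np.pi, 201)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)  # y = a*x + b + noise

r = pearson(x, y)
r_scaled = pearson(x, 3.0 * y)     # scale invariance, K = 3
r_shifted = pearson(x, y + 10.0)   # location invariance, C = 10
r_cos = pearson(x, np.cos(x))      # strong but non-linear dependence

print(r, r_scaled, r_shifted, r_cos)
```

The first three values agree up to rounding, while `r_cos` is essentially zero even though cos(x) is fully determined by x.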

37 Introduction to Mutual Information. Can detect non-linear relationships. Works with discrete variables. For continuous variables: discretize into bins. Thresholds: <1: low, [1,3]: mid, >3: high. [Table: activity level and discretized level for Gene 1, Gene 2, ..., Gene 20K, e.g. Mid, Low, High]
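The thresholds on this slide translate directly into a small function. This is a minimal sketch; the function name `discretize_level` and the three sample activity levels are hypothetical.

```python
def discretize_level(value):
    """Map an activity level to low / mid / high using the slide's thresholds."""
    if value < 1:
        return "low"    # <1
    if value <= 3:
        return "mid"    # [1, 3]
    return "high"       # >3

# Hypothetical activity levels for three genes
activity = [1.2, 0.2, 3.5]
levels = [discretize_level(v) for v in activity]
print(levels)
```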

38 Variable discretization. Domain knowledge: assign thresholds manually. Equal-width bins. Equal-frequency bins.

39 Variable discretization. Domain knowledge: assign thresholds manually, e.g. speed: 0-40: slow, 40-70: mid, >70: high. Equal-width bins: bins have the same length between min and max. Equal-frequency bins: bins contain the same number of points.

40 Equal-frequency bins. Sort the values in increasing order. Take the cut points as percentiles. 2 bins: the cut point is the median. 3 bins: the 33.3rd and 66.7th percentiles.
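The two automatic schemes can be sketched with NumPy as follows. The helper names and the sample values are assumptions made for illustration; `np.quantile` uses linear interpolation by default, so the cut points may differ slightly from other percentile conventions.

```python
import numpy as np

def equal_width_bins(values, n_bins):
    """Bins of equal length between min and max; returns labels 0..n_bins-1."""
    values = np.asarray(values, float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    return np.digitize(values, edges[1:-1])  # only interior edges act as cut points

def equal_frequency_bins(values, n_bins):
    """Cut points taken at percentiles so bins hold about the same number of points."""
    values = np.asarray(values, float)
    cuts = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(values, cuts)

# Made-up values with one outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
ew = equal_width_bins(values, 3)
ef = equal_frequency_bins(values, 3)
print(ew, ef)
```

With an outlier present, equal-width binning puts almost every point in the first bin, whereas equal-frequency binning keeps the bins balanced.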

41 Variable discretization. [Figure: activity levels of Gene 1, Gene 2, ..., Gene 20K mapped to discretized levels (Mid, Low, High); value distribution of the original variable next to the proportions of Low, Mid and High in the discretized variable]

42 Entropy of Random Variable. A measure of information content. H(X) = ? [Table: proportions of Low, Mid, High]

43 Entropy of Random Variable. A measure of information content. [Table: proportions of Low, Mid, High] H(X) = -0.33 × log(0.33) × 3
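The general definition behind this calculation is the standard Shannon entropy. The notation p(x) for the proportion of level x is an assumption introduced here, and base-2 logarithms are assumed so that the result is in bits.

```latex
H(X) = -\sum_{x} p(x) \log_2 p(x),
\qquad
H(X) = -3 \times 0.33 \times \log_2(0.33) \approx 1.58 \text{ bits for three equally likely levels}
```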

44 Entropy of Random Variable. Measures the amount of surprise when observing the outcome of a random variable. [Two tables of Low/Mid/High proportions: a high-surprise distribution, and a low-surprise distribution with proportions 1, 0, 0]

45 Properties of the entropy. H(X) ≥ 0. Entropy is the smallest number of bits needed, on average, to represent a symbol (low, high, low, low, high, mid, mid, mid, ...). Maximized for the uniform distribution.
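A short numerical check of these properties, as a sketch: the function below assumes base-2 logarithms (bits) and treats 0 × log(0) as 0; the skewed distribution is a made-up example.

```python
import numpy as np

def entropy(proportions):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(proportions, float)
    p = p[p > 0]  # terms with zero probability contribute nothing
    return float(-np.sum(p * np.log2(p)))

h_uniform = entropy([1 / 3, 1 / 3, 1 / 3])  # high surprise
h_certain = entropy([1.0, 0.0, 0.0])        # no surprise at all
h_skewed = entropy([0.8, 0.1, 0.1])         # made-up, in between

print(h_uniform, h_skewed, h_certain)
```

The uniform distribution gives the largest value (log2 of the number of levels) and the fully certain one gives zero.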

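46 Joint Entropy. Two variables X, Y: X = (high, low, low, high, mid, ...), Y = (low, mid, high, mid, low, ...). Contingency table/joint distribution: [Table: rows X in {Low, Mid, High}, columns Y in {Low, Mid, High}]

47 Scatter plot to Contingency Table. [Figure: scatter plot converted to a 2 x 2 contingency table over Low/High, with cell proportions including 0.5 and 0]

48 Joint Entropy. [Table: rows X in {Low, Mid, High}, columns Y in {Low, Mid, High}]

49 Scatter plot to Contingency Table. [Table: 2 x 2 contingency table over Low/High] H(X,Y) = ?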

50 Scatter plot to Contingency Table. [Table: 2 x 2 contingency table over Low/High] H(X,Y) = -0.5 × log(0.5) × 2
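The joint entropy is the same entropy formula applied to the cells of the contingency table. The sketch below assumes base-2 logarithms and a 2 x 2 table with two cells of 0.5 and two empty cells; placing the 0.5 entries on the diagonal is an assumption, since the slide's table layout is not preserved.

```python
import numpy as np

def joint_entropy(table):
    """Entropy in bits of a joint distribution given as a contingency table."""
    p = np.asarray(table, float)
    p = p / p.sum()   # also accepts raw counts
    p = p[p > 0]      # empty cells contribute nothing
    return float(-np.sum(p * np.log2(p)))

# Assumed layout: rows are X (Low, High), columns are Y (Low, High)
table = [[0.5, 0.0],
         [0.0, 0.5]]
h_xy = joint_entropy(table)
print(h_xy)
```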

51 Mutual Information. 0 ≤ MI(X,Y) ≤ min(H(X), H(Y)). Normalized Mutual Information: NMI(X,Y) = MI(X,Y) / min(H(X), H(Y)). Range within [0, 1].

52 Properties of Mutual Information. The amount of information shared between two variables X and Y. MI(X,Y) large: X and Y are highly correlated (dependent).
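A minimal sketch of computing MI and NMI for two discretized variables is given below. It relies on the standard identity MI(X,Y) = H(X) + H(Y) - H(X,Y), which is assumed here because the slide's own formula is not preserved; the function names and the toy sequences are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """MI(X,Y) = H(X) + H(Y) - H(X,Y), with H(X,Y) taken over paired labels."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def normalized_mutual_information(x, y):
    """NMI in [0, 1]; undefined if either variable is constant (zero entropy)."""
    return mutual_information(x, y) / min(entropy(x), entropy(y))

# Toy discretized variables, made up for illustration
x = ["low", "low", "high", "high"]
y_dependent = ["high", "high", "low", "low"]    # fully determined by x
y_independent = ["low", "high", "low", "high"]  # carries no information about x

nmi_dep = normalized_mutual_information(x, y_dependent)
nmi_indep = normalized_mutual_information(x, y_independent)
print(nmi_dep, nmi_indep)
```

Note that `y_dependent` moves in the opposite direction to `x` and still scores the maximum, since mutual information measures dependence regardless of its shape or sign.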

53 Examples Pearson: NMI: 0.43 (3-bin equal frequency discretization)

54 Examples Pearson? NMI?

55 Examples Pearson: 0.08 NMI: 0.009

56 Other applications. Identifying variables that are highly correlated with a class variable. [Table: rows Person 1 ... Person m, columns Gene 1 ... Gene n plus a Cancer class label] Example: measure the correlation of each gene with the class labels, then rank genes according to correlation.

57 Which genes correlate with cancer? Genes Cancer Non cancer

58 Ranking Features for Building Machine Learning Models. [Table: rows Person 1 ... Person m, columns Gene 1 ... Gene n plus a Cancer class label] Cancer = f(gene 1, gene 2, ..., gene n). Use correlation to reduce the number of variables. Use relevant genes only: improving accuracy & performance.

59 Notes. Correlation ≠ causality. Google trend correlation.
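Ranking genes by their correlation with the class label can be sketched as follows. The absolute Pearson coefficient is used as the score here purely as an assumption (mutual information on discretized values would be an alternative); the function name, the 0/1 encoding of the cancer label and the toy data are all made up for illustration.

```python
import numpy as np

def rank_features(X, y):
    """Rank columns of X by |Pearson r| with y, most relevant first.

    A constant column would give an undefined (NaN) score.
    """
    X, y = np.asarray(X, float), np.asarray(y, float)
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    order = np.argsort(-scores)  # indices sorted by descending score
    return order, scores

# Toy data: 4 people x 3 genes, label 1 = cancer, 0 = non-cancer
X = [[5.1, 0.3, 2.0],
     [4.8, 0.9, 2.4],
     [1.2, 0.4, 1.9],
     [0.9, 0.8, 2.1]]
y = [1, 1, 0, 0]

order, scores = rank_features(X, y)
top_k = order[:2]  # keep only the two highest-ranked genes for the model
print(order, scores)
```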
