Discovering Correlation in Data Vinh Nguyen (vinh.nguyen@unimelb.edu.au) Research Fellow in Data Science Computing and Information Systems DMD 7.14

Discovering Correlation
- Why is correlation important? Discover relationships and causality. Feature ranking: select the best features for building better machine learning models.
- Measures of correlation: Euclidean distance, Pearson coefficient, Mutual Information.
- Case study: microarray data. What is it? How is it collected? How to build genetic networks from correlation? Which genes cause skin cancer?

What is Correlation? Correlation is used to detect pairs of variables that might have some relationship https://www.mathsisfun.com/data/correlation.html

What is Correlation? Visually identified via inspecting scatter plots https://www.mathsisfun.com/data/correlation.html

What is Correlation? Linear relations https://www.mathsisfun.com/data/correlation.html

Example of non-linear correlation: it gets so hot that people aren't going near the shop, and sales start dropping. https://www.mathsisfun.com/data/correlation.html

Example of Correlated Variables: can hint at potential causal relationships. Business decision: increase electricity production when temperature increases.

Example of Correlated Variables: correlation does not necessarily imply causality!

Example: Predicting Sales https://www.kaggle.com/c/rossmann-store-sales/data

Example: Predicting Sales. Other correlations: sales vs. holiday; sales vs. day of the week; sales vs. distance to competitors; sales vs. average income in the area.

Why is correlation important? Discover relationships: one step towards discovering causality, "A causes B" (example: smoking causes lung cancer). Feature ranking: select the best features for building better machine learning models.

Case study: Microarray data. DNA microarrays (gene chips) measure genes' levels of activity. https://en.wikipedia.org/wiki/Bio-MEMS

The Central Dogma of Molecular Biology: DNA makes RNA makes proteins. DNA contains multiple genes carrying the information to produce different types of proteins. Too much or too little protein of a certain type can cause disease. Gene chips measure the amount of mRNA (a proxy for protein level), i.e. a gene's activity level (expression level). http://www.atdbio.com/content/14/transcription-translation-and-replication

Microarray data. Each chip contains thousands of tiny probes corresponding to the genes (20k-30k genes in humans).

                 Gene 1   Gene 2   ...   Gene 20K
Activity level   0.3      1.2      ...   3.1

Microarray data from Multiple Conditions

              Gene 1   Gene 2   Gene 3   ...   Gene n
Condition 1   2.3      1.1      0.3      ...   2.1
Condition 2   3.2      0.2      1.2      ...   1.1
Condition 3   1.9      3.8      2.7      ...   0.2
...
Condition m   2.8      3.1      2.5      ...   3.4

Conditions: different time points for the same person, or different people. How can correlation help?

Correlation analysis on microarray data can reveal genes that exhibit similar patterns, and hence likely have similar or related functions: a way to discover the functions of unknown genes.

Build genetic networks. Genes do not act in isolation: they control each other or work together. A regulator gene A controls the activity level of gene B: a causal relationship.

Genetic network Connect genes with high correlation http://www.john.ranola.org/?page_id=116
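A minimal sketch of this construction (the expression values reuse the slide's illustrative table, and the 0.8 cutoff is an assumption, not a value from the lecture): compute the gene-by-gene Pearson correlation matrix and keep strongly correlated pairs as network edges.

```python
import numpy as np

# Toy expression matrix: rows are conditions, columns are genes.
# Values reuse the slide's illustrative table; the cutoff is an assumption.
expr = np.array([[2.3, 1.1, 0.3, 2.1],
                 [3.2, 0.2, 1.2, 1.1],
                 [1.9, 3.8, 2.7, 0.2],
                 [2.8, 3.1, 2.5, 3.4]])

corr = np.corrcoef(expr.T)          # gene-by-gene correlation matrix
threshold = 0.8                     # arbitrary cutoff for "high correlation"
n = corr.shape[0]
edges = [(i, j) for i in range(n) for j in range(i + 1, n)
         if abs(corr[i, j]) >= threshold]
print(edges)                        # each pair becomes an edge of the network
```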

Discover genes that are relevant to a disease http://compbio.pbworks.com/w/page/16252903/microarray

Correlation and Feature Ranking
- Why is correlation important? Discover relationships and causality; select the best features for building better machine learning models.
- Measures of correlation: Euclidean distance, Pearson coefficient, Mutual Information.
- Case study: microarray data. What is it? How is it collected? How to build genetic networks from correlation? Which genes cause skin cancer?

Measures of correlation: Euclidean distance, Pearson coefficient, Mutual Information.

Notation

           Gene 1   Gene 2   Gene 3   ...   Gene n
Person 1   2.3      1.1      0.3      ...   2.1
Person 2   3.2      0.2      1.2      ...   1.1
Person 3   1.9      3.8      2.7      ...   0.2
...
Person m   2.8      3.1      2.5      ...   3.4

Euclidean distance: d(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xm - ym)^2 )

Drawbacks of Euclidean distance. Variables can be measured on very different scales:

               Day 1    Day 2    Day 3   ...   Day m
Temperature    20       22       16      ...   33
#Ice-creams    6022     5523     4508    ...   7808
#Electricity   12034    10532    8890    ...   15408

d(temp, ice-cr) = 540324; d(temp, elect) = 12309388. The Euclidean distance does not give a clear intuition about how well variables are correlated.
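To see the scale problem concretely, here is a sketch using only the four days shown above (so the distances differ from the slide's figures):

```python
import numpy as np

temp     = np.array([20, 22, 16, 33])
icecream = np.array([6022, 5523, 4508, 7808])
elect    = np.array([12034, 10532, 8890, 15408])

def euclidean(x, y):
    # Standard Euclidean distance between two equal-length vectors.
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean(temp, icecream))  # huge, driven purely by the units
print(euclidean(temp, elect))     # even bigger; says nothing about correlation
```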

Drawbacks of Euclidean distance: cannot discover variables with similar behaviours/dynamics but at different scales.

Drawbacks of Euclidean distance Cannot discover variables with similar behaviours/dynamics but in the opposite direction (negative correlation)

Pearson's correlation coefficient:

r = sum_i (x_i - x̄)(y_i - ȳ) / ( sqrt(sum_i (x_i - x̄)^2) * sqrt(sum_i (y_i - ȳ)^2) )

where x̄ and ȳ are the sample means. Range within [-1, 1]: 1 for perfect positive linear correlation; -1 for perfect negative linear correlation; 0 means no linear correlation. The absolute value |r| indicates the strength of the linear correlation.

Pearson's correlation coefficient. Ice Cream Sales vs Temperature:

Temperature (C)   Ice Cream Sales
14                $215
16                $325
11                $185

Compute the Pearson correlation between temperature and ice-cream sales: r = 0.9074.
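The worked answer can be checked directly from the three pairs shown on the slide:

```python
import numpy as np

temp  = np.array([14, 16, 11])
sales = np.array([215, 325, 185])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(temp, sales)[0, 1]
print(round(r, 4))  # 0.9074
```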

Pearson's correlation coefficient. What is the Pearson coefficient in this case (without computation)?

Temperature (C)   Ice Cream Sales
14                $215
16                $325

(Hint: any two points lie exactly on a straight line.)

Examples: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

Properties of Pearson's correlation:
- Range within [-1, 1].
- Scale invariant: r(x, y) = r(x, Ky), K a real positive constant.
- Location invariant: r(x, y) = r(x, C + y), C a real constant.
- Can only detect linear relationships, y = ax + b + noise; cannot detect non-linear relationships such as y = sin(x) + noise.
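A quick sketch of the last property on synthetic data (assumed for illustration): a sine relationship over many full periods has near-zero Pearson correlation despite being an essentially deterministic dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 20 * np.pi, 2000)            # many full periods
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
print(np.corrcoef(x, y)[0, 1])                  # ~ -0.08: Pearson barely sees it
```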

Introduction to Mutual Information. Can detect non-linear relationships. Works with discrete variables; continuous variables are discretized into bins:

                   Gene 1   Gene 2   ...   Gene 20K
Activity level     1.2      0.2      ...   3.1
Discretized level  Mid      Low      ...   High

(<1: low; [1, 3]: mid; >3: high)

Variable discretization. Domain knowledge: assign thresholds manually (e.g. speed: 0-40 slow, 40-70 mid, >70 high). Equal-width bins: bins have the same length between min and max. Equal-frequency bins: bins have the same number of points.

Equal-frequency binning: sort the values in increasing order and take the cut points at percentiles. 2 bins: the cut point is the median; 3 bins: the 33.3% and 66.7% percentiles. A sketch of both schemes is given below.
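A minimal sketch of both schemes for 3 bins, on synthetic data (assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)

# Equal-width: 3 bins of the same length between min and max.
width_edges = np.linspace(x.min(), x.max(), 4)   # 4 edges -> 3 bins
width_bins = np.digitize(x, width_edges[1:-1])   # labels 0, 1, 2

# Equal-frequency: cut points at the 33.3% and 66.7% percentiles.
freq_edges = np.percentile(x, [33.3, 66.7])
freq_bins = np.digitize(x, freq_edges)

print(np.bincount(width_bins))  # unbalanced counts for normal data
print(np.bincount(freq_bins))   # roughly 333 points per bin
```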

Variable discretization

                   Gene 1   Gene 2   ...   Gene 20K
Activity level     1.2      0.2      ...   3.1
Discretized level  Mid      Low      ...   High

Value distribution of the discretized variable:

             Low    Mid    High
Proportion   0.33   0.33   0.33

Entropy of a Random Variable: a measure of information content, H(X) = -sum_x p(x) log2 p(x). H(X) = ?

             Low    Mid    High
Proportion   0.33   0.33   0.33

Entropy of a Random Variable: a measure of information content. For the distribution above, H(X) = -3 x 0.33 x log2(0.33) ≈ 1.58 bits.

Entropy of a Random Variable: measures the amount of surprise when observing the outcome of a random variable. High surprise: proportions (Low, Mid, High) = (0.33, 0.33, 0.33). Low surprise: proportions (Low, Mid, High) = (1, 0, 0).

Properties of the entropy: H(X) >= 0. The entropy is the smallest number of bits needed, on average, to represent a symbol of the sequence (low, high, low, low, high, mid, mid, mid, ...). It is maximized by the uniform distribution. A minimal sketch follows.
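A minimal sketch of the definition, in bits (treating 0·log 0 as 0):

```python
import numpy as np

def entropy(p):
    # H(X) = -sum p log2 p over the outcomes with non-zero probability.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([1/3, 1/3, 1/3]))  # ~1.585 bits: uniform = maximal entropy
print(entropy([1, 0, 0]))        # 0.0 bits: no surprise at all
```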

Joint Entropy. Two variables X, Y: X = (high, low, low, high, mid, ...), Y = (low, mid, high, mid, low, ...). Joint entropy H(X,Y) = -sum_{x,y} p(x,y) log2 p(x,y). Contingency table / joint distribution:

X \ Y   Low    Mid    High
Low     0.1    0.2    0.2
Mid     0.2    0.1    0.05
High    0.03   0.03   0.08

Scatter plot to Contingency Table

X \ Y   Low    High
Low     0      0.5
High    0.5    0

H(X,Y) = ?

H(X,Y) = -2 x 0.5 x log2(0.5) = 1 bit
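The worked answer can be reproduced directly from the 2x2 table:

```python
import numpy as np

# Joint distribution from the slide's 2x2 contingency table.
joint = np.array([[0.0, 0.5],
                  [0.5, 0.0]])
p = joint[joint > 0]                # drop zero cells (0 log 0 = 0)
print(-np.sum(p * np.log2(p)))      # 1.0 bit, matching H(X,Y) above
```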

Mutual Information: MI(X,Y) = H(X) + H(Y) - H(X,Y), with 0 <= MI(X,Y) <= min(H(X), H(Y)). Normalized Mutual Information: NMI(X,Y) = MI(X,Y) / min(H(X), H(Y)), range within [0, 1].

Properties of Mutual Information: the amount of information shared between two variables X and Y. A large MI(X,Y) means X and Y are highly correlated (dependent). A sketch of the computation follows.
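A sketch of the computation via the identity MI(X,Y) = H(X) + H(Y) - H(X,Y), applied to the 3x3 joint table from the joint-entropy slide (its entries sum to 0.99, so it is renormalized here):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

joint = np.array([[0.10, 0.20, 0.20],
                  [0.20, 0.10, 0.05],
                  [0.03, 0.03, 0.08]])
joint = joint / joint.sum()        # slide entries sum to 0.99; renormalize

hx = entropy(joint.sum(axis=1))    # H(X) from the row marginals
hy = entropy(joint.sum(axis=0))    # H(Y) from the column marginals
mi = hx + hy - entropy(joint)      # MI(X,Y) = H(X) + H(Y) - H(X,Y)
print(mi, mi / min(hx, hy))        # MI and its normalized version NMI
```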

Examples Pearson: -0.0864 NMI: 0.43 (3-bin equal frequency discretization)

Examples Pearson? NMI?

Examples Pearson: 0.08 NMI: 0.009
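The slide's exact scatter plots are not available, so here is a sketch of the same contrast on synthetic data: a noiseless parabola (Pearson ~ 0 but clearly dependent, so NMI is high) versus independent noise (both measures near 0).

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def nmi(x, y):
    # 3-bin equal-frequency discretization, as used on the slide.
    qx = np.digitize(x, np.percentile(x, [33.3, 66.7]))
    qy = np.digitize(y, np.percentile(y, [33.3, 66.7]))
    joint = np.histogram2d(qx, qy, bins=3)[0] / x.size
    hx, hy = entropy(joint.sum(axis=1)), entropy(joint.sum(axis=0))
    return (hx + hy - entropy(joint)) / min(hx, hy)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
z = rng.uniform(-1, 1, 2000)                      # independent of x
print(np.corrcoef(x, x**2)[0, 1], nmi(x, x**2))   # ~0 vs ~0.6
print(np.corrcoef(x, z)[0, 1], nmi(x, z))         # both ~0
```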

Other applications: identifying variables that are highly correlated with a class variable.

           Gene 1   Gene 2   Gene 3   ...   Gene n   Cancer
Person 1   2.3      1.1      0.3      ...   2.1      1
Person 2   3.2      0.2      1.2      ...   1.1      1
Person 3   1.9      3.8      2.7      ...   0.2      0
...
Person m   2.8      3.1      2.5      ...   3.4      0

Example: measure the correlation of each gene with the class labels, then rank the genes according to correlation.

Which genes correlate with cancer? [Heatmap: gene expression levels across cancer vs. non-cancer samples]

Ranking Features for Building Machine Learning Models

           Gene 1   Gene 2   Gene 3   ...   Gene n   Cancer
Person 1   2.3      1.1      0.3      ...   2.1      1
Person 2   3.2      0.2      1.2      ...   1.1      1
Person 3   1.9      3.8      2.7      ...   0.2      0
...
Person m   2.8      3.1      2.5      ...   3.4      0

Cancer = f(gene 1, gene 2, ..., gene n). Use correlation to reduce the number of variables; using relevant genes only improves accuracy and performance. A ranking sketch is given below.
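A minimal ranking sketch (using the slide's four illustrative rows; with so few samples the result is for illustration only):

```python
import numpy as np

X = np.array([[2.3, 1.1, 0.3, 2.1],    # persons x genes
              [3.2, 0.2, 1.2, 1.1],
              [1.9, 3.8, 2.7, 0.2],
              [2.8, 3.1, 2.5, 3.4]])
y = np.array([1, 1, 0, 0])             # cancer / non-cancer labels

# Score each gene by |Pearson correlation| with the class label.
scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
ranking = np.argsort(scores)[::-1]
print(ranking)                          # keep the top-ranked genes as features
```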

Notes: Correlation ≠ Causality! See http://tylervigen.com/spurious-correlations and Google Trends correlations.