Information & Correlation

Information & Correlation Jilles Vreeken 11 June 2014 (TADA)

Questions of the day What is information? How can we measure correlation? and what do talking drums have to do with this?

Bits and Pieces What is information? a bit, entropy, information theory, compression

Information Theory Branch of science concerned with measuring information. The field was founded by Claude Shannon in 1948 with "A Mathematical Theory of Communication". Information theory is essentially about uncertainty in communication: not what you say, but what you could say

The Big Insight Communication is a series of discrete messages. Each message reduces the uncertainty of the recipient about a) the series and b) that message. By how much? That is the amount of information

Uncertainty Shannon showed that uncertainty can be quantified, linking physical entropy to messages. Shannon defined the entropy of a discrete random variable $X$ as $H(X) = -\sum_i P(x_i) \log P(x_i)$
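
To make the definition concrete, here is a minimal Python sketch (not from the slides) of the entropy formula above; the two example distributions are made up.

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum_i P(x_i) * log2 P(x_i), in bits.
    Zero-probability terms contribute nothing (0 * log 0 := 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries 1 bit of uncertainty; a biased one carries less.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # ~0.469
```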

Optimal prefix-codes Shannon showed that uncertainty can be quantified, linking physical entropy to messages. A side result of Shannon entropy is that $-\log_2 P(x_i)$ gives the length in bits of the optimal prefix code for a message $x_i$

What is a prefix code? Prefix(-free) code: a code $C$ where no code word $c \in C$ is the prefix of another $d \in C$ with $c \neq d$. Essentially, a prefix code defines a tree, where each code word corresponds to a path from the root to a leaf, as in a decision tree
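
As an illustration (not in the original deck), a tiny Python check of the prefix-free property; the code table over {a, b, c, d} is a made-up example.

```python
def is_prefix_free(code):
    """Check that no codeword is a prefix of another (a prefix code).
    After lexicographic sorting, it suffices to compare adjacent words."""
    words = sorted(code.values())
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

# Reading bits left to right always identifies a unique leaf,
# so messages need no separators.
code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(is_prefix_free(code))                      # True
print(is_prefix_free({'a': '0', 'b': '01'}))     # False: '0' prefixes '01'
```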

What's a bit? Binary digit: the smallest and most fundamental piece of information, a yes or no. Invented by Claude Shannon in 1948, named by John Tukey. Bits have been in use for a long, long time, though: punch cards (1725, 1804), Morse code (1844), African talking drums

Morse code

Natural language Punishes bad redundancy: often-used words are shorter. Rewards useful redundancy: cotxent alolws mishaireng/raeding. African talking drums have used this for efficient, fast, long-distance communication: they mimic vocalized sounds (a tonal language), a very reliable means of communication

Measuring bits How much information does a given string carry? How many bits? Say we have a binary string of 100000 messages 1) 00010001000100010001 000100010001000100010001000100010001 2) 01110100110100100110 101011101011101100010110001011011100 3) 00011000001010100000 001000100001000000100011000000100110 4) 0000000000000000000000000000100000000000000000000 0000000 Obviously, they are 100000 bits long. But are they worth those 100000 bits?

So, how many bits? Depends on the encoding! What is the best encoding? One that takes the entropy of the data into account: things that occur often should get a short code, things that occur seldom should get a long code. An encoding matching Shannon entropy is optimal

Tell us! How many bits? Please? In our simplest example we have $P(1) = 1/100000$ and $P(0) = 99999/100000$, so $code(1) = -\log_2(1/100000) = 16.61$ bits and $code(0) = -\log_2(99999/100000) = 0.0000144$ bits. So, knowing $P$, our string contains $1 \times 16.61 + 99999 \times 0.0000144 \approx 18.05$ bits of information
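
A quick check of the arithmetic above in Python, assuming the 100,000-symbol string with a single 1:

```python
import math

n, ones = 100_000, 1
p1, p0 = ones / n, (n - ones) / n

bits_per_1 = -math.log2(p1)           # ~16.61 bits
bits_per_0 = -math.log2(p0)           # ~0.0000144 bits

total = ones * bits_per_1 + (n - ones) * bits_per_0
print(bits_per_1, bits_per_0, total)  # ~16.61, ~1.44e-05, ~18.05 bits
```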

Optimal. Shannon lets us calculate optimal code lengths, but what about actual codes? 0.0000144 bits? Shannon and Fano invented a near-optimal encoding in 1948: within one bit of the entropy, but not of minimal expected length. Fano gave his students an option: a regular exam, or invent a better encoding. David Huffman didn't like exams; he invented Huffman codes (1952), optimal for symbol-by-symbol encoding with fixed probabilities (arithmetic coding is overall optimal, Rissanen 1976)
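
The slides show no code; as a sketch, here is a compact Huffman construction in Python using a heap of partial code tables. The example string and the exact codewords (ties may break differently) are illustrative only.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a Huffman code for the symbol frequencies in `text`.
    Returns a dict: symbol -> bit string (a prefix-free code)."""
    heap = [[freq, i, {sym: ''}] for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate single-symbol case
        return {sym: '0' for sym in heap[0][2]}
    while len(heap) > 1:
        lo = heapq.heappop(heap)            # lightest subtree
        hi = heapq.heappop(heap)            # second lightest
        # prepend 0 to every codeword in the lighter subtree, 1 to the other
        lo_codes = {s: '0' + c for s, c in lo[2].items()}
        hi_codes = {s: '1' + c for s, c in hi[2].items()}
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], {**lo_codes, **hi_codes}])
    return heap[0][2]

code = huffman_code("abracadabra")
print(code)   # e.g. {'a': '0', 'c': '100', 'd': '101', 'b': '110', 'r': '111'}
```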

Optimality To encode optimally, we need the true probabilities. What happens if we don't have them? The Kullback-Leibler divergence, $D(p \| q)$, measures the bits we waste when we encode with $q$ while $p$ is the true distribution: $D(p \| q) = \sum_i p(i) \log \frac{p(i)}{q(i)}$
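
A minimal sketch of the KL divergence formula above; the distributions p and q are made-up examples.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_i p(i) * log2(p(i) / q(i)), in bits: the expected
    number of extra bits paid per symbol when the code is built for q
    while the data actually follows p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]               # true distribution
q = [0.9, 0.1]               # distribution the code was optimized for
print(kl_divergence(p, p))   # 0.0: no waste with the right code
print(kl_divergence(p, q))   # ~0.74 bits wasted per symbol
```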

Multivariate Entropy So far we've been thinking about a single sequence of messages. How does entropy work for multivariate data? Simple!

Conditional Entropy Entropy, for when we, like, know stuff: $H(Y \mid X) = \sum_{x \in X} p(x)\, H(Y \mid X = x)$ When is this useful?

Mutual Information and Correlation Mutual information: the amount of information shared between two variables $X$ and $Y$. $I(X,Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$ High mutual information: high correlation, low independence
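
A small worked example (not from the slides) computing conditional entropy and mutual information from a made-up joint distribution; it uses the standard chain rule H(X,Y) = H(Y) + H(X|Y), which the slides do not state explicitly.

```python
import math

def H(probs):
    """Entropy in bits of a distribution given as its probability values."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A small made-up joint distribution p(x, y).
joint = {('sun', 'dry'): 0.4, ('sun', 'wet'): 0.1,
         ('rain', 'dry'): 0.1, ('rain', 'wet'): 0.4}

# Marginals p(x) and p(y).
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

H_X, H_Y, H_XY = H(px.values()), H(py.values()), H(joint.values())
H_X_given_Y = H_XY - H_Y           # chain rule: H(X,Y) = H(Y) + H(X|Y)
I_XY = H_X - H_X_given_Y           # mutual information I(X,Y)
print(H_X, H_X_given_Y, I_XY)      # 1.0, ~0.72, ~0.28 bits
```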

Information Gain (small aside) Entropy and KL divergence are used in decision trees. What is the best split in a tree? One that results in label distributions in the sub-nodes that are as homogeneous as possible: minimal entropy. How do we compare multiple options? $IG(T, a) = H(T) - H(T \mid a)$
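
A short sketch of information gain for a single split, on made-up toy data:

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(T, a) = H(T) - H(T | a): entropy of the labels minus the
    size-weighted entropy of the labels within each branch of the split."""
    n = len(labels)
    branches = {}
    for lab, val in zip(labels, attribute_values):
        branches.setdefault(val, []).append(lab)
    h_cond = sum(len(b) / n * entropy_of_labels(b) for b in branches.values())
    return entropy_of_labels(labels) - h_cond

# Toy data: does 'windy' help predict 'play'?
play  = ['yes', 'yes', 'no', 'no', 'yes', 'no']
windy = [False,  False, True, True, False, True]
print(information_gain(play, windy))   # 1.0 bit: this split separates the labels perfectly
```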

Low-Entropy Sets (Heikinheimo et al. 2007)

Theory of Computation | Probability Theory 1 | count
No                    | No                   | 1887
Yes                   | No                   |  156
No                    | Yes                  |  143
Yes                   | Yes                  |  219

Low-Entropy Sets (Heikinheimo et al. 2007)

Maturity Test | Software Engineering | Theory of Computation | count
No            | No                   | No                    | 1570
Yes           | No                   | No                    |   79
No            | Yes                  | No                    |   99
Yes           | Yes                  | No                    |  282
No            | No                   | Yes                   |   28
Yes           | No                   | Yes                   |  164
No            | Yes                  | Yes                   |   13
Yes           | Yes                  | Yes                   |  170

Low-Entropy Trees [figure: an example low-entropy tree over the courses Scientific Writing, Maturity Test, Software Engineering Project, Theory of Computation, and Probability Theory 1] (Heikinheimo et al. 2007)

Entropy for Continuous-valued Data So far we only considered discrete-valued data. Lots of data is continuous-valued (or is it?). What does this mean for entropy?

Differential Entropy $h(X) = -\int_X f(x) \log f(x)\, dx$ (Shannon, 1948)

Differential Entropy How about the entropy of Uniform(0, 1/2)? $h(X) = -\int_0^{1/2} 2 \log 2 \, dx = -\log 2$ Hm, negative?

Differential Entropy In discrete data the step size dx is trivial. What is its effect here, in $h(X) = -\int_X f(x) \log f(x)\, dx$? (Shannon, 1948)
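
One way to see the effect of the step size, as a rough illustration (not from the slides): discretize Uniform(0, 1/2) with bin width dx; the discrete entropy of the bins behaves like h(X) - log2(dx), so it depends on the chosen resolution.

```python
import math
import random

# Discretize Uniform(0, 1/2) with bin width dx and compute the discrete
# entropy of the bins. It tracks h(X) - log2(dx); here h(X) = -1 bit.
random.seed(1)
samples = [random.uniform(0, 0.5) for _ in range(200_000)]
for dx in (0.25, 0.025, 0.0025):
    counts = {}
    for s in samples:
        b = int(s / dx)
        counts[b] = counts.get(b, 0) + 1
    n = len(samples)
    H = -sum(c / n * math.log2(c / n) for c in counts.values())
    print(dx, round(H, 2), round(-1 - math.log2(dx), 2))  # dx, measured H, predicted h(X) - log2(dx)
```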

Cumulative Distributions

Cumulative Entropy We can define entropy for cumulative distribution functions! $h_{CE}(X) = -\int_{dom(X)} P(X \le x) \log P(X \le x)\, dx$ As $0 \le P(X \le x) \le 1$ we obtain $h_{CE}(X) \ge 0$ (!) (Rao et al, 2004, 2005)

Cumulative Entropy How do we compute it in practice? Easy. Let $X_1 \le \ldots \le X_n$ be the sorted i.i.d. random samples of a continuous random variable $X$: $h_{CE}(X) = -\sum_{i=1}^{n-1} (X_{i+1} - X_i)\, \frac{i}{n} \log \frac{i}{n}$ (Rao et al, 2004, 2005)
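
A minimal sketch of this estimator in Python (natural log here; the base only rescales the value). The Uniform(0,1) test case and its closed-form value 1/4 are an added illustration.

```python
import math
import random

def cumulative_entropy(samples):
    """Empirical cumulative entropy of a 1-d sample:
    h_CE(X) = -sum_{i=1}^{n-1} (X_(i+1) - X_(i)) * (i/n) * log(i/n),
    computed from the sorted sample, where i/n is the empirical CDF."""
    xs, n = sorted(samples), len(samples)
    return -sum((xs[i] - xs[i - 1]) * (i / n) * math.log(i / n)
                for i in range(1, n))

random.seed(0)
print(cumulative_entropy([random.uniform(0, 1) for _ in range(10_000)]))
# ~0.25 for Uniform(0,1) (closed form: 1/4), and always >= 0
```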

Multivariate Cumulative Entropy? Tricky. Very tricky. Too tricky for now. (Nguyen et al, 2013, 2014)

Cumulative Mutual Information Given continuous-valued data over a set of attributes $X$, we want to identify $Y \subseteq X$ such that $Y$ has high mutual information. Can we do this with cumulative entropy?

Identifying Interacting Subspaces

Multivariate Cumulative Entropy First things first. We need $h_{CE}(X \mid Y) = \int h_{CE}(X \mid y)\, p(y)\, dy$ which, in practice, means $h_{CE}(X \mid Y) = \sum_{y \in Y} h_{CE}(X \mid y)\, p(y)$ with $y$ just groups (bins) of data points, and $p(y) = |y|/n$. How do we choose $y$? Such that $h_{CE}(X \mid Y)$ is minimal
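
A rough sketch, under the assumption that the bins of Y are given (here a simple median split on made-up data); finding the binning that makes h_CE(X | Y) minimal is exactly the part the CMI method has to solve.

```python
import math
import random
import statistics

def cumulative_entropy(samples):
    # as in the previous sketch
    xs, n = sorted(samples), len(samples)
    return -sum((xs[i] - xs[i - 1]) * (i / n) * math.log(i / n) for i in range(1, n))

def conditional_cumulative_entropy(x, bins):
    """h_CE(X | Y) ~ sum over bins b of p(b) * h_CE(X restricted to b),
    with p(b) = |b|/n and `bins` a partition of the row indices by Y."""
    n = len(x)
    return sum(len(b) / n * cumulative_entropy([x[i] for i in b]) for b in bins)

# Toy usage (made-up data): bin Y by a median split.
random.seed(2)
xs = [random.gauss(0, 1) for _ in range(1000)]
ys = [xi + random.gauss(0, 0.5) for xi in xs]        # Y correlated with X
med = statistics.median(ys)
bins = [[i for i, v in enumerate(ys) if v <= med],
        [i for i, v in enumerate(ys) if v > med]]
print(cumulative_entropy(xs), conditional_cumulative_entropy(xs, bins))
# here, conditioning on the correlated Y lowers the cumulative entropy of X
```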

Entrez, CMI We cannot (realistically) calculate $h_{CE}(X_1, \ldots, X_m)$ in one go, but mutual information has this nice factorization property, so what we can do is $\sum_{i=2}^{m} h_{CE}(X_i) - \sum_{i=2}^{m} h_{CE}(X_i \mid X_1, \ldots, X_{i-1})$

The CMI algorithm super simple: Apriori-style

CMI in action

Conclusions Information is about the uncertainty of what you could say. Entropy is a core aspect of information theory, with lots of nice properties: optimal prefix-code lengths, mutual information, etc. Entropy for continuous data is trickier: differential entropy is a bit problematic; cumulative distributions provide a way out, but are mostly uncharted territory

Thank you!