Information Theory, Statistics, and Decision Trees

Information Theory, Statistics, and Decision Trees Léon Bottou COS 424 4/6/2010

Summary 1. Basic information theory. 2. Decision trees. 3. Information theory and statistics. Léon Bottou 2/31 COS 424 4/6/2010

I. Basic Information theory Léon Bottou 3/31 COS 424 4/6/2010

Why do we care?

Information theory
- Invented by Claude Shannon in 1948: "A Mathematical Theory of Communication", Bell System Technical Journal, October 1948.
- The quantity of information, measured in bits.
- The capacity of a transmission channel.
- Data coding and data compression.

Information gain
- A derived concept.
- Quantify how much information we acquire about a phenomenon.
- A justification for the Kullback-Leibler divergence.

Léon Bottou 4/31 COS 424 4/6/2010

The coding paradigm

Intuition
- The quantity of information of a message is the length of the smallest code that can represent the message.

Paradigm
- Assume there are $n$ possible messages $i = 1 \dots n$.
- We want a signal that indicates the occurrence of one of them.
- We can transmit an alphabet of $r$ symbols. For instance a wire could carry $r = 2$ electrical levels.
- The code for message $i$ is a sequence of $l_i$ symbols.

Properties
- Codes should be uniquely decodable.
- Average code length for a message: $\sum_{i=1}^{n} p_i l_i$.

Léon Bottou 5/31 COS 424 4/6/2010

Prefix codes

- Messages 1 and 2 have codes one symbol long ($l_i = 1$).
- Messages 3 and 4 have codes two symbols long ($l_i = 2$).
- Messages 5 and 6 have codes three symbols long ($l_i = 3$).
- There is an unused three-symbol code. That's inefficient.

Properties
- Prefix codes are uniquely decodable.
- There are trickier kinds of uniquely decodable codes that are not prefix codes, e.g. $a \to 0$, $b \to 01$, $c \to 011$, versus the prefix code $a \to 0$, $b \to 10$, $c \to 110$.

Léon Bottou 6/31 COS 424 4/6/2010

Kraft inequality

Uniquely decodable codes satisfy
$\sum_{i=1}^{n} \left(\frac{1}{r}\right)^{l_i} \le 1$

- All uniquely decodable codes satisfy this inequality.
- If integer code lengths $l_i$ satisfy this inequality, there exists a prefix code with such code lengths.

Consequences
- If some messages have short codes, others must have long codes.
- To minimize the average code length:
  - give short codes to high probability messages.
  - give long codes to low probability messages.
- Equiprobable messages should have similar code lengths.

Léon Bottou 7/31 COS 424 4/6/2010
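A minimal Python sketch of the inequality check (the code lengths and alphabet size below are made-up examples, not from the slides):

    def kraft_sum(lengths, r=2):
        # Sum of (1/r)^l over the codeword lengths.
        return sum(r ** (-l) for l in lengths)

    lengths = [1, 2, 3, 3]            # hypothetical codeword lengths, binary alphabet
    s = kraft_sum(lengths, r=2)
    print(s)                          # 1.0
    print(s <= 1.0)                   # True: a prefix code with these lengths exists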

Kraft inequality for prefix codes

Prefix codes satisfy the Kraft inequality
- In the full $r$-ary tree of depth $l = \max_i l_i$, a codeword of length $l_i$ accounts for $r^{l - l_i}$ leaves, and distinct codewords of a prefix code account for disjoint sets of leaves. Therefore
$\sum_i r^{l - l_i} \le r^l$, that is $\sum_i \left(\frac{1}{r}\right)^{l_i} \le 1$.

All uniquely decodable codes satisfy the Kraft inequality
- The proof must deal with infinite sequences of messages.

Given integer code lengths $l_i$ satisfying the inequality:
- Build a balanced $r$-ary tree of depth $l = \max_i l_i$.
- For each message, prune one subtree at depth $l_i$.
- The Kraft inequality ensures that there will be enough branches left to define a code for each message.

Léon Bottou 8/31 COS 424 4/6/2010
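A minimal Python sketch of this construction, loosely following the prune-a-subtree idea (the function name and example lengths are mine; it assumes the lengths already satisfy the Kraft inequality):

    def prefix_code_from_lengths(lengths, r=2):
        assert sum(r ** (-l) for l in lengths) <= 1.0 + 1e-12   # Kraft inequality
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        codes = [None] * len(lengths)
        value, prev = 0, 0
        for i in order:
            l = lengths[i]
            value *= r ** (l - prev)          # descend to depth l in the r-ary tree
            digits, v = [], value
            for _ in range(l):                # write `value` as l base-r digits
                digits.append(str(v % r))
                v //= r
            codes[i] = "".join(reversed(digits))
            value += 1                        # skip the subtree claimed by this codeword
            prev = l
        return codes

    print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']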

Redundant codes

Assume $\sum_i r^{-l_i} < 1$:
- There are leftover branches in the tree.
- There are codes that are not used, or there are multiple codes for each message.

For best compression, $\sum_i r^{-l_i} = 1$.
- This is not always possible with integer code lengths $l_i$.
- But we can use this to compute a lower bound.

Léon Bottou 9/31 COS 424 4/6/2010

Lower bound for the average code length

Choose code lengths $l_i$ such that
$\min_{l_1 \dots l_n} \sum_i p_i l_i$  subject to  $\sum_i r^{-l_i} = 1$, $l_i > 0$.

Define $s_i = r^{-l_i}$, that is, $l_i = -\log_r(s_i)$.
Maximize $C = \sum_i p_i \log_r(s_i)$ subject to $\sum_i s_i = 1$.
We get $\frac{\partial C}{\partial s_i} = \frac{p_i}{s_i \log(r)} = \text{Constant}$, that is $s_i \propto p_i$.
Replacing in the constraint gives $s_i = p_i$.
Therefore $l_i = -\log_r(p_i)$ and $\sum_i p_i l_i = -\sum_i p_i \log_r(p_i)$.

Fractional code lengths
- What does it mean to code a message on 0.5 symbols?

Léon Bottou 10/31 COS 424 4/6/2010
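A quick numerical illustration in Python (the distribution is a made-up example, with $r = 2$): the optimal fractional lengths $l_i = -\log_2(p_i)$ reach the bound exactly, while rounding up to integer lengths costs a little more.

    import math

    p = [0.5, 0.3, 0.2]                                   # hypothetical message probabilities
    entropy = -sum(pi * math.log2(pi) for pi in p)        # lower bound, ~1.485 bits
    frac = [-math.log2(pi) for pi in p]                   # optimal fractional lengths
    ints = [math.ceil(l) for l in frac]                   # integer lengths, still satisfy Kraft
    print(sum(pi * l for pi, l in zip(p, frac)))          # ~1.485, equals the entropy
    print(sum(pi * l for pi, l in zip(p, ints)))          # 1.7, slightly worse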

Arithmetic coding

- An infinite sequence of messages $i_1, i_2, \dots$ can be viewed as a number $x = 0.i_1 i_2 i_3 \dots$ in base $n$.
- An infinite sequence of symbols $c_1, c_2, \dots$ can be viewed as a number $y = 0.c_1 c_2 c_3 \dots$ in base $r$.

Léon Bottou 11/31 COS 424 4/6/2010

Arithmetic coding

To encode a sequence of $L$ messages $i_1, \dots, i_L$:
- The code $y$ must belong to an interval of size $\prod_{k=1}^{L} p_{i_k}$.
- It is sufficient to specify $l(i_1 i_2 \dots i_L) = \left\lceil -\sum_{k=1}^{L} \log_r(p_{i_k}) \right\rceil$ digits of $y$.

Léon Bottou 12/31 COS 424 4/6/2010

Arithmetic coding

To encode a sequence of $L$ messages $i_1, \dots, i_L$, it is sufficient to specify $l(i_1 i_2 \dots i_L) = \left\lceil -\sum_{k=1}^{L} \log_r(p_{i_k}) \right\rceil$ digits of $y$.

The average code length per message is
$\frac{1}{L} \sum_{i_1 i_2 \dots i_L} p_{i_1} \cdots p_{i_L} \, l(i_1 i_2 \dots i_L) \approx -\frac{1}{L} \sum_{k=1}^{L} \sum_{i_1 \dots i_L} p_{i_1} \cdots p_{i_L} \log_r(p_{i_k})$
$= -\frac{1}{L} \sum_{k=1}^{L} \sum_{i_k} \Big( \sum_{i_1 \dots i_L \setminus i_k} \prod_{h \ne k} p_{i_h} \Big) p_{i_k} \log_r(p_{i_k}) = -\frac{1}{L} \sum_{k=1}^{L} \sum_{i_k=1}^{n} p_{i_k} \log_r(p_{i_k}) = -\sum_i p_i \log_r(p_i)$

Arithmetic coding reaches the lower bound when $L \to \infty$.

Léon Bottou 13/31 COS 424 4/6/2010
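A toy Python sketch of the interval view (the source distribution and the message are my own examples): the interval for a message sequence has size equal to the product of the probabilities, so roughly $-\sum_k \log_2(p_{i_k})$ binary digits suffice to point inside it.

    import math

    p = {"a": 0.5, "b": 0.3, "c": 0.2}          # hypothetical source distribution
    cum, lo = {}, 0.0
    for sym, prob in p.items():                 # sub-interval of [0, 1) for each symbol
        cum[sym] = (lo, lo + prob)
        lo += prob

    def interval(message):
        lo, hi = 0.0, 1.0
        for sym in message:                     # narrow the interval message by message
            a, b = cum[sym]
            lo, hi = lo + (hi - lo) * a, lo + (hi - lo) * b
        return lo, hi

    lo, hi = interval("abac")
    size = hi - lo                              # ~0.015 = 0.5 * 0.3 * 0.5 * 0.2
    print(size, math.ceil(-math.log2(size)))    # ~0.015, 7 digits; -log2(size) ~ 6.06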

Quantity of information

- Optimal code length: $l_i = -\log_r(p_i)$.
- Optimal expected code length: $\sum_i p_i l_i = -\sum_i p_i \log_r(p_i)$.

Receiving a message $x$ with probability $p_x$:
- The acquired information is $h(x) = -\log_2(p_x)$ bits.
- An informative message is a surprising message!

Expecting a message $X$ with distribution $p_1 \dots p_n$:
- The expected information is $H(X) = -\sum_{x \in X} p_x \log_2(p_x)$ bits.
- This is also called entropy.

These are two distinct definitions!

Note how we switched to logarithms in base two.
- This is a multiplicative factor: $\log_2(p) = \log_r(p) \log_2(r)$.
- Choosing base 2 defines a unit of information: the bit.

Léon Bottou 14/31 COS 424 4/6/2010
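A small numeric illustration in Python (the two-message distribution is made up): the rarer message carries more bits, and the entropy is the expected surprisal.

    import math

    p = {"sun": 0.75, "rain": 0.25}                        # hypothetical distribution
    surprisal = {m: -math.log2(q) for m, q in p.items()}   # h(x) = -log2(p_x)
    entropy = sum(q * surprisal[m] for m, q in p.items())  # H(X) = expected surprisal
    print(surprisal)                                       # sun ~0.415 bits, rain 2.0 bits
    print(entropy)                                         # ~0.811 bits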

Mutual information

- Expected information: $H(X) = -\sum_i P(X = i) \log P(X = i)$
- Joint information: $H(X, Y) = -\sum_{i,j} P(X = i, Y = j) \log P(X = i, Y = j)$
- Mutual information: $I(X, Y) = H(X) + H(Y) - H(X, Y)$

Léon Bottou 15/31 COS 424 4/6/2010

II. Decision trees Léon Bottou 16/31 COS 424 4/6/2010

Car mileage

Predict which cars have better mileage than 19 mpg.

mpg   cyl  disp   hp     weight  accel  year  name
15.0  8    350.0  165.0  3693    11.5   70    buick skylark 320
18.0  8    318.0  150.0  3436    11.0   70    plymouth satellite
15.0  8    429.0  198.0  4341    10.0   70    ford galaxie 500
14.0  8    454.0  220.0  4354    9.0    70    chevrolet impala
15.0  8    390.0  190.0  3850    8.5    70    amc ambassador dpl
14.0  8    340.0  160.0  3609    8.0    70    plymouth cuda 340
18.0  4    121.0  112.0  2933    14.5   72    volvo 145e
22.0  4    121.0  76.00  2511    18.0   72    volkswagen 411
21.0  4    120.0  87.00  2979    19.5   72    peugeot 504
26.0  4    96.0   69.00  2189    18.0   72    renault 12
22.0  4    122.0  86.00  2310    16.0   72    ford pinto
28.0  4    97.0   92.00  2288    17.0   72    datsun 510
13.0  8    440.0  215.0  4735    11.0   73    chrysler new yorker
...

Léon Bottou 17/31 COS 424 4/6/2010

Questions

Many questions can distinguish cars:
- How many cylinders? (3, 4, 5, 8)
- Displacement greater than 200 cu in? (yes, no)
- Displacement greater than x cu in? (yes, no)
- Weight greater than x lbs? (yes, no)
- Model name longer than x characters? (yes, no)
- etc...

Which question brings the most information about the task?
- Build the contingency table.
- Compare the mutual informations I(Question, mpg > 19).

Possible answers:
           ansa  ansb  ansc  ansd
mpg > 19   12    23    65    5
mpg ≤ 19   18    12    4     4

Léon Bottou 18/31 COS 424 4/6/2010

Mutual information

Consider a contingency table $x_{ij}$:
- $1 \le j \le p$ refers to the question answers $X$.
- $1 \le i \le n$ refers to the target values $Y$.

           ansa  ansb  ansc  ansd
mpg > 19   12    23    65    5
mpg ≤ 19   18    12    4     4

Let $x_i = \sum_{j=1}^{p} x_{ij}$, $x_j = \sum_{i=1}^{n} x_{ij}$, and $x = \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}$.

Mutual information:
$I(X, Y) = -H(X, Y) + H(X) + H(Y) = \sum_{ij} \frac{x_{ij}}{x} \log\frac{x_{ij}}{x} - \sum_j \frac{x_j}{x} \log\frac{x_j}{x} - \sum_i \frac{x_i}{x} \log\frac{x_i}{x}$

Léon Bottou 19/31 COS 424 4/6/2010
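A minimal Python sketch computing this quantity for the table above (counts taken from the slide; natural logarithm, so the result is in nats):

    import math

    table = [[12, 23, 65, 5],     # mpg > 19
             [18, 12,  4, 4]]     # mpg <= 19

    def mutual_information(t):
        x = sum(map(sum, t))                      # grand total
        rows = [sum(r) for r in t]                # x_i (target marginals)
        cols = [sum(c) for c in zip(*t)]          # x_j (answer marginals)
        return sum((xij / x) * math.log(xij * x / (rows[i] * cols[j]))
                   for i, r in enumerate(t) for j, xij in enumerate(r) if xij)

    print(mutual_information(table))              # ~0.13 nats for this question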

Decision stump

The question generates a partition of the examples.
Now we can repeat the process for each node:
- build the contingency tables.
- pick the most informative question.

Léon Bottou 20/31 COS 424 4/6/2010

Decision trees Until all leaves contain a single car. Léon Bottou 21/31 COS 424 4/6/2010

Decision trees Then label each leaf with class MPG > 19 or MPG ≤ 19. We can now say if a car does more than 19 mpg by asking a few questions. But that is learning by heart! Léon Bottou 22/31 COS 424 4/6/2010

Pruning the decision tree

We can label each node with its dominant class, MPG > 19 or MPG ≤ 19.

The usual picture.
- Should we use a validation set?
- Which stopping criterion? The node depth? The node population?

Léon Bottou 23/31 COS 424 4/6/2010

The χ² independence test

We met this test when studying correspondence analysis (lecture 10).

$x_i = \sum_{j=1}^{p} x_{ij}$, $x_j = \sum_{i=1}^{n} x_{ij}$, $x = \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}$, $E_{ij} = \frac{x_i \, x_j}{x}$

If the row and column variables were independent,
$X^2 = \sum_{ij} \frac{(x_{ij} - E_{ij})^2}{E_{ij}}$
would asymptotically follow a $\chi^2$ distribution with $(n-1)(p-1)$ degrees of freedom.

Léon Bottou 24/31 COS 424 4/6/2010

Pruning a decision tree with the χ² test

We want to prune nodes when the contingency table suggests that there is no dependence between the question and the target class.

- Compute $X^2 = \sum_{ij} \frac{(x_{ij} - E_{ij})^2}{E_{ij}}$ for each node.
- Prune if $1 - F_{\chi^2}(X^2) > p$, where $F_{\chi^2}$ is the $\chi^2$ cumulative distribution function with $(n-1)(p-1)$ degrees of freedom.
- The parameter $p$ could be picked by cross-validation, but choosing $p = 0.05$ often works well enough.

Léon Bottou 25/31 COS 424 4/6/2010
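A minimal sketch of this check with SciPy (the table is the one from slide 19; the 0.05 threshold follows the slide):

    from scipy.stats import chi2_contingency

    table = [[12, 23, 65, 5],     # mpg > 19
             [18, 12,  4, 4]]     # mpg <= 19

    X2, pvalue, dof, expected = chi2_contingency(table)
    print(X2, dof)                # chi-square statistic, (n-1)(p-1) = 3 degrees of freedom
    print(pvalue)                 # 1 - F_chi2(X2)
    print(pvalue > 0.05)          # True would suggest pruning this split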

Conclusion

Good points
- Decision trees run quickly.
- Decision trees can handle all kinds of input variables.
- Decision trees can be interpreted relatively easily.
- Decision trees can handle lots of irrelevant features.

Bad points
- Decision trees are moderately accurate.
- Small changes in the training set can lead to very different trees. (Were we speaking about interpretability...)

Notes
- Other names for decision trees: ID3, C4.5, CART.
- Regression tree when the target is continuous.

Léon Bottou 26/31 COS 424 4/6/2010

III. Information theory and statistics Léon Bottou 27/31 COS 424 4/6/2010

Revisiting decision trees: likelihoods

The tree as a model of $P(Y \mid X)$
- Estimate $P(Y \mid X)$ by the target frequencies in the leaf for $X$.
- We can compute the likelihood of the data in this model.

Likelihood gain when splitting a node
- Let $x_{ij}$ be the contingency table for a node and a question.
- Splitting the node with a question increases the likelihood:

$\log L_{\text{after}} - \log L_{\text{before}} = \sum_{ij} x_{ij} \log\frac{x_{ij}}{x_j} - \sum_i x_i \log\frac{x_i}{x}$
$= \sum_{ij} x_{ij} \log\left(\frac{x_{ij}}{x} \frac{x}{x_j}\right) - \sum_i x_i \log\frac{x_i}{x}$
$= \sum_{ij} x_{ij} \log\frac{x_{ij}}{x} - \sum_j x_j \log\frac{x_j}{x} - \sum_i x_i \log\frac{x_i}{x}$

Compare with slide 19.

Léon Bottou 28/31 COS 424 4/6/2010
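This gain is the mutual information of slide 19 scaled by the node population; a quick numerical check in Python on the same table (my own verification, not from the slides):

    import math

    table = [[12, 23, 65, 5],     # rows i: target values, columns j: question answers
             [18, 12,  4, 4]]
    x = sum(map(sum, table))
    rows = [sum(r) for r in table]                 # x_i
    cols = [sum(c) for c in zip(*table)]           # x_j

    log_L_before = sum(xi * math.log(xi / x) for xi in rows)
    log_L_after = sum(xij * math.log(xij / cols[j])
                      for r in table for j, xij in enumerate(r) if xij)
    mi = sum((xij / x) * math.log(xij * x / (rows[i] * cols[j]))
             for i, r in enumerate(table) for j, xij in enumerate(r) if xij)

    print(log_L_after - log_L_before, x * mi)      # both ~18.6: gain = x * I(X, Y)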

Revisiting decision trees: log loss

The tree as a discriminant function
- Define $f(X) = \log\frac{p_X}{1 - p_X}$ where $p_X$ is the frequency of positive examples in the leaf corresponding to $X$.
- $\log\left(1 + e^{-y f(X)}\right) = \log\left(1 + \frac{1 - p_X}{p_X}\right) = -\log(p_X)$ if $y = +1$
- $\log\left(1 + e^{-y f(X)}\right) = \log\left(1 + \frac{p_X}{1 - p_X}\right) = -\log(1 - p_X)$ if $y = -1$

Log loss reduction when splitting a node
- Let $x_{ij}$ be the contingency table for a node and a question.

$R_{\text{before}} - R_{\text{after}} = -\sum_i x_i \log\frac{x_i}{x} + \sum_{ij} x_{ij} \log\frac{x_{ij}}{x_j}$
$= \sum_{ij} x_{ij} \log\frac{x_{ij}}{x} - \sum_j x_j \log\frac{x_j}{x} - \sum_i x_i \log\frac{x_i}{x}$

Compare with slides 19 and 28.

Note: regression trees use the mean squared loss.

Léon Bottou 29/31 COS 424 4/6/2010

Kullback-Leibler divergence

Definition
KL divergence between a true distribution $P(X)$ and an estimated distribution $P_\theta(X)$:

$D(P \| P_\theta) = \int \log\frac{P(x)}{P_\theta(x)} \, dP(x) = \underbrace{-\sum_x P(x) \log P_\theta(x)}_{H_{\text{approx}}} - \underbrace{\left(-\sum_x P(x) \log P(x)\right)}_{H_{\text{opt}}}$

- $H_{\text{opt}}$: optimal coding length for $X$.
- $H_{\text{approx}}$: expected code length for $X$ when the code is designed for distribution $P_\theta$ instead of the true distribution $P$.

The KL divergence measures the excess coding bits when the code is optimized for the estimated distribution instead of the true distribution.

Léon Bottou 30/31 COS 424 4/6/2010
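A small numeric sketch in Python (both distributions are made-up examples): the divergence equals the extra bits paid for coding samples from $P$ with a code designed for $P_\theta$.

    import math

    P       = [0.5, 0.25, 0.25]     # true distribution (hypothetical)
    P_theta = [0.4, 0.40, 0.20]     # estimated distribution (hypothetical)

    H_opt = -sum(p * math.log2(p) for p in P)                     # optimal code length
    H_approx = -sum(p * math.log2(q) for p, q in zip(P, P_theta)) # code built for P_theta
    kl = sum(p * math.log2(p / q) for p, q in zip(P, P_theta))
    print(H_approx - H_opt, kl)     # both ~0.072 bits of excess code length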

Maximum Likelihood

Minimize KL divergence
$\min_\theta D(P \| P_\theta) = \int \log\frac{P(x)}{P_\theta(x)} \, dP(x) \;\Longleftrightarrow\; \max_\theta \int \log P_\theta(x) \, dP(x)$

Maximize Log Likelihood
$\max_\theta \frac{1}{n} \sum_{i=1}^{n} \log P_\theta(x_i)$

- The log likelihood estimates $\text{Constant} - D(P \| P_\theta)$ using the training set.
- Maximizing the likelihood minimizes an estimate of the excess coding bits obtained by coding the training set.
- One hopes to achieve a good coding performance on future data.
- The Vapnik-Chervonenkis theory gives confidence intervals for the deviation
$\int \log P_\theta(x) \, dP(x) - \frac{1}{n} \sum_{i=1}^{n} \log P_\theta(x_i)$

Léon Bottou 31/31 COS 424 4/6/2010
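A closing sketch of this correspondence in Python (toy data and model, my own example): for a categorical model, the maximum-likelihood estimate is the empirical frequency, and it is also the model that codes the training sample with the fewest bits per observation.

    import math
    from collections import Counter

    sample = list("aabacaab")                    # hypothetical training data
    n = len(sample)

    def bits_per_observation(theta):             # -1/n sum_i log2 P_theta(x_i)
        return -sum(math.log2(theta[x]) for x in sample) / n

    ml = {s: c / n for s, c in Counter(sample).items()}      # ML estimate = frequencies
    other = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}             # some other model
    print(bits_per_observation(ml))                          # ~1.30 bits
    print(bits_per_observation(other))                       # ~1.58 bits, longer code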