Information Gain. Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University.


Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Information Gain
Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599
Copyright 2001, 2003, Andrew W. Moore

Bits

You are watching a set of independent random samples of X. You see that X has four possible values:

P(X=A) = 1/4    P(X=B) = 1/4    P(X=C) = 1/4    P(X=D) = 1/4

So you might see: BAACBADCDADDDA

You transmit data over a binary serial link. You can encode each reading with two bits (e.g. A = 00, B = 01, C = 10, D = 11):

0100001001001110110011111100
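As a quick check on the fixed-length scheme, here is a minimal Python sketch (the code table and sample string come from the slide; the function and variable names are my own) that encodes the stream at exactly 2 bits per symbol:

```python
# Fixed-length code from the slide: every symbol costs exactly 2 bits.
FIXED_CODE = {"A": "00", "B": "01", "C": "10", "D": "11"}

def encode_fixed(symbols: str) -> str:
    """Concatenate the 2-bit codeword for each symbol."""
    return "".join(FIXED_CODE[s] for s in symbols)

sample = "BAACBADCDADDDA"
bits = encode_fixed(sample)
print(bits)                      # 0100001001001110110011111100
print(len(bits) / len(sample))   # 2.0 bits per symbol
```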

Fewer Bits

Someone tells you that the probabilities are not equal:

P(X=A) = 1/2    P(X=B) = 1/4    P(X=C) = 1/8    P(X=D) = 1/8

It's possible to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How?

A = 0, B = 10, C = 110, D = 111

(This is just one of several ways.)
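A one-line check (a sketch; the dictionary names are mine) confirms the 1.75-bit figure by weighting each codeword length by its symbol probability:

```python
# Average cost of the variable-length code from the slide.
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code  = {"A": "0", "B": "10", "C": "110", "D": "111"}

avg_bits = sum(p * len(code[s]) for s, p in probs.items())
print(avg_bits)   # 1.75 bits per symbol on average
```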

Fewer Bits

Suppose there are three equally likely values:

P(X=A) = 1/3    P(X=B) = 1/3    P(X=C) = 1/3

Here's a naïve coding, costing 2 bits per symbol:

A = 00, B = 01, C = 10

Can you think of a coding that would need only 1.6 bits per symbol on average? In theory, it can in fact be done with 1.58496 bits per symbol.

General Case

Suppose X can have one of m values V1, V2, ..., Vm:

P(X=V1) = p1,  P(X=V2) = p2,  ...,  P(X=Vm) = pm

What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It's

H(X) = -p1 log2 p1 - p2 log2 p2 - ... - pm log2 pm = -Σ_{j=1..m} pj log2 pj

H(X) = the entropy of X.

High entropy means X is from a uniform (boring) distribution. Low entropy means X is from a varied (peaks and valleys) distribution.
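The entropy formula above translates directly into code. Here is a small sketch (the helper name entropy is mine) that reproduces the numbers quoted on these slides:

```python
import math

def entropy(probs):
    """H(X) = -sum_j p_j * log2(p_j), skipping zero-probability values."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))    # 2.0        (four equally likely values)
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75       (the 'Fewer Bits' distribution)
print(entropy([1/3, 1/3, 1/3]))             # 1.58496... (three equally likely values)
```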

General Case (continued)

High entropy: a histogram of the frequency distribution of values of X would be flat, and so the values sampled from it would be all over the place.

Low entropy: a histogram of the frequency distribution of values of X would have many lows and one or two highs, and so the values sampled from it would be more predictable.
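To make the flat-versus-peaky picture concrete, the sketch below (the two distributions are my own illustrative choices) compares a flat histogram with one dominated by a single high bar:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

flat  = [0.25, 0.25, 0.25, 0.25]   # flat histogram: samples are all over the place
peaky = [0.85, 0.05, 0.05, 0.05]   # one high bar: samples are much more predictable

print(entropy(flat))    # 2.0   -> high entropy
print(entropy(peaky))   # ~0.85 -> low entropy
```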

Entropy in a nut-shell

Low entropy: the values (locations of soup) sampled entirely from within the soup bowl.

High entropy: the values (locations of soup) unpredictable, almost uniformly sampled throughout our dining room.

Specific Conditional Entropy H(Y|X=v)

Suppose I'm trying to predict output Y and I have input X. Here X = College Major and Y = Likes "Gladiator":

X           Y
Math        Yes
History     No
CS          Yes
Math        No
Math        No
CS          Yes
History     No
Math        Yes

Let's assume this reflects the true probabilities. E.g., from this data we estimate

P(LikeG = Yes) = 0.5
P(Major = Math & LikeG = No) = 0.25
P(Major = Math) = 0.5
P(LikeG = Yes | Major = History) = 0

Note: H(X) = 1.5, H(Y) = 1

Definition of Specific Conditional Entropy: H(Y|X=v) = the entropy of Y among only those records in which X has value v.

Example:

H(Y | X = Math) = 1
H(Y | X = History) = 0
H(Y | X = CS) = 0

Conditional Entropy H(Y|X)

Definition of Conditional Entropy:

H(Y|X) = the average specific conditional entropy of Y
= if you choose a record at random, what will be the conditional entropy of Y, conditioned on that row's value of X
= expected number of bits to transmit Y if both sides will know the value of X
= Σj Prob(X=vj) H(Y | X = vj)
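The H(Y|X=v) values quoted above, which are the ingredients of H(Y|X), can be reproduced with a short sketch (assuming the Major/LikeG table from the previous slide; the function and variable names are mine):

```python
import math
from collections import Counter

def entropy(values):
    """Entropy of the empirical distribution of a list of values."""
    counts, n = Counter(values), len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def specific_cond_entropy(xs, ys, v):
    """H(Y | X=v): entropy of Y among only the records where X == v."""
    return entropy([y for x, y in zip(xs, ys) if x == v])

# The Major / LikeG table from the slides.
X = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
Y = ["Yes",  "No",      "Yes", "No",  "No",   "Yes", "No",      "Yes"]

print(specific_cond_entropy(X, Y, "Math"))     # 1.0
print(specific_cond_entropy(X, Y, "History"))  # 0.0
print(specific_cond_entropy(X, Y, "CS"))       # 0.0
```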

Example of Conditional Entropy, H(Y|X) = Σj Prob(X=vj) H(Y | X = vj):

vj         Prob(X=vj)   H(Y | X = vj)
Math       0.5          1
History    0.25         0
CS         0.25         0

H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5

Information Gain

Definition of Information Gain:

IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X?

IG(Y|X) = H(Y) - H(Y|X)

Example: H(Y) = 1, H(Y|X) = 0.5. Thus IG(Y|X) = 1 - 0.5 = 0.5
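Conditional entropy and information gain follow the same pattern. A self-contained sketch (same assumed table and naming as above) reproduces H(Y|X) = 0.5 and IG(Y|X) = 0.5:

```python
import math
from collections import Counter

def entropy(values):
    counts, n = Counter(values), len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def cond_entropy(xs, ys):
    """H(Y|X) = sum_v P(X=v) * H(Y | X=v)."""
    n = len(xs)
    return sum((c / n) * entropy([y for x, y in zip(xs, ys) if x == v])
               for v, c in Counter(xs).items())

def info_gain(xs, ys):
    """IG(Y|X) = H(Y) - H(Y|X)."""
    return entropy(ys) - cond_entropy(xs, ys)

X = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
Y = ["Yes",  "No",      "Yes", "No",  "No",   "Yes", "No",      "Yes"]

print(cond_entropy(X, Y))   # 0.5
print(info_gain(X, Y))      # 0.5
```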

Information Gain Example
(worked example shown as a figure on the original slide)

Another example
(shown as a figure on the original slide)

Relative Information Gain

Definition of Relative Information Gain:

RIG(Y|X) = I must transmit Y. What fraction of the bits on average would it save me if both ends of the line knew X?

RIG(Y|X) = (H(Y) - H(Y|X)) / H(Y)

Example: H(Y|X) = 0.5, H(Y) = 1. Thus RIG(Y|X) = (1 - 0.5)/1 = 0.5

What is Information Gain used for?

Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find

IG(LongLife | HairColor) = 0.01
IG(LongLife | Smoker) = 0.2
IG(LongLife | Gender) = 0.25
IG(LongLife | LastDigitOfSSN) = 0.00001

IG tells you how interesting a 2-d contingency table is going to be.
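Relative information gain and the attribute-ranking use case can be sketched in a few lines (the ranking numbers are the hypothetical values from the slide; in practice each IG would be computed with an info_gain routine like the one sketched above):

```python
def relative_info_gain(h_y, h_y_given_x):
    """RIG(Y|X) = (H(Y) - H(Y|X)) / H(Y)."""
    return (h_y - h_y_given_x) / h_y

print(relative_info_gain(1.0, 0.5))   # 0.5, matching the slide's example

# Rank candidate attributes by IG, using the hypothetical values from the slide.
ig_scores = {
    "HairColor":      0.01,
    "Smoker":         0.2,
    "Gender":         0.25,
    "LastDigitOfSSN": 0.00001,
}
for attr, ig in sorted(ig_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"IG(LongLife | {attr}) = {ig}")
# Gender and Smoker come out as the most informative predictors of LongLife.
```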