Information Theory, Statistics, and Decision Trees
1 Information Theory, Statistics, and Decision Trees Léon Bottou COS 424 4/6/2010
2 Summary 1. Basic information theory. 2. Decision trees. 3. Information theory and statistics.
3 I. Basic information theory
4 Why do we care? Information theory: invented by Claude Shannon in 1948 ("A Mathematical Theory of Communication", Bell System Technical Journal, October 1948). The quantity of information, measured in bits. The capacity of a transmission channel. Data coding and data compression. Information gain: a derived concept. It quantifies how much information we acquire about a phenomenon, and gives a justification for the Kullback-Leibler divergence.
5 The coding paradigm. Intuition: the quantity of information of a message is the length of the smallest code that can represent the message. Paradigm: assume there are n possible messages $i = 1 \dots n$. We want a signal that indicates the occurrence of one of them. We can transmit an alphabet of r symbols; for instance, a wire could carry r = 2 electrical levels. The code for message i is a sequence of $l_i$ symbols. Properties: codes should be uniquely decodable. Average code length for a message: $\sum_{i=1}^{n} p_i l_i$.
6 Prefix codes. Messages 1 and 2 have codes one symbol long ($l_i = 1$). Messages 3 and 4 have codes two symbols long ($l_i = 2$). Messages 5 and 6 have codes three symbols long ($l_i = 3$). There is an unused three-symbol code; that's inefficient. Properties: prefix codes are uniquely decodable. There are trickier kinds of uniquely decodable codes, e.g. a → 0, b → 01, c → 011 versus the prefix code a → 0, b → 10, c → 110.
7 Kraft inequality. Uniquely decodable codes satisfy $\sum_{i=1}^{n} \left(\tfrac{1}{r}\right)^{l_i} \le 1$. All uniquely decodable codes satisfy this inequality; conversely, if integer code lengths $l_i$ satisfy this inequality, there exists a prefix code with such code lengths. Consequences: if some messages have short codes, others must have long codes. To minimize the average code length, give short codes to high-probability messages and long codes to low-probability messages. Equiprobable messages should have similar code lengths.
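A small Python check of the Kraft inequality (an illustration, not part of the original slides). The code lengths are those of the slide 6 example; the alphabet size r = 3 is an assumption, consistent with exactly one unused three-symbol codeword.

```python
def kraft_sum(lengths, r=2):
    """Kraft sum: sum over messages of (1/r)**l_i."""
    return sum((1.0 / r) ** l for l in lengths)

# Code lengths from the slide 6 example (two messages each of length 1, 2, 3).
# Assuming a ternary alphabet (r = 3): the sum is 26/27 <= 1, leaving room for
# exactly one unused three-symbol codeword, as the slide notes.
lengths = [1, 1, 2, 2, 3, 3]
s = kraft_sum(lengths, r=3)
print(f"Kraft sum = {s:.4f}, uniquely decodable code possible: {s <= 1}")
```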
8 Kraft inequality for prefix codes. Prefix codes satisfy the Kraft inequality: in a full r-ary tree of depth $l = \max_i l_i$, the codeword of message i claims $r^{l - l_i}$ leaves, these leaf sets are disjoint, so $\sum_i r^{l - l_i} \le r^{l}$, i.e. $\sum_i \left(\tfrac{1}{r}\right)^{l_i} \le 1$. All uniquely decodable codes satisfy the Kraft inequality, but the proof must deal with infinite sequences of messages. Conversely, given integer code lengths $l_i$ that satisfy the inequality: build a balanced r-ary tree of depth $l = \max_i l_i$; for each message, prune one subtree at depth $l_i$. The Kraft inequality ensures that there will be enough branches left to define a code for each message.
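The converse construction can be sketched in a few lines of Python. This is an illustrative canonical-code construction assuming only that the lengths satisfy the Kraft inequality; it is not code from the lecture.

```python
def to_base(value, r, width):
    """Write `value` as `width` base-r digits."""
    digits = []
    for _ in range(width):
        digits.append(value % r)
        value //= r
    return "".join(str(d) for d in reversed(digits))

def prefix_code_from_lengths(lengths, r=2):
    """Construct a prefix code with the requested codeword lengths.
    Mirrors the tree argument of slide 8: allocate the shortest codewords
    first, each allocation removing one subtree from the r-ary tree."""
    assert sum(r ** (-l) for l in lengths) <= 1 + 1e-12, "Kraft inequality violated"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codes = [None] * len(lengths)
    value, prev_len = 0, 0
    for i in order:
        value *= r ** (lengths[i] - prev_len)   # descend to depth l_i in the tree
        codes[i] = to_base(value, r, lengths[i])
        value += 1                              # move to the next free branch
        prev_len = lengths[i]
    return codes

print(prefix_code_from_lengths([1, 1, 2, 2, 3, 3], r=3))
# e.g. ['0', '1', '20', '21', '220', '221'] -- no codeword is a prefix of another
```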
9 Redundant codes. Assume $\sum_i r^{-l_i} < 1$. Then there are leftover branches in the tree: either there are codes that are not used, or there are multiple codes for some messages. For best compression we want $\sum_i r^{-l_i} = 1$. This is not always possible with integer code lengths $l_i$, but we can use it to compute a lower bound.
10 Lower bound for the average code length. Choose code lengths $l_i$ that solve $\min_{l_1 \dots l_n} \sum_i p_i l_i$ subject to $\sum_i r^{-l_i} = 1$ and $l_i > 0$. Define $s_i = r^{-l_i}$, that is, $l_i = -\log_r(s_i)$; equivalently, maximize $C = \sum_i p_i \log_r(s_i)$ subject to $\sum_i s_i = 1$. Setting the constrained gradient to zero gives $\frac{\partial C}{\partial s_i} = \frac{p_i}{s_i \log(r)} = \text{constant}$, that is $s_i \propto p_i$. Replacing in the constraint gives $s_i = p_i$. Therefore $l_i = -\log_r(p_i)$ and $\sum_i p_i l_i = -\sum_i p_i \log_r(p_i)$. Fractional code lengths: what does it mean to code a message on 0.5 symbols?
11 Arithmetic coding. An infinite sequence of messages $i_1, i_2, \dots$ can be viewed as a number $x = 0.i_1 i_2 i_3 \dots$ in base n. An infinite sequence of symbols $c_1, c_2, \dots$ can be viewed as a number $y = 0.c_1 c_2 c_3 \dots$ in base r.
12 Arithmetic coding. To encode a sequence of L messages $i_1, \dots, i_L$: the code y must belong to an interval of size $\prod_{k=1}^{L} p_{i_k}$. It is sufficient to specify $l(i_1 i_2 \dots i_L) = -\sum_{k=1}^{L} \log_r(p_{i_k})$ digits of y.
13 Arithmetic coding. To encode a sequence of L messages $i_1, \dots, i_L$, it is sufficient to specify $l(i_1 i_2 \dots i_L) = -\sum_{k=1}^{L} \log_r(p_{i_k})$ digits of y. The average code length per message is
$$\frac{1}{L} \sum_{i_1 i_2 \dots i_L} p_{i_1} \cdots p_{i_L} \left(-\sum_{k=1}^{L} \log_r(p_{i_k})\right) = -\frac{1}{L} \sum_{k=1}^{L} \sum_{i_1 \dots i_L \setminus i_k} \Big(\prod_{h \ne k} p_{i_h}\Big) \sum_{i_k=1}^{n} p_{i_k} \log_r(p_{i_k}) = -\sum_i p_i \log_r(p_i).$$
Arithmetic coding reaches the lower bound when $L \to \infty$.
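A minimal Python sketch of the interval-narrowing idea behind arithmetic coding (an illustration only, not the lecture's code; a practical coder emits digits incrementally rather than relying on floating point):

```python
import math

def arithmetic_interval(message, probs):
    """Narrow [low, high) for a sequence of message indices under `probs`.
    The final interval width is the product of the message probabilities."""
    cum = [0.0]
    for p in probs:
        cum.append(cum[-1] + p)
    low, high = 0.0, 1.0
    for m in message:
        width = high - low
        low, high = low + width * cum[m], low + width * cum[m + 1]
    return low, high

probs = [0.5, 0.25, 0.25]
msg = [0, 2, 1, 0, 0]
low, high = arithmetic_interval(msg, probs)
digits = math.ceil(-math.log2(high - low))        # binary digits needed (r = 2)
ideal = -sum(math.log2(probs[m]) for m in msg)    # -sum_k log2 p_{i_k}
print(f"interval width {high - low:.6f}, ~{digits} bits, ideal {ideal:.2f} bits")
```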
14 Quantity of information. Optimal code length: $l_i = -\log_r(p_i)$. Optimal expected code length: $\sum_i p_i l_i = -\sum_i p_i \log_r(p_i)$. Receiving a message x with probability $p_x$: the acquired information is $h(x) = -\log_2(p_x)$ bits. An informative message is a surprising message! Expecting a message X with distribution $p_1 \dots p_n$: the expected information is $H(X) = -\sum_{x \in X} p_x \log_2(p_x)$ bits. This is also called the entropy. These are two distinct definitions! Note how we switched to logarithms in base two. This is just a multiplicative factor: $\log_2(p) = \log_r(p) \log_2(r)$. Choosing base 2 defines a unit of information: the bit.
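Both definitions are easy to compute; a short Python illustration (an addition, not from the slides):

```python
import math

def surprisal(p_x):
    """Information acquired by receiving one message of probability p_x, in bits."""
    return -math.log2(p_x)

def entropy(p):
    """Expected information H(X) = -sum_x p_x log2 p_x, in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)

p = [0.5, 0.25, 0.125, 0.125]
print([surprisal(px) for px in p])   # [1.0, 2.0, 3.0, 3.0] bits: rarer = more surprising
print(entropy(p))                    # 1.75 bits expected per message
```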
15 Mutual information. Expected information: $H(X) = -\sum_i P(X = i) \log P(X = i)$. Joint information: $H(X, Y) = -\sum_{i,j} P(X = i, Y = j) \log P(X = i, Y = j)$. Mutual information: $I(X, Y) = H(X) + H(Y) - H(X, Y)$.
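A brief Python illustration of this identity on a made-up joint distribution (the numbers are hypothetical):

```python
import math

def entropy_bits(p):
    return -sum(v * math.log2(v) for v in p if v > 0)

def mutual_information(joint):
    """I(X,Y) = H(X) + H(Y) - H(X,Y) for a joint probability table (rows = X, columns = Y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_xy = entropy_bits([v for row in joint for v in row])
    return entropy_bits(px) + entropy_bits(py) - h_xy

joint = [[0.30, 0.10],
         [0.10, 0.50]]
print(f"I(X;Y) = {mutual_information(joint):.3f} bits")   # about 0.26 bits
```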
16 II. Decision trees
17 Car mileage. Predict which cars have better mileage than 19 mpg. [Excerpt of the car data table with columns mpg, cyl, disp, hp, weight, accel, year, name; example rows: buick skylark, plymouth satellite, ford galaxie, chevrolet impala, amc ambassador dpl, plymouth cuda, volvo 145e, volkswagen, peugeot, renault, ford pinto, datsun, chrysler new yorker, ...; the numeric values are not preserved in this transcription.]
18 Questions. Many questions can distinguish cars: How many cylinders? (3, 4, 5, 8). Displacement greater than 200 cu in? (yes, no). Displacement greater than x cu in? (yes, no). Weight greater than x lbs? (yes, no). Model name longer than x characters? (yes, no). Etc. Which question brings the most information about the task? Build the contingency table and compare the mutual informations I(Question, Mpg > 19). [Contingency table with columns ansa, ansb, ansc, ansd and rows mpg > 19, mpg ≤ 19; the counts are not preserved in this transcription.]
19 Mutual information. Consider a contingency table $x_{ij}$, where $1 \le j \le p$ indexes the question answers X and $1 \le i \le n$ indexes the target values Y (e.g. columns ansa, ansb, ansc, ansd and rows mpg > 19, mpg ≤ 19). Let $x_{i\cdot} = \sum_{j=1}^{p} x_{ij}$, $x_{\cdot j} = \sum_{i=1}^{n} x_{ij}$, and $x_{\cdot\cdot} = \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}$. Mutual information:
$$I(X, Y) = -H(X, Y) + H(X) + H(Y) = \sum_{ij} \frac{x_{ij}}{x_{\cdot\cdot}} \log \frac{x_{ij}}{x_{\cdot\cdot}} - \sum_{j} \frac{x_{\cdot j}}{x_{\cdot\cdot}} \log \frac{x_{\cdot j}}{x_{\cdot\cdot}} - \sum_{i} \frac{x_{i\cdot}}{x_{\cdot\cdot}} \log \frac{x_{i\cdot}}{x_{\cdot\cdot}}.$$
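A sketch of the same computation on a table of counts (the counts below are hypothetical; the slide's actual numbers were not preserved):

```python
import math

def mutual_information_counts(table):
    """I(X;Y) in bits from a contingency table of counts x_ij,
    using I = H(X) + H(Y) - H(X,Y)."""
    total = sum(sum(row) for row in table)
    def h(counts):
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)
    row_sums = [sum(row) for row in table]            # x_i.
    col_sums = [sum(col) for col in zip(*table)]      # x_.j
    return h(row_sums) + h(col_sums) - h([c for row in table for c in row])

# Rows: mpg > 19 and mpg <= 19; columns: the four possible answers.
table = [[40, 10, 5, 25],
         [ 5, 30, 20, 15]]
print(f"I(question; mpg > 19) = {mutual_information_counts(table):.3f} bits")
```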
20 Decision stump. The question generates a partition of the examples. Now we can repeat the process for each node: build the contingency tables and pick the most informative question.
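A minimal sketch of this greedy selection step on hypothetical toy data (not from the lecture); the information gain is computed as H(Y) minus the answer-weighted entropy of the child nodes, which equals the mutual information of slide 19:

```python
import math

def entropy_bits(labels):
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(labels, answers):
    """H(Y) minus the answer-weighted entropy of the child nodes."""
    n = len(labels)
    groups = {}
    for y, a in zip(labels, answers):
        groups.setdefault(a, []).append(y)
    conditional = sum(len(g) / n * entropy_bits(g) for g in groups.values())
    return entropy_bits(labels) - conditional

# Hypothetical toy data: label = (mpg > 19), two candidate questions.
labels = [True, True, False, False, True, False]
q_cylinders = [4, 4, 8, 8, 4, 8]                      # "How many cylinders?"
q_name_len = [True, False, True, False, True, False]  # "Model name longer than x?"
questions = [("cylinders", q_cylinders), ("name length", q_name_len)]
best = max(questions, key=lambda q: information_gain(labels, q[1]))
print("most informative question:", best[0])
```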
21 Decision trees. Repeat the splitting process until all leaves contain a single car.
22 Decision trees. Then label each leaf with class MPG > 19 or MPG ≤ 19. We can now say if a car does more than 19 mpg by asking a few questions. But that is learning by heart!
23 Pruning the decision tree. We can label each node with its dominant class, MPG > 19 or MPG ≤ 19. The usual picture. Should we use a validation set? Which stopping criterion: the node depth? the node population?
24 The χ² independence test. We met this test when studying correspondence analysis (lecture 10). With $x_{i\cdot} = \sum_{j=1}^{p} x_{ij}$, $x_{\cdot j} = \sum_{i=1}^{n} x_{ij}$, and $x_{\cdot\cdot} = \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}$, define the expected counts $E_{ij} = \frac{x_{i\cdot}\, x_{\cdot j}}{x_{\cdot\cdot}}$. If the row and column variables were independent, $X^2 = \sum_{ij} \frac{(x_{ij} - E_{ij})^2}{E_{ij}}$ would asymptotically follow a χ² distribution with $(n-1)(p-1)$ degrees of freedom.
25 Pruning a decision tree with the χ² test. We want to prune a node when its contingency table suggests that there is no dependence between the question and the target class. Compute $X^2 = \sum_{ij} \frac{(x_{ij} - E_{ij})^2}{E_{ij}}$ for each node. Prune if $1 - F_{\chi^2}(X^2) > p$, i.e. if the p-value of the test exceeds p. The parameter p could be picked by cross-validation, but choosing p = 0.05 often works well enough.
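A sketch of this pruning rule using SciPy's chi-squared test on hypothetical node counts (assuming SciPy is available; `chi2_contingency` returns the statistic, the p-value, the degrees of freedom, and the expected counts):

```python
from scipy.stats import chi2_contingency

def should_prune(table, p_threshold=0.05):
    """Chi-squared independence test on a node's contingency table.
    Prune when the p-value exceeds the threshold, i.e. when there is no
    evidence of dependence between the question and the target class."""
    x2, p_value, dof, expected = chi2_contingency(table, correction=False)
    return p_value > p_threshold, x2, p_value

# Hypothetical node tables (rows: mpg > 19 / mpg <= 19, columns: answers).
weak_node = [[8, 7], [9, 8]]        # question barely related to the target
strong_node = [[30, 2], [3, 25]]    # question strongly related to the target
for name, t in [("weak", weak_node), ("strong", strong_node)]:
    prune, x2, p = should_prune(t)
    print(f"{name} node: X^2 = {x2:.2f}, p-value = {p:.3f}, prune = {prune}")
```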
26 Conclusion. Good points: decision trees run quickly; they can handle all kinds of input variables; they can be interpreted relatively easily; they can handle lots of irrelevant features. Bad points: decision trees are moderately accurate; small changes in the training set can lead to very different trees (were we speaking about interpretability?). Notes: well-known decision tree algorithms include ID3, C4.5, and CART. Regression trees are used when the target is continuous.
27 III. Information theory and statistics
28 Revisiting decision trees: likelihoods. The tree as a model of P(Y | X): estimate P(Y | X) by the target frequencies in the leaf for X. We can compute the likelihood of the data under this model. Likelihood gain when splitting a node: let $x_{ij}$ be the contingency table for a node and a question. Splitting the node with the question increases the likelihood:
$$\log L_{\text{after}} - \log L_{\text{before}} = \sum_{ij} x_{ij} \log \frac{x_{ij}}{x_{\cdot j}} - \sum_i x_{i\cdot} \log \frac{x_{i\cdot}}{x_{\cdot\cdot}} = \sum_{ij} x_{ij} \log \left(\frac{x_{ij}}{x_{\cdot\cdot}} \frac{x_{\cdot\cdot}}{x_{\cdot j}}\right) - \sum_i x_{i\cdot} \log \frac{x_{i\cdot}}{x_{\cdot\cdot}} = \sum_{ij} x_{ij} \log \frac{x_{ij}}{x_{\cdot\cdot}} - \sum_j x_{\cdot j} \log \frac{x_{\cdot j}}{x_{\cdot\cdot}} - \sum_i x_{i\cdot} \log \frac{x_{i\cdot}}{x_{\cdot\cdot}}.$$
Compare with slide 19: the gain is $x_{\cdot\cdot}$ times the mutual information.
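A quick numerical check of this identity (hypothetical counts; natural logarithms on both sides):

```python
import math

def log_likelihood_gain(table):
    """log L_after - log L_before for splitting a node with contingency
    table `table` (rows: target values i, columns: question answers j)."""
    col = [sum(c) for c in zip(*table)]        # x_.j
    row = [sum(r) for r in table]              # x_i.
    total = sum(row)                           # x_..
    after = sum(x * math.log(x / col[j])
                for r in table for j, x in enumerate(r) if x > 0)
    before = sum(x * math.log(x / total) for x in row if x > 0)
    return after - before

def mutual_information_nats(table):
    total = sum(sum(r) for r in table)
    def h(counts):
        return -sum(c / total * math.log(c / total) for c in counts if c > 0)
    return (h([sum(r) for r in table]) + h([sum(c) for c in zip(*table)])
            - h([x for r in table for x in r]))

table = [[40, 10, 5, 25], [5, 30, 20, 15]]     # hypothetical counts
total = sum(sum(r) for r in table)
print(log_likelihood_gain(table))              # equals ...
print(total * mutual_information_nats(table))  # ... x_.. times I(X,Y)
```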
29 Revisiting decision trees: log loss. The tree as a discriminant function: define $f(X) = \log \frac{p_X}{1 - p_X}$, where $p_X$ is the frequency of positive examples in the leaf corresponding to X. Then
$$\log\left(1 + e^{-y f(X)}\right) = \begin{cases} \log\left(1 + \frac{1 - p_X}{p_X}\right) = -\log(p_X) & \text{if } y = +1, \\ \log\left(1 + \frac{p_X}{1 - p_X}\right) = -\log(1 - p_X) & \text{if } y = -1. \end{cases}$$
Log loss reduction when splitting a node: let $x_{ij}$ be the contingency table for a node and a question. Then
$$R_{\text{before}} - R_{\text{after}} = -\sum_i x_{i\cdot} \log \frac{x_{i\cdot}}{x_{\cdot\cdot}} + \sum_{ij} x_{ij} \log \frac{x_{ij}}{x_{\cdot j}} = \sum_{ij} x_{ij} \log \frac{x_{ij}}{x_{\cdot\cdot}} - \sum_j x_{\cdot j} \log \frac{x_{\cdot j}}{x_{\cdot\cdot}} - \sum_i x_{i\cdot} \log \frac{x_{i\cdot}}{x_{\cdot\cdot}}.$$
Compare with slides 19 and 28. Note: regression trees use the mean squared loss.
30 Kullback-Leibler divergence. Definition: the KL divergence between a true distribution P(X) and an estimated distribution $P_\theta(X)$ is
$$D(P \,\|\, P_\theta) = \int \log \frac{P(x)}{P_\theta(x)}\, dP(x) = \underbrace{-\sum_x P(x) \log P_\theta(x)}_{H_{\text{approx}}} \;-\; \underbrace{\left(-\sum_x P(x) \log P(x)\right)}_{H_{\text{opt}}}.$$
$H_{\text{opt}}$: optimal coding length for X. $H_{\text{approx}}$: expected code length for X when the code is designed for the distribution $P_\theta$ instead of the true distribution P. The KL divergence measures the excess coding bits when the code is optimized for the estimated distribution instead of the true distribution.
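In code, for discrete distributions (an illustrative addition, not from the slides):

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log2(P(x) / Q(x)), in bits: the expected excess
    code length when samples from P are coded with a code optimized for Q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_true = [0.5, 0.25, 0.125, 0.125]     # true distribution P
p_model = [0.25, 0.25, 0.25, 0.25]     # estimated distribution P_theta (uniform)
print(kl_divergence(p_true, p_model))  # 0.25 excess bits (2.0 vs entropy 1.75)
print(kl_divergence(p_true, p_true))   # 0.0: no excess when the model is exact
```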
31 Maximum likelihood. Minimize the KL divergence: $\min_\theta D(P \,\|\, P_\theta) = \int \log \frac{P(x)}{P_\theta(x)}\, dP(x) \iff \max_\theta \int \log P_\theta(x)\, dP(x)$. Maximize the log likelihood: $\max_\theta \frac{1}{n} \sum_{i=1}^{n} \log P_\theta(x_i)$. The log likelihood estimates $\text{Constant} - D(P \,\|\, P_\theta)$ using the training set. Maximizing the likelihood minimizes an estimate of the excess coding bits obtained by coding the training set. One hopes to achieve a good coding performance on future data. The Vapnik-Chervonenkis theory gives confidence intervals for the deviation $\int \log P_\theta(x)\, dP(x) - \frac{1}{n} \sum_{i=1}^{n} \log P_\theta(x_i)$.
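A small simulation (not from the lecture) showing that the model with the smaller average negative log likelihood on samples from P is also the one with the smaller $D(P \,\|\, P_\theta)$:

```python
import math, random

random.seed(0)
p_true = [0.5, 0.25, 0.125, 0.125]
sample = random.choices(range(4), weights=p_true, k=10_000)

def kl_bits(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def avg_neg_log_likelihood(sample, q):
    return -sum(math.log2(q[x]) for x in sample) / len(sample)

# Maximum-likelihood categorical estimate = empirical frequencies of the sample.
p_ml = [sample.count(k) / len(sample) for k in range(4)]
p_uniform = [0.25] * 4
for name, q in [("ML estimate", p_ml), ("uniform model", p_uniform)]:
    print(f"{name}: avg NLL = {avg_neg_log_likelihood(sample, q):.4f} bits, "
          f"D(P||P_theta) = {kl_bits(p_true, q):.4f} bits")
# The smaller coding cost on the sample goes with the smaller KL divergence.
```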