Information Theory and ID3 Algo.


Information Theory and ID3 Algo. Sohn Jong-Soo mis026@korea.ac.kr 2007.10.09

Claude Elwood Shannon, the father of information theory
Born: April 30, 1916, Petoskey, Michigan
Died: February 24, 2001, Medford, Massachusetts (Alzheimer's disease)
Occupation: electrical engineer and mathematician
Education: B.S., University of Michigan (EE & mathematics, 1936); M.S., MIT (1937), thesis "A Symbolic Analysis of Relay and Switching Circuits"; Ph.D., MIT (1940), thesis "An Algebra for Theoretical Genetics"
Career: researcher at Bell Labs (1941~1972), MIT professor (1956~2001)
Military service: worked as a cryptanalyst during World War II
Landmark paper: "A Mathematical Theory of Communication," Bell System Technical Journal, 1948
Family: Betty (wife), Andrew (son), Margarita (daughter)

Fundamentals of Information Theory
Three fundamental problems: data compression, sampling (digitizing), classification
Q: How do we measure information? The concept of entropy
111111000000 : low entropy
110101011010 : high entropy
Information is measured in bits. Examples: the entropy of a fair coin is 1 bit; the entropy of a biased coin with P(heads) = 0.9 is about 0.469 bits; the entropy of winning Lotto 6/45 is about 0.0000028 bits.
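These figures can be checked with a short calculation. The following is a minimal Python sketch (mine, not from the slides) of the binary entropy function; note that the full binary entropy of the Lotto 6/45 event comes out near 0.0000030 bits, while the slide's 0.0000028 counts only the winning-outcome term.

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a two-outcome event with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))            # fair coin: 1.0 bit
print(binary_entropy(0.9))            # biased coin, P(heads) = 0.9: ~0.469 bits
print(binary_entropy(1 / 8_145_060))  # winning Lotto 6/45: ~0.000003 bits
```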

Fundamentals of Information Theory " 여기세로가로 4 장씩 16 장의트럼프가있다. 이중한장만머릿속에점찍어두시오 " " 예, 점찍었습니다 " " 그것은상단에있는가?" " 그렇습니다 " " 그럼상단의오른쪽반에있는가?" " 아닙니다 " " 그럼왼쪽반의상단에있는가?" " 아닙니다 " " 그럼하단의오른쪽에있는가?" " 그렇습니다 " " 당신이점찍은것은크로바 3 이군요 " 첫째, 카드의수. 여기서는 16 장이므로아무런정보도없는상황에서알아맞힐확률은 1/16 둘째, 상대의대답의종류. 여기서는예, 아니오두종류만허용 셋째, 질문횟수. 여기서는 4 회.16 장의카드에서특정카드를알아맞히는데 4 회의질문이필요

Fundamentals of Information Theory
$2^4 = 16$; generalizing, $W = 2^n$, where $W$ is the number of cards and $n$ is the number of questions.
Identifying one card out of 16 takes four questions: $\log_2 16 = 4$, so the amount of information needed to pick one specific card out of 16 is 4 bits.
Taking logarithms: $\log W = n \log 2$, so $n = \log W / \log 2 = \log_2 W$.
Since the probability of picking one particular card out of 16 is 1/16, the same information can be written in terms of this probability: $n = -\log_2 (1/16)$, and in general $n = -\log_2 P$.
For a sample set of size $s$ partitioned into classes of sizes $s_1, s_2, \ldots, s_m$, the expected information is
$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$
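The counting argument is easy to verify numerically; a two-line Python check (illustrative only):

```python
import math

W = 16                     # number of cards
print(math.log2(W))        # questions needed: log2(16) = 4.0
print(-math.log2(1 / W))   # the same value, written via the probability 1/16
```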

Fundamentals of Information Theory
For two classes with proportions $P_1$ and $P_2$, the contributions are $-P_1 \log_2 P_1$ and $-P_2 \log_2 P_2$, so the entropy is
$E = -P_1 \log_2 P_1 - P_2 \log_2 P_2$
The expected information after partitioning on attribute $A$ with $v$ values is
$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s}\, I(s_{1j}, \ldots, s_{mj})$
and the information gain of $A$ is
$\mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A)$
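A minimal Python sketch of these three quantities (the function and variable names are mine, not from the slides):

```python
import math

def info(class_counts):
    """I(s1, ..., sm): expected information of a node, given examples per class."""
    s = sum(class_counts)
    return sum(-si / s * math.log2(si / s) for si in class_counts if si > 0)

def expected_info(splits):
    """E(A): weighted information after splitting on attribute A.
    splits has one entry per attribute value: the class counts [s1j, ..., smj]."""
    s = sum(sum(branch) for branch in splits)
    return sum(sum(branch) / s * info(branch) for branch in splits)

def gain(class_counts, splits):
    """Gain(A) = I(s1, ..., sm) - E(A)."""
    return info(class_counts) - expected_info(splits)

# Two classes (9 vs 5) split three ways into (2,3), (4,0), (3,2):
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.247
```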

Entropy
X: a discrete random variable with distribution p(x). Entropy (or self-information):
$H(p) = H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$
Entropy measures the amount of information in a RV: it is the average length of the message needed to transmit an outcome of that variable using the optimal code.
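As a small worked example of the "average message length" reading (my own illustration, not from the slides): for the distribution {0.5, 0.25, 0.25}, an optimal prefix code uses codewords of lengths 1, 2 and 2 bits, and the average codeword length equals the entropy.

```python
import math

p = [0.5, 0.25, 0.25]
H = -sum(pi * math.log2(pi) for pi in p)
avg_len = 0.5 * 1 + 0.25 * 2 + 0.25 * 2   # optimal prefix code: 0, 10, 11
print(H, avg_len)                         # both are 1.5 bits
```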

Joint Entropy
The joint entropy of two RVs X, Y is the amount of information needed on average to specify both their values:
$H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x, y)$

Conditional Entropy
The conditional entropy of a RV Y given another X expresses how much extra information one still needs to supply on average to communicate Y, given that the other party knows X. It is also called the equivocation of X about Y.
$H(Y \mid X) = \sum_{x \in X} p(x)\, H(Y \mid X = x) = -\sum_{x \in X} \sum_{y \in Y} p(x)\, p(y \mid x) \log p(y \mid x) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(y \mid x) = E\big(-\log p(Y \mid X)\big)$
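Joint and conditional entropy can be computed directly from a joint distribution table. The sketch below uses a small made-up p(x, y), purely for illustration, and checks the chain rule H(X, Y) = H(X) + H(Y|X).

```python
import math

# Illustrative joint distribution p(x, y) over X in {0, 1}, Y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H_joint(p):
    """H(X, Y) = -sum_{x,y} p(x, y) log2 p(x, y)."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def H_cond(p):
    """H(Y|X) = -sum_{x,y} p(x, y) log2 p(y|x)."""
    px = {}
    for (x, _), v in p.items():
        px[x] = px.get(x, 0.0) + v
    return -sum(v * math.log2(v / px[x]) for (x, _), v in p.items() if v > 0)

px = {}
for (x, _), v in joint.items():
    px[x] = px.get(x, 0.0) + v
HX = -sum(v * math.log2(v) for v in px.values())   # marginal entropy H(X)
print(H_joint(joint), HX + H_cond(joint))          # chain rule: both values match
```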

Mutual Information
Since $H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$, we have
$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$
I(X, Y) is the mutual information between X and Y. It is the reduction in uncertainty of one RV due to knowing about the other, or the amount of information one RV contains about the other.

Mutual Information (cont.)
$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$
I(X, Y) is 0 exactly when X and Y are independent, since then $H(X \mid Y) = H(X)$.
$H(X) = H(X) - H(X \mid X) = I(X, X)$: entropy is the self-information of a RV.
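Continuing with the same kind of illustrative joint table (made up for the example), this sketch checks numerically that H(X) - H(X|Y), H(Y) - H(Y|X) and H(X) + H(Y) - H(X, Y) all give the same mutual information.

```python
import math

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(dist):
    """Entropy of a distribution given as a dict of probabilities."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def marginal(p, axis):
    """Marginal distribution of X (axis=0) or Y (axis=1)."""
    out = {}
    for xy, v in p.items():
        out[xy[axis]] = out.get(xy[axis], 0.0) + v
    return out

HX, HY, HXY = H(marginal(joint, 0)), H(marginal(joint, 1)), H(joint)
H_X_given_Y = HXY - HY   # chain rule
H_Y_given_X = HXY - HX
print(HX - H_X_given_Y, HY - H_Y_given_X, HX + HY - HXY)  # all equal: I(X, Y)
# I(X, X) = H(X) - H(X|X) = H(X), since H(X|X) = 0: entropy is self-information.
```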

ID3 algorithm
ID3 is a nonincremental algorithm, meaning it derives its classes from a fixed set of training instances. The classes created by ID3 are inductive; that is, given a small set of training instances, the specific classes created by ID3 are expected to work for all future instances. The distribution of the unknown (future) instances must be the same as that of the test cases. Inductive classes cannot be proven to work in every case, since they may classify an infinite number of instances. Note that ID3 (or any inductive algorithm) may misclassify data.

ID3 algorithm
Data description. The sample data used by ID3 has certain requirements:
Attribute-value description: the same attributes must describe each example, and each attribute must have a fixed number of values.
Predefined classes: each example's class must already be defined; the classes are not learned by ID3.
Discrete classes: classes must be sharply delineated. Continuous classes broken up into vague categories such as a metal being "hard, quite hard, flexible, soft, quite soft" are suspect.
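To make the attribute-value requirement concrete, here is a minimal sketch of one training example in that format (the record layout and the validity check are my own illustration, using the attributes of the example that follows):

```python
# Each example gives a value for every attribute plus a predefined class label.
attributes = {
    "outlook": {"sunny", "overcast", "rain"},
    "temperature": {"hot", "mild", "cool"},
    "humidity": {"high", "normal"},
    "wind": {"weak", "strong"},
}

example = {"outlook": "sunny", "temperature": "hot",
           "humidity": "high", "wind": "weak", "class": "no"}

# Attribute-value description: every attribute present, values from a fixed set.
for name, allowed in attributes.items():
    assert example[name] in allowed, f"illegal value for {name}"
```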

ID3 algorithm
Attribute selection: information gain is used. The attribute with the highest information gain (the one most useful for classification) is selected. Entropy measures the amount of information in an attribute. Given a collection S of examples falling into c classes,
$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$
where $p_i$ is the proportion of S belonging to class i, and the sum runs over the c classes. Note that S is not an attribute but the entire sample set.
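A minimal Python sketch of Entropy(S) computed from class labels (my own code; the 9-vs-5 split is the class distribution of the training set used in the example below):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = sum over classes of -p_i * log2(p_i)."""
    total = len(labels)
    return sum(-n / total * math.log2(n / total)
               for n in Counter(labels).values())

# 9 positive and 5 negative examples give an entropy of about 0.940 bits.
print(entropy(["yes"] * 9 + ["no"] * 5))
```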

ID3 algorithm - Example
Should we play baseball? Over the course of two weeks, data is collected to help ID3 build a decision tree (see the following table). The weather attributes are outlook, temperature, humidity, and wind. They can take the following values:
outlook = { sunny, overcast, rain }
temperature = { hot, mild, cool }
humidity = { high, normal }
wind = { weak, strong }

ID3 algorithm - Example
The 14 training examples collected over two weeks:
Day  Outlook   Temperature  Humidity  Wind    Play ball
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

ID3 algorithm - Example
We need to find which attribute will be the root node of our decision tree. The gain is calculated for all four attributes:
Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
The Outlook attribute has the highest gain, so it is used as the decision attribute in the root node.

ID3 algorithm - Example
Since Outlook has three possible values, the root node has three branches (sunny, overcast, rain). The next question is: which attribute should be tested at the Sunny branch node?
S_sunny = {D1, D2, D8, D9, D11}, so |S_sunny| = 5
Gain(S_sunny, Humidity) = 0.970
Gain(S_sunny, Temperature) = 0.570
Gain(S_sunny, Wind) = 0.019
Humidity has the highest gain, so it is used as the decision node on this branch.
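The gains quoted on this slide and the previous one can be reproduced from the training table with a short script. The sketch below is my own code, not from the slides; its output matches the quoted values up to rounding in the third decimal place.

```python
import math
from collections import Counter

# (outlook, temperature, humidity, wind, play) for D1..D14 from the table above.
DATA = [
    ("sunny", "hot", "high", "weak", "no"),           # D1
    ("sunny", "hot", "high", "strong", "no"),         # D2
    ("overcast", "hot", "high", "weak", "yes"),       # D3
    ("rain", "mild", "high", "weak", "yes"),          # D4
    ("rain", "cool", "normal", "weak", "yes"),        # D5
    ("rain", "cool", "normal", "strong", "no"),       # D6
    ("overcast", "cool", "normal", "strong", "yes"),  # D7
    ("sunny", "mild", "high", "weak", "no"),          # D8
    ("sunny", "cool", "normal", "weak", "yes"),       # D9
    ("rain", "mild", "normal", "weak", "yes"),        # D10
    ("sunny", "mild", "normal", "strong", "yes"),     # D11
    ("overcast", "mild", "high", "strong", "yes"),    # D12
    ("overcast", "hot", "normal", "weak", "yes"),     # D13
    ("rain", "mild", "high", "strong", "no"),         # D14
]
ATTRS = {"outlook": 0, "temperature": 1, "humidity": 2, "wind": 3}

def entropy(rows):
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return sum(-c / total * math.log2(c / total) for c in counts.values())

def gain(rows, attr):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    idx = ATTRS[attr]
    remainder = 0.0
    for value in {row[idx] for row in rows}:
        subset = [row for row in rows if row[idx] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

for attr in ATTRS:                                  # root node: Outlook wins (~0.247)
    print(attr, round(gain(DATA, attr), 3))

sunny = [row for row in DATA if row[0] == "sunny"]
for attr in ("temperature", "humidity", "wind"):    # Sunny branch: Humidity wins (~0.971)
    print(attr, round(gain(sunny, attr), 3))
```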

ID3 algorithm - Example
The resulting decision tree: Outlook is tested at the root. The Sunny branch tests Humidity (High → No, Normal → Yes), the Overcast branch is a Yes leaf, and the Rain branch tests Wind (Strong → No, Weak → Yes).

ID3 algorithm - Example
The decision tree can also be expressed in rule format:
IF outlook = sunny AND humidity = high THEN playball = no
IF outlook = sunny AND humidity = normal THEN playball = yes
IF outlook = overcast THEN playball = yes
IF outlook = rain AND wind = strong THEN playball = no
IF outlook = rain AND wind = weak THEN playball = yes

Next presentation
OLAP/OLAM architecture (layer diagram):
Layer 4 - User interface (user GUI API)
Layer 3 - OLAP/OLAM engines (data cube API)
Layer 2 - Multidimensional database (MDDB) with metadata, fed through filtering and integration
Layer 1 - Data repository: databases and data warehouse, accessed through a database API with data cleaning, data integration, and filtering

Next presentation
Sample relation and how each attribute is generalized:
Name           | Gender | Major   | Birth-Place           | Birth_date | Residence                | Phone #  | GPA
Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
Generalization | Removed | Retained | Sci, Eng, Bus | Country | Age range | City | Removed | Excl, VG, ..