Machine Learning: Alternatives to Manual Knowledge Acquisition

- Interactive programs which elicit knowledge from the expert during the course of a conversation at the terminal.
- Programs which learn by scanning texts.
- Programs which learn the concepts of a domain under varying degrees of supervision from a human teacher.
Inductive Learning

Inductive learning is a form of supervised learning which involves learning from examples by a process of generalization. The learning task is to identify or construct the relevant concept, i.e., the concept which includes all of the positive examples and none of the negative examples. This kind of learning is often called concept learning.
Concept Learning Problem

A concept can be conceived of as a pattern which states those properties that are common to instances of the concept. Given
(i) a language of patterns for describing concepts,
(ii) sets of positive and negative instances of the target concept, and
(iii) a way of matching data in the form of training instances against candidate concept descriptions,
the task is to determine concept descriptions in the language that are consistent with the training instances.
Generality and Specificity

P1: a STANDING BRICK SUPPORTS a LYING WEDGE or BRICK.
P2: an object of any shape that is not LYING TOUCHES a WEDGE or BRICK in any orientation.

P1 and P2 both match the structure shown in the accompanying figure, but P1 is more specific than P2.
Representation Language

Properties and values of each car in the concept space:
- Origin: {Japan, USA, Britain, Germany, Italy}
- Manufacturer: {Honda, Toyota, Ford, Chrysler, BMW}
- Color: {Blue, Green, Red, White}
- Decade: {1960, 1970, 1980, 1990, 2000}
- Type: {Economy, Luxury, Sports}

A car is represented by an ordered list (x1, x2, x3, x4, x5). Thus the concept of a Japanese economy car is written (Japan, x2, x3, x4, Economy), where the remaining xi are variables that match any value.
Partial Ordering of Concepts

(x1, x2, x3, x4, x5)                                                       [most general]
  (Japan, x2, x3, x4, x5)    (x1, x2, x3, x4, Economy)    ...
    (Japan, x2, x3, x4, Economy)    (USA, x2, x3, x4, Economy)    ...
      (Japan, Honda, White, 1990, Economy)    (USA, Chrysler, Green, 1980, Economy)    ...    [most specific]
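The tuple representation and its generality ordering are easy to operationalize. Below is a minimal Python sketch, assuming patterns are 5-tuples in which None plays the role of a variable xi; the names covers and more_general are mine, not the lecture's:

```python
def covers(pattern, example):
    """True if the pattern matches the example; None acts as a variable."""
    return all(p is None or p == e for p, e in zip(pattern, example))

def more_general(p1, p2):
    """Partial order: p1 is at least as general as p2 if every value
    that p1 fixes agrees with p2 (everything else in p1 is a variable)."""
    return all(a is None or a == b for a, b in zip(p1, p2))

japan_economy = ("Japan", None, None, None, "Economy")
print(covers(japan_economy, ("Japan", "Honda", "White", 1990, "Economy")))  # True
print(more_general((None,) * 5, japan_economy))                             # True
```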
A Training Set

origin   mfr        color   decade   type      pos/neg
Japan    Honda      Blue    1980     Economy   pos
Japan    Toyota     Green   1970     Sports    neg
Japan    Toyota     Blue    1990     Economy   pos
USA      Chrysler   Red     2000     Economy   neg
Japan    Honda      White   1980     Economy   pos
Version Space

- G: the set of maximally general patterns.
- S: the set of maximally specific patterns.
- The version space is the set of all concept descriptions that lie between these two sets in the partial ordering.

[Figure: the boundary of G encloses the boundary of S; positive instances (+) fall inside S, while the region between the two boundaries is still undetermined (?).]
Candidate Elimination Algorithm

1. Initialize G to contain the most general description (i.e., all features are variables).
2. Initialize S to contain the first positive example.
3. Accept a new training example.
   - If it is a positive example, remove from G any descriptions that do not cover the example. Then update S to contain the most specific set of descriptions in the version space that cover the example and the current elements of S, i.e., generalize S as little as possible.
   - If it is a negative example, remove from S any descriptions that cover the example. Then update G to contain the most general set of descriptions in the version space that do not cover the example, i.e., specialize G as little as possible.
4. If G = S and both are singletons, output their value and halt. If G and S are singletons but G ≠ S, the training cases are inconsistent; output this result and halt. Otherwise, go to Step 3.
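A compact Python sketch of these steps, reusing covers and more_general from the earlier sketch. It assumes a single-pattern S boundary (sufficient for this conjunctive car language) and that the first example is positive; all function names are my own:

```python
def generalize(s, example):
    """Minimally generalize s to cover a positive example:
    replace every disagreeing value with a variable (None)."""
    return tuple(a if a == b else None for a, b in zip(s, example))

def specialize(g, s, example):
    """Minimal specializations of g that exclude a negative example
    while staying more general than the specific pattern s."""
    return [g[:i] + (sv,) + g[i + 1:]
            for i, (gv, sv, ev) in enumerate(zip(g, s, example))
            if gv is None and sv is not None and sv != ev]

def candidate_elimination(examples):
    """examples: list of (tuple, is_positive) pairs."""
    (first, positive), rest = examples[0], examples[1:]
    assert positive, "sketch assumes the first example is positive"
    S, G = first, [(None,) * len(first)]
    for x, pos in rest:
        if pos:
            G = [g for g in G if covers(g, x)]   # drop g that miss x
            S = generalize(S, x)                 # generalize S minimally
        else:
            assert not covers(S, x), "inconsistent training data"
            G = [h for g in G
                 for h in (specialize(g, S, x) if covers(g, x) else [g])]
            # keep only the maximally general members of G
            G = [g for g in G
                 if not any(h != g and more_general(h, g) for h in G)]
    return S, G
```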
A Search of Version Space

1st example (pos): (Japan, Honda, Blue, 1980, Economy)
  G = {(x1, x2, x3, x4, x5)}
  S = {(Japan, Honda, Blue, 1980, Economy)}

2nd example (neg): (Japan, Toyota, Green, 1970, Sports)
  G = {(x1, Honda, x3, x4, x5), (x1, x2, Blue, x4, x5), (x1, x2, x3, 1980, x5), (x1, x2, x3, x4, Economy)}
  S = {(Japan, Honda, Blue, 1980, Economy)}

3rd example (pos): (Japan, Toyota, Blue, 1990, Economy)
  G = {(x1, x2, Blue, x4, x5), (x1, x2, x3, x4, Economy)}
  S = {(Japan, x2, Blue, x4, Economy)}
A Search of Version Space (contd.)

4th example (neg): (USA, Chrysler, Red, 2000, Economy)
  G = {(x1, x2, Blue, x4, x5), (Japan, x2, x3, x4, Economy)}
  S = {(Japan, x2, Blue, x4, Economy)}
  ((x1, x2, Blue, x4, x5) already excludes the negative example, so only (x1, x2, x3, x4, Economy) needs to be specialized.)

5th example (pos): (Japan, Honda, White, 1980, Economy)
  G = {(Japan, x2, x3, x4, Economy)}
  S = {(Japan, x2, x3, x4, Economy)}

G and S have converged to the single concept "Japanese economy car".
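Running the candidate_elimination sketch above on these five examples reproduces the converged concept:

```python
examples = [
    (("Japan", "Honda", "Blue", 1980, "Economy"), True),
    (("Japan", "Toyota", "Green", 1970, "Sports"), False),
    (("Japan", "Toyota", "Blue", 1990, "Economy"), True),
    (("USA", "Chrysler", "Red", 2000, "Economy"), False),
    (("Japan", "Honda", "White", 1980, "Economy"), True),
]
S, G = candidate_elimination(examples)
print(S)  # ('Japan', None, None, None, 'Economy')
print(G)  # [('Japan', None, None, None, 'Economy')]
```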
Meta-DENDRAL

Meta-DENDRAL is an expert system that helps chemists determine the dependence of mass-spectrometric fragmentation on substructural features. It does this by discovering fragmentation rules for given classes of molecules. The system derives these rules from training instances consisting of sets of molecules with known 3-D structures and mass spectra.

Meta-DENDRAL uses the candidate elimination algorithm. It first generates a set of highly specific rules, each accounting for a single fragmentation in a particular molecule. Then it uses the training examples to generalize these rules.
Decision Trees as Knowledge Representation

Rules are not the only way of representing attribute-value information about concepts for the purpose of classification. Decision trees are an alternative way of structuring such information. Quinlan defines decision trees as structures that consist of:
- leaf nodes, representing a class, and
- decision nodes, specifying some test to be carried out on a single attribute value, with one branch for each possible outcome of the test.
A Training Set: Play/Don't Play

No.  Outlook   Temperature  Humidity  Windy  Class
1    sunny     hot          high      false  N
2    sunny     hot          high      true   N
3    overcast  hot          high      false  P
4    rain      mild         high      false  P
5    rain      cool         normal    false  P
6    rain      cool         normal    true   N
7    overcast  cool         normal    true   P
8    sunny     mild         high      false  N
9    sunny     cool         normal    false  P
10   rain      mild         normal    false  P
11   sunny     mild         normal    true   P
12   overcast  mild         high      true   P
13   overcast  hot          normal    false  P
14   rain      mild         high      true   N
Decision Tree Derived from Training Set

outlook
  sunny    -> humidity
                high   -> N
                normal -> P
  overcast -> P
  rain     -> windy
                true  -> N
                false -> P
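To make the tree concrete, here is a sketch encoding it as nested Python dicts (an assumed encoding, not Quinlan's), with a classifier that follows one branch per attribute value until it reaches a leaf:

```python
tree = {"outlook": {
    "sunny":    {"humidity": {"high": "N", "normal": "P"}},
    "overcast": "P",
    "rain":     {"windy": {"true": "N", "false": "P"}},
}}

def classify(node, case):
    """Walk decision nodes until a leaf (a class label) remains."""
    while isinstance(node, dict):
        attr, branches = next(iter(node.items()))
        node = branches[case[attr]]
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # P
```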
Classification Rules Based on the Decision Tree

If outlook = overcast, then P.
If outlook = sunny and humidity = normal, then P.
If outlook = rain and windy = false, then P.
ID3 Algorithm

Given (1) a set of disjoint target classes {C1, C2, ..., Ck} and (2) a set of training data S containing objects of more than one class, ID3 uses a series of tests to refine S into subsets that contain objects of only one class. ID3 builds a decision tree in which non-terminal nodes correspond to tests on a single attribute of the data and terminal nodes correspond to classified subsets of the data.

Let T be any test on a single attribute with outcomes O1, O2, ..., On. Then T produces a partition {S1, S2, ..., Sn} of S, where

  Si = {x ∈ S | T(x) = Oi}
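The refinement is naturally recursive. Here is a minimal sketch producing the nested-dict tree form used above; it assumes cases are dicts with a "class" key and a best_test helper that picks the most informative attribute (a concrete gain-based definition appears after the information-gain formula below). None of these names come from Quinlan's code.

```python
from collections import Counter

def id3(cases, attrs):
    """Grow a decision tree as nested dicts: {attr: {outcome: subtree}}."""
    classes = Counter(c["class"] for c in cases)
    if len(classes) == 1 or not attrs:        # pure subset, or no tests left
        return classes.most_common(1)[0][0]   # leaf: (majority) class
    attr = best_test(cases, attrs)            # test T on a single attribute
    rest = [a for a in attrs if a != attr]
    return {attr: {v: id3([c for c in cases if c[attr] == v], rest)
                   for v in {c[attr] for c in cases}}}
```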
Tree Structure of Partitioned Objects

[Figure: the test at the root node S branches on outcomes O1, O2, ..., On, each leading to one of the subsets S1, S2, ..., Sn.]
Information Theory

Consider a set of messages M = {m1, m2, ..., mn}. Each message mi has probability p(mi) of being received and contains an amount of information

  I(mi) = -log2 p(mi)

The uncertainty (or entropy) of a message set, U(M), is the sum of the information in the possible messages weighted by their probabilities:

  U(M) = -Σi p(mi) log2 p(mi),  for i = 1 to n
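As a sanity check, a small Python sketch of this formula (the function name is my own):

```python
import math

def uncertainty(probs):
    """U(M) = -sum_i p(m_i) log2 p(m_i), taking 0 * log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(uncertainty([0.5, 0.5]))  # 1.0 bit: a fair coin flip
print(uncertainty([1.0]))       # 0.0 bits: no uncertainty at all
```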
Building Decision Trees in ID3

Let Ni stand for the number of cases in S that belong to class Ci. Then the probability that a random case c belongs to class Ci is estimated to be

  p(c ∈ Ci) = Ni / |S|

Thus the amount of information in a message of class Ci is

  I(c ∈ Ci) = -log2 p(c ∈ Ci) bits

Consider the set of target classes as a message set {C1, C2, ..., Ck}. The uncertainty U(S) measures the average amount of information needed to determine the class of a random case c ∈ S prior to partitioning by any test:

  U(S) = Σi=1..k p(c ∈ Ci) I(c ∈ Ci) bits
Building Decision Trees in ID3 (contd.)

Consider a similar uncertainty measure after S has been partitioned into {S1, S2, ..., Sn} by a test T:

  U_T(S) = Σi=1..n (|Si| / |S|) U(Si)

U_T(S) measures the average amount of information still needed to determine the class of a case after the partitioning. Thus ID3 decides which attribute to branch on next by selecting the test T that gains the most information, i.e., that maximizes

  G_S(T) = U(S) - U_T(S)
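A sketch of these two quantities in Python, building on the uncertainty function above; cases are assumed to be dicts with a "class" key, and all names are my own:

```python
from collections import Counter

def u(cases):
    """U(S): class uncertainty of a set of cases."""
    n = len(cases)
    return uncertainty([k / n for k in Counter(c["class"] for c in cases).values()])

def gain(cases, attr):
    """G_S(T) = U(S) - U_T(S) for the test 'branch on attr'."""
    n = len(cases)
    u_t = 0.0
    for v in {c[attr] for c in cases}:            # one subset per outcome
        subset = [c for c in cases if c[attr] == v]
        u_t += len(subset) / n * u(subset)
    return u(cases) - u_t

def best_test(cases, attrs):
    """Pick the attribute whose test maximizes information gain."""
    return max(attrs, key=lambda a: gain(cases, a))
```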
Play/Don't Play Example

The training set contains 9 P cases and 5 N cases, so for the message set {P, N}:

  U(S) = -p/(p+n) log2 p/(p+n) - n/(p+n) log2 n/(p+n)
       = -(9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.9403

For T = Outlook, {S1, S2, S3} = {sunny, overcast, rain}:

  U(sunny)    = -(2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.971
  U(overcast) = -(4/4) log2 (4/4) - (0/4) log2 (0/4) = 0
  U(rain)     = -(3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.971

  U_Outlook(S) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.6936

  G_S(Outlook) = U(S) - U_Outlook(S) = 0.9403 - 0.6936 = 0.2467
Play/Don't Play Example (contd.)

Similarly:

  U_Temperature(S) = 0.9111    G_S(Temperature) = 0.9403 - 0.9111 = 0.0292
  U_Humidity(S)    = 0.7885    G_S(Humidity)    = 0.9403 - 0.7885 = 0.1518
  U_Windy(S)       = 0.8922    G_S(Windy)       = 0.9403 - 0.8922 = 0.0481

Thus T = Outlook has the highest information gain and is chosen as the root.
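These numbers can be checked with the gain sketch above; the row encoding below is my own:

```python
cols = ("outlook", "temperature", "humidity", "windy", "class")
rows = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]
data = [dict(zip(cols, r)) for r in rows]
for a in cols[:-1]:
    print(a, round(gain(data, a), 3))
# outlook 0.247, temperature 0.029, humidity 0.152, windy 0.048
```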
C4.5

C4.5 is a suite of programs that embodies the ID3 algorithm. In C4.5 the gain criterion is replaced by a gain ratio, H_S(T):

  H_S(T) = G_S(T) / V(S)

where

  V(S) = -Σi=1..n (|Si| / |S|) log2 (|Si| / |S|)

The new heuristic is to select a test that maximizes the gain ratio.
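A sketch of the gain ratio on top of the earlier helpers (again with my own function names):

```python
def gain_ratio(cases, attr):
    """H_S(T) = G_S(T) / V(S), where V(S) is the uncertainty of the
    partition sizes themselves (the "split information")."""
    n = len(cases)
    v = uncertainty([sum(1 for c in cases if c[attr] == x) / n
                     for x in {c[attr] for c in cases}])
    return gain(cases, attr) / v if v else 0.0
```

Dividing by the split information penalizes tests with many outcomes, which would otherwise look artificially informative under the plain gain criterion.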