Machine Learning 2010

1 Machine Learning 2010 Concept Learning: The Logical Approach Michael M Richter mrichter@ucalgary.ca

2 Part 1 Basic Concepts and Representation Languages

3 Why Concept Learning? Concepts describe properties of a given set of objects. Examples: The set of prime numbers: this set is precisely described; by its definition there is no doubt about membership. The property of having a certain type of lung infection: here we have certain indicators, but a more precise definition could improve the therapy. Special cases of concepts are definitions of functions. Where precise definitions are lacking, we have to rely on observations and learn from them.

4 Learning a Function Examples for the values of a function f: f(0) = -1, f(1) = -1, f(2) = -1, f(3) = 17, f(4) = 399. A possible hypothesis that is correct for the shown examples is f(x) = x^(x+1) - (x+1)^x. Occam's principle says: use the simplest description that describes the seen examples correctly. Depending on the criteria for simplicity, this is at least a good (and correct) description.
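
To make Occam's principle concrete, here is a minimal check (a Python sketch added for illustration, not part of the original slides) that the hypothesized f reproduces the five observed values:

def f(x: int) -> int:
    # hypothesis f(x) = x^(x+1) - (x+1)^x suggested for the observed values
    return x ** (x + 1) - (x + 1) ** x

examples = {0: -1, 1: -1, 2: -1, 3: 17, 4: 399}
for x, y in examples.items():
    assert f(x) == y, f"hypothesis fails on x={x}"
print("hypothesis matches all seen examples")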

5 Concept Learning (1) Basics; different representation languages; learning by searching: complete enumeration; searching by generalization (search by most special generalization, search by most general generalization); version space method.

6 Concept Learning (2) There are different ways to represent concepts: logical formulas (this chapter), trees (decision tree chapter), numerical methods (support vector machine chapter). They all have different advantages and disadvantages which we will discuss.

7 Classification Classification task: Given: basic set M, set of class indices I. Goal: assign to each element from M a class index from I. Example: basic set = set of bank customers; class indices I = { credit worthy, not credit worthy }; goal: classify the bank customers.

8 Classifier Definition: A classifier for some set M is a mapping f : M → I, where I is a set, called the index set. If I = {0, 1}, then we denote by P = { x | f(x) = 1 } the set of positive elements and by N = { x | f(x) = 0 } the set of negative elements.
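
As a small illustration (a Python sketch with a toy classifier of my own choosing, not from the slides), a classifier f : M → I with I = {0, 1} and the induced sets P and N:

M = range(10)

def f(x: int) -> int:
    # classify even numbers as positive (1), odd numbers as negative (0)
    return 1 if x % 2 == 0 else 0

P = {x for x in M if f(x) == 1}   # positive elements
N = {x for x in M if f(x) == 0}   # negative elements
print(sorted(P), sorted(N))       # [0, 2, 4, 6, 8] [1, 3, 5, 7, 9]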

9 Classifier Descriptions Distinguish: classifiers and classifier descriptions. Different possibilities for descriptions: enumeration of all elements (possible only for finite sets); presenting a formula of predicate logic; presenting a C program; ... Observe: a classifier description uniquely determines a classifier; a classifier can have different classifier descriptions.

10 Different Representation Languages How are examples and hypotheses (possible concepts) represented? Examples and hypotheses often use the same representation language. This has an influence on the learning algorithms and their complexity, and on the definition of the more-general relation.

11 Conjunctive Concepts (1) Representation by attribute-value pairs: attributes A_1,...,A_n with associated domains T_1,...,T_n. Examples are n-ary vectors (w_1, w_2,..., w_n) with w_i ∈ T_i. Equivalent logical representation: (A_1 = w_1) ∧ (A_2 = w_2) ∧ ... ∧ (A_n = w_n). A concept now is an n-ary vector (w_1, w_2,..., w_n) with w_i ∈ T_i ∪ {*} (meaning of *: don't care or don't know). The example (e_1, e_2,..., e_n) belongs to the concept (w_1, w_2,..., w_n) iff ∀ i = 1,...,n: (w_i = e_i) ∨ (w_i = *).

12 Conjunctive Concepts (2) Checking the more-general relation: (w_1,...,w_n) ≤ (w'_1,...,w'_n) iff ∀ i = 1,...,n: (w'_i = w_i) ∨ (w'_i = *). The ≤-relation is a partial ordering. Example: attributes A_1: Size, A_2: Shape; positive example: (small, circle); negative example: (large, triangle); (small, circle) ≤ (small, *) and (small, circle) ≤ (*, circle).
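
The membership test and the more-general test can both be written down directly; the following Python sketch (function names are mine, not from the slides) represents conjunctive concepts as tuples over T_i ∪ {*} and is reused by later examples in this chapter:

STAR = "*"

def covers(concept, example):
    # an example belongs to a concept iff every concept value is '*' or equal
    return all(w == STAR or w == e for w, e in zip(concept, example))

def more_general(c1, c2):
    # c2 <= c1: c1 is more general than c2 (a partial ordering)
    return all(w1 == STAR or w1 == w2 for w1, w2 in zip(c1, c2))

# the slide's example with attributes Size and Shape
assert covers(("small", STAR), ("small", "circle"))
assert more_general((STAR, "circle"), ("small", "circle"))
assert not covers((STAR, "circle"), ("large", "triangle"))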

13 Generality of Concepts Definition: A concept K_1 is more general than a concept K_2 (notation: K_2 ≤ K_1) iff for all e ∈ M we have: if K_2(e) then K_1(e) is also true. The inverse relation is "more special than". (Figure: a small hierarchy of concepts with K_2 ≤ K_1 and K_3 ≤ K_1.)

14 Most Special Generalizations Definition: A concept K is a most special generalization of a given set E of examples iff K is a complete and consistent concept description for E and for all other complete and consistent concept descriptions K' for E we have: if K' ≤ K then also K ≤ K'. (Figure: the hypothesis space with concepts that are not complete and consistent, concepts that are complete and consistent, and the most special generalizations marked.)

15 Most General Specializations Definition: A concept K is a most general specialization of a given set E of examples iff K is a complete and consistent concept description for E and for all other complete and consistent concept descriptions K' for E we have: if K ≤ K' then also K' ≤ K. (Figure: the hypothesis space with concepts that are not complete and consistent, concepts that are complete and consistent, and the most general specializations marked.)

16 Representation by Rules Examples: variable-free rules. Concepts: arbitrary rules. Definition: For rules Φ and Ψ we have Φ ≥ Ψ (Φ is more general than Ψ) iff there is a substitution σ such that Ψ = σ(Φ) holds. Example:
Has_fever(Bill) ∧ Sniffles(Bill) ∧ Coughs(Bill) → Has_cold(Bill)
Has_fever(X) ∧ Sniffles(X) ∧ Coughs(X) → Has_cold(X)
Has_fever(X) ∧ Sniffles(X) → Has_cold(X)
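
The substitution criterion can be illustrated with a very small matcher (a simplified Python sketch of my own; rules are lists of (predicate, argument) literals, variable names are declared explicitly, and only rules with the same literal structure are compared, so the shorter third rule above is not covered):

VARIABLES = {"X"}  # treat "X" as a variable, everything else as a constant

def match_rule(general, special, variables=VARIABLES):
    # return a substitution sigma with sigma(general) == special, or None
    if len(general) != len(special):
        return None
    sigma = {}
    for (pred_g, arg_g), (pred_s, arg_s) in zip(general, special):
        if pred_g != pred_s:
            return None
        if arg_g in variables:
            if sigma.setdefault(arg_g, arg_s) != arg_s:
                return None
        elif arg_g != arg_s:
            return None
    return sigma

general = [("Has_fever", "X"), ("Sniffles", "X"), ("Coughs", "X"), ("Has_cold", "X")]
special = [("Has_fever", "Bill"), ("Sniffles", "Bill"), ("Coughs", "Bill"), ("Has_cold", "Bill")]
print(match_rule(general, special))  # {'X': 'Bill'}, so the variable rule is more general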

17 Properties of Concepts Definitions: A concept K is complete for some set E of examples iff for all e ∈ E: if e is positive then K(e) is true. A concept K is consistent for some set E of examples iff for all e ∈ E: if e is negative then K(e) is false. Goal: find complete and consistent concepts. (Figure: concepts that are complete but not consistent, not complete but consistent, and complete and consistent.)
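
Both properties translate directly into code for conjunctive concepts; a minimal Python sketch (my own, repeating the covers() membership test so it runs on its own):

STAR = "*"

def covers(concept, example):
    return all(w == STAR or w == e for w, e in zip(concept, example))

def complete(concept, positives):
    # K is complete for E iff it covers every positive example
    return all(covers(concept, p) for p in positives)

def consistent(concept, negatives):
    # K is consistent for E iff it covers no negative example
    return not any(covers(concept, n) for n in negatives)

P = [("small", "circle"), ("large", "circle")]
N = [("large", "triangle")]
print(complete(("*", "circle"), P), consistent(("*", "circle"), N))  # True True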

18 Part 2 Learning Methods and Algorithms

19 Learning Concepts Determine the concept on the basis of some given classified examples (experiences). (Figure: experiences, i.e. positive and negative examples, are fed into the learning function, which outputs a concept K(X).) The generated concepts are hypotheses. Learning tells us only that something is a plausible explanation, never that it is always true.

20 Incremental vs. Non-Incremental Learning Non-incremental concept learning: given a set P of positive examples and a set N of negative examples, wanted: a concept. Incremental concept learning: given the actual concept (hypothesis) and a new example (positive or negative), wanted: the updated concept.

21 Idea of the Version Space (Figure: concepts ordered from more general at the top to more special at the bottom; the inconsistent concepts lie above the boundary set G, the incomplete concepts below the boundary set S; negative examples push G downwards, positive examples push S upwards; H, the version space of all possible solutions, lies between G and S.) Shrinking of the version space performs a search.

22 Classification with the Version Space (1) Application of the version space for classifying new elements a of the underlying set M. Definition: An element a ∈ M is classified as a positive element if and only if ∀ K ∈ H: K(a). Equivalent definition, but simpler to verify: an element a ∈ M is classified as a positive element if and only if ∀ K ∈ S: K(a).

23 Classification with the Version Space (2) Definition: An element a ∈ M is classified as a negative element if and only if ∀ K ∈ H: ¬K(a). Equivalent definition (simpler to verify): an element a ∈ M is classified as a negative element if and only if ∀ K ∈ G: ¬K(a).

24 Classification with the Version Space (3) Definition: An element a is not classified if and only if ∃ K_1 ∈ H with K_1(a) and ∃ K_2 ∈ H with ¬K_2(a). Equivalent definition (simpler to verify): an element a is not classified if and only if ∃ K_1 ∈ G with K_1(a) and ∃ K_2 ∈ S with ¬K_2(a).

25 VS Algorithm for Conjunctive Concepts (1)
Given: set of examples a_1,...,a_n, with a_1 positive
Initialize (when a_1 is presented): S := a_1, G := {(*,...,*)}
For each a_i (i = 2,...,n) DO:
  IF a_i positive THEN
    FOR each K ∈ G DO
      IF K does not include a_i THEN G := G \ {K}
    S := most special generalization of S which includes a_i
    IF G = {} OR (∃ K ∈ G with K < S, i.e. K strictly more special than S) THEN STOP: failure
    IF G = {S} THEN STOP: success with S

26 VS Algorithm for Conjunctive Concepts (2)
  IF a_i negative THEN
    IF S includes a_i THEN STOP: failure
    G' := {}
    FOR EACH K ∈ G DO
      G' := G' ∪ { K' | K' is a most general specialization of K which excludes a_i and S ≤ K' }
    G := G'
    IF G = {} THEN STOP: failure
    IF G = {S} THEN STOP: success with S
Success means the search has terminated. If no more examples exist: STOP; version space: S, G
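
A compact Python reconstruction of this algorithm (my own sketch, not the original code; the termination tests follow the standard candidate-elimination conditions, S is a single tuple, G a list of tuples over the domains extended by '*'):

STAR = "*"

def covers(concept, example):
    return all(w == STAR or w == e for w, e in zip(concept, example))

def more_general(c1, c2):
    # True iff c1 is more general than or equal to c2
    return all(w1 == STAR or w1 == w2 for w1, w2 in zip(c1, c2))

def generalize(s, example):
    # most special generalization of s that includes the positive example
    return tuple(w if w == e else STAR for w, e in zip(s, example))

def specialize(g, example, domains):
    # most general specializations of g that exclude the negative example
    return [g[:i] + (v,) + g[i + 1:]
            for i, w in enumerate(g) if w == STAR
            for v in domains[i] if v != example[i]]

def vs_conjunctive(examples, domains):
    # examples: list of (vector, is_positive); the first example must be positive
    first, _ = examples[0]
    s, g = tuple(first), [tuple(STAR for _ in first)]
    for a, positive in examples[1:]:
        if positive:
            g = [k for k in g if covers(k, a)]        # drop concepts excluding a
            s = generalize(s, a)
            if not g or not any(more_general(k, s) for k in g):
                return None                           # failure: S is no longer below G
        else:
            if covers(s, a):
                return None                           # failure: S covers a negative example
            g = [k2 for k in g for k2 in specialize(k, a, domains)
                 if more_general(k2, s)]              # keep only K' with S <= K'
            if not g:
                return None                           # failure
        if g == [s]:
            break                                     # success: S and G have converged
    return s, g

domains = [("sm", "la"), ("ci", "sq", "tr")]
examples = [(("sm", "ci"), True), (("la", "tr"), False), (("la", "ci"), True)]
print(vs_conjunctive(examples, domains))  # (('*', 'ci'), [('*', 'ci')]), as in Example 1 below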

27 Properties of the Algorithm For a finite number of examples the algorithm always terminates. Correctness: 1. If the correct concept K is in the space of hypotheses (i.e. has a conjunctive description), then the algorithm either terminates with success and K = S after sufficiently many examples have been presented, or it terminates after all examples are processed and the intended concept K is an element of the remaining version space. 2. If the algorithm terminates with failure, then the concept K is not an element of the space of hypotheses. 3. If the concept K is not an element of the space of hypotheses, nothing can be asserted. If the algorithm stops because it runs out of examples, the version space generated so far can still be used to classify many objects.

28 Example 1 Attributes: Size (small, large); Shape (circle, triangle, square). (Figure: the lattice of conjunctive concepts from (*,*) at the top down to the fully specified concepts (sm,ci), (la,ci), (sm,sq), (la,sq), (sm,tr), (la,tr).)
a_1 = (sm,ci) positive: S = (sm,ci), G = {(*,*)}
a_2 = (la,tr) negative: S = (sm,ci), G = { (*,ci), (sm,*) }
a_3 = (la,ci) positive: S = (*,ci), G = { (*,ci) }. Success!

29 Example 2 (Same concept lattice as in Example 1.) We try to learn the concept "small circle or large triangle":
a_1 = (sm,ci) positive: S = (sm,ci), G = {(*,*)}
a_2 = (la,ci) negative: S = (sm,ci), G = { (sm,*) }
a_3 = (la,tr) positive: S = (*,*), G = {}. Failure!

30 Example 3 (Same concept lattice as in Example 1.) We try again to learn the concept "small circle or large triangle", but with a different ordering of the examples:
a_1 = (sm,ci) positive: S = (sm,ci), G = {(*,*)}
a_2 = (la,tr) positive: S = (*,*), G = {(*,*)}. Success!
But the algorithm has generalized too much, because the concept cannot be expressed as a conjunction: it presents an incorrect result.

31 Remark The version space algorithm finds only concepts that are in the hypothesis space. In many applications it cannot be guaranteed a priori that the wanted concept is in the hypothesis space. What to do? Intuitively: find the best concept that is available in the hypothesis space. But how to define "best" and how to find it? This leads to approximate learning and is discussed in the section on PAC learning.

32 Example Task: generate a classifier for assigning bank customers to one of the classes { credit worthy, not credit worthy }. Traditional approach: knowledge acquisition from banking experts and programming a special classifier. Using a learning system: make use of experiences from the past: take the set of previous customers and their classifications { credit worthy, not credit worthy } and learn a classifier automatically (concept learning from positive and negative examples).

33 Algorithm: Complete Enumeration
Input: P: set of positive examples; N: set of negative examples
Output: H: set of all complete, consistent concepts
H := {}
FOR each concept K DO
  IF K contains all examples from P AND K contains no example from N THEN H := H ∪ {K}
RETURN H
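
A direct Python sketch of the enumeration (my own code; covers() repeats the membership test used earlier, and the concept space is the conjunctive star language over the given domains):

from itertools import product

STAR = "*"

def covers(concept, example):
    return all(w == STAR or w == e for w, e in zip(concept, example))

def complete_enumeration(P, N, domains):
    # all conjunctive concepts that are complete for P and consistent for N
    H = []
    for K in product(*[tuple(d) + (STAR,) for d in domains]):
        if all(covers(K, p) for p in P) and not any(covers(K, n) for n in N):
            H.append(K)
    return H

domains = [("sm", "la"), ("ci", "sq", "tr")]
print(complete_enumeration([("sm", "ci")], [("la", "tr")], domains))
# [('sm', 'ci'), ('sm', '*'), ('*', 'ci')]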

34 Properties of the Algorithm Assumptions: the representation language is finite; membership of examples in a concept is decidable. Disadvantages: very inefficient; no differentiation between the learned concepts. Advantage: very easy to realize.

35 Improving the Search Idea: make use of the more-general relation. Control of the search: from special to general concepts, from general to special concepts, or a combined search.

36 Algorithm
Input: P = {p_1,...,p_m}: set of positive examples; N: set of negative examples
(w_1,...,w_n) := p_1
FOR i = 2 to m DO
  Let p_i = (w'_1,...,w'_n)
  FOR j = 1 to n DO
    IF w_j ≠ w'_j THEN w_j := *
IF (w_1,...,w_n) contains an example from N THEN RETURN {}
ELSE RETURN { (w_1,...,w_n) }
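
The same algorithm as a short Python sketch (my own code, with the covers() test repeated so it runs on its own):

STAR = "*"

def covers(concept, example):
    return all(w == STAR or w == e for w, e in zip(concept, example))

def most_special_generalization(P, N):
    # generalize the first positive example attribute-wise over all of P,
    # then reject the result if it covers a negative example from N
    w = list(P[0])
    for p in P[1:]:
        for j in range(len(w)):
            if w[j] != p[j]:
                w[j] = STAR
    concept = tuple(w)
    return set() if any(covers(concept, n) for n in N) else {concept}

print(most_special_generalization([("sm", "ci"), ("la", "ci")], [("la", "tr")]))
# {('*', 'ci')}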

37 Properties of the Algorithm Properties: the algorithm terminates and returns the most special generalization of the examples if it exists, otherwise the empty set. Assumption: finite domains. Advantage: very efficient. Disadvantage: the algorithm works in this simple form only for conjunctive descriptions.

38 Learning by Breadth-First Search Search direction: general → special. Learning goal: learning of some concept; finding the most general generalizations. Experiences: positive and negative examples, attribute-value representation. Hypothesis space: conjunctive concept descriptions. Example presentation: non-incremental. Search strategy: breadth-first.

39 Algorithm
Input: P: set of positive examples; N: set of negative examples
C := {}; H := { (*,...,*) }
WHILE TRUE DO
  FOR ALL h ∈ H DO
    IF NOT (h contains all examples from P) THEN H := H \ {h}
    ELSE IF h contains no example from N THEN H := H \ {h}; C := C ∪ {h}
  IF H = {} THEN RETURN C
  H' := {}   (* generate specializations *)
  FOR ALL (w_1,...,w_n) ∈ H DO
    FOR ALL i ∈ {1,...,n} with w_i = * DO
      FOR ALL w'_i ∈ T_i DO
        IF NOT [ ∃ h ∈ C with h ≥ (w_1,...,w_{i-1},w'_i,w_{i+1},...,w_n) ]
        THEN H' := H' ∪ { (w_1,...,w_{i-1},w'_i,w_{i+1},...,w_n) }
  H := H'
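
A Python sketch of this breadth-first, general-to-specific search (my own reconstruction; the helper tests are repeated so the example runs on its own). It returns the set C of all most general complete and consistent conjunctive concepts:

STAR = "*"

def covers(concept, example):
    return all(w == STAR or w == e for w, e in zip(concept, example))

def more_general(c1, c2):
    return all(w1 == STAR or w1 == w2 for w1, w2 in zip(c1, c2))

def breadth_first_concepts(P, N, domains):
    C = []                                    # solutions found so far
    H = [tuple(STAR for _ in domains)]        # frontier, most general concept first
    while True:
        frontier = []
        for h in H:
            if not all(covers(h, p) for p in P):
                continue                      # not complete: drop h
            if not any(covers(h, n) for n in N):
                C.append(h)                   # complete and consistent: keep h
            else:
                frontier.append(h)            # complete but inconsistent: specialize h
        if not frontier:
            return C
        H = []
        for h in frontier:                    # generate specializations
            for i, w in enumerate(h):
                if w != STAR:
                    continue
                for v in domains[i]:
                    cand = h[:i] + (v,) + h[i + 1:]
                    if cand not in H and not any(more_general(c, cand) for c in C):
                        H.append(cand)

domains = [("sm", "la"), ("ci", "sq", "tr")]
print(breadth_first_concepts([("sm", "ci"), ("la", "ci")], [("la", "tr")], domains))
# [('*', 'ci')]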

40 Properties of the Algorithm Properties: the algorithm terminates and delivers the set of all most general generalizations of the examples. Assumption: finite domains. Disadvantage: inefficient, and therefore impractical for large numbers of attributes and large domains. But observe: the algorithm can be extended to other representations.

41 The General Version Space Method VS Concept learning from positive and negative examples. The examples are presented incrementally. Bidirectional breadth-first search, i.e. combining the two directions general → special and special → general.

42 Generalization Operator
Given: concept K = (w_1,...,w_n), positive example B = (w'_1,...,w'_n)
(Figure: K is replaced by a generalization K' ≥ K that covers B.)
Algorithm:
FOR j = 1 to n DO
  IF w_j ≠ w'_j THEN w_j := *
RETURN (w_1,...,w_n)

43 Specialization Operator
Given: concept K = (w_1,...,w_n), negative example B = (w'_1,...,w'_n)
(Figure: K is replaced by specializations K_1, K_2,..., K_m that exclude B.)
Algorithm:
IF concept K does not contain example B THEN RETURN { K }
S := {}
FOR j = 1 to n DO
  IF w_j = * THEN
    FOR EACH w ∈ T_j \ { w'_j } DO
      S := S ∪ { (w_1,...,w_{j-1},w,w_{j+1},...,w_n) }
RETURN S

44 General VS Algorithm (1)
Given: set of examples a_1,...,a_n, with a_1 positive
Initialize (when example a_1 is presented):
  S := { K } where K is the most special concept description containing a_1
  G := { A } where A is the most general concept description
Processing the examples a_i (i > 1):
IF a_i is positive THEN
  1. Remove from G all concepts that do not contain a_i.
  2. Replace the concepts s ∈ S by the most special generalizations of s that contain a_i and exclude all earlier negative examples; remove them if this is impossible or if they are more general than a concept from G.
  3. Remove each concept s ∈ S that is more general than another concept from S.

45 General VS Algorithm (2)
IF a_i is negative THEN
  1. Remove from S all concepts that contain a_i.
  2. Replace the concepts g ∈ G by the most general specializations of g that exclude a_i and contain all earlier positive examples; if this is impossible or if they are more special than some concept from S, remove them.
  3. Remove each concept g ∈ G that is more special than some other concept from G.
Termination criteria:
IF S = G and |S| = 1 THEN STOP: learning success with S
IF S = {} or G = {} THEN STOP: failure
IF all examples are processed THEN STOP; version space: S, G

46 Properties of the Algorithm Termination: the algorithm terminates if the set of examples is finite. Correctness: 1. If the wanted concept K is in the hypothesis space, then either the algorithm stops with success after sufficiently many examples and we get K = S, or the algorithm stops after processing all examples and the wanted concept K is then in the version space. 2. If the algorithm stops with failure, then the wanted concept is not in the hypothesis space. 3. If the wanted concept is not in the hypothesis space, then no assertion can be made.

47 Advantages of the Version Space Method Incremental. Correctness of the learning result is guaranteed if the concept is in the hypothesis space. The algorithm recognizes when no more examples are needed. The algorithm can in principle learn more powerful representations.

48 Disadvantages of the Version Space Method High complexity for more powerful (e.g. disjunctive) description languages. The cardinality of the sets S and G can grow exponentially with the number of examples. Convergence of S and G is lost for disjunctive description languages: S = disjunction of all positive examples, G = conjunction of the negated negative examples. Convergence only after all examples are seen!

49 Quality Criteria for Learned Concepts Learning concepts from examples is always an inductive conclusion; the correctness of the learned concept for the whole set cannot be assured. Evaluation functions (quality of the learned concept): Classification quality: percentage of correctly classified elements of the set. Costs for misclassification: assuming each misclassification causes costs, the total (or expected) sum of costs has to be considered; this is known as cost-sensitive classification.
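
Both evaluation functions are easy to state in code; a tiny Python sketch (my own, with an illustrative cost matrix that is not from the slides):

def accuracy(predictions, truths):
    # fraction of correctly classified elements
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)

def total_cost(predictions, truths, cost):
    # sum of misclassification costs; cost[(true, predicted)] for each mismatch
    return sum(cost.get((t, p), 0.0) for p, t in zip(predictions, truths) if p != t)

preds  = [1, 0, 1, 1]
truths = [1, 0, 0, 1]
cost = {(0, 1): 5.0, (1, 0): 1.0}  # a false positive is assumed costlier than a false negative
print(accuracy(preds, truths), total_cost(preds, truths, cost))  # 0.75 5.0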

50 Discussion: Correctness (1) Correctness seems to be a natural quality condition for concept learning: we want a learned concept to classify elements correctly. As a consequence, one concept is more correct than another if it classifies more elements correctly. The version space algorithm takes this up by considering only concepts that classify the elements seen so far correctly. Although this seems plausible, the question arises: does the fact that a concept classifies the seen examples correctly have an impact on its correctness for all elements?

51 Discussion: Correctness (2) We introduce a correctness measure for classifying concepts: corr(C) = { a ∈ M | C classifies a correctly }. This is a partially ordered measure (by set inclusion) that refers to all elements, not only to the seen ones. If one regards this as a kind of landscape, experience in the area of hill climbing indicates that going upwards only is often not the best way. In fact, theoretical investigations show that it is provably not the best way in general (see [Lange, Wiehagen]).

52 Classification and Diagnosis (1) Machine learning technology is well suited for inducing diagnostic and prognostic rules and for solving small and specialized diagnostic and prognostic problems. What has to be done is to enter the data, i.e. the records of patients with a known correct diagnosis, into the computer in an appropriate form and run the learning algorithm. This is of course an oversimplification, but in principle medical diagnostic knowledge can be derived automatically from the descriptions of cases solved in the past. The derived classifier can then be used either to assist the physician when diagnosing new patients, in order to improve diagnostic speed, accuracy and/or reliability, or to train students or non-specialist physicians to diagnose patients in some special diagnostic problem.

53 Classification and Diagnosis (2) In the area of diagnostics several additional aspects occur because there may be several reasons for a fault or a disease. In order to make the connection between the examples and the learned concepts explicit one often uses rules, e.g.:
Has_fever(Bill) ∧ Sniffles(Bill) ∧ Coughs(Bill) → Has_a_Cold(Bill)
Has_fever(Mary) ∧ Sniffles(Mary) ∧ Coughs(Mary) → Has_Influenza(Mary)
Learned rules:
Has_fever(Person) ∧ Sniffles(Person) ∧ Coughs(Person) → Has_a_Cold(Person)
Has_fever(Person) ∧ Sniffles(Person) ∧ Coughs(Person) → Has_Influenza(Person)

54 Comparison: Machine Learning vs. Humans Four physicians, specialists in each domain, were tested at the University Medical Center in Ljubljana; a subset of patients was randomly selected. The physicians' performance in the table is the average over the four specialists in each domain and is compared with two learning algorithms (naive Bayes, Assistant).
Classifier    Primary Tumor   Breast Cancer   Thyroid   Rheumatology
Naive Bayes   49%             78%             70%       67%
Assistant     44%             77%             73%       61%
Physicians    42%             64%             64%       66%
Both algorithms significantly outperform the diagnostic performance of the physicians in terms of classification accuracy and the average information score of the classifier. Source: Tatjana Zrimec and Igor Kononenko, University of Ljubljana.

55 Summary Different representations; completeness and correctness; the more-general relation, anti-unification; the version space and the version space algorithm; classification and diagnosis.

56 Recommended Literature
T. Mitchell: Machine Learning. McGraw Hill, 1997.
Tzung-Pei Hong, Shian-Shyong Tseng: Generalized Version Space Learning Algorithm for Noisy and Uncertain Data. CS Digital Library.
Papers with typical applications:
Hee-Woong Lim, Ji-Eun Yun, Hae-Man Jang, Young-Gyu Chai, Suk-In Yoo, and Byoung-Tak Zhang: Version Space Learning with DNA Molecules. Lecture Notes in Computer Science, Vol.
Tessa Lau, Pedro Domingos, Daniel S. Weld: Learning Programs from Traces using Version Space Algebra. ICML 2000, Stanford, CA, June 2000, pp.
