Discrete Bayes. Robert M. Haralick. Computer Science, Graduate Center City University of New York


1 Discrete Bayes Robert M. Haralick Computer Science, Graduate Center City University of New York

2 Outline 1 The Basic Perspective 2 3

3 Perspective

4 Unit of Observation Not Equal to Unit of Classification Consider the task of recognizing a truck with a machine gun mounted on its back end in the context of guerrilla urban warfare. The sensor is a camera. The unit of observation is the pixel, millions of them. No one pixel of the truck carries sufficient information about the class.

5 Grouping The units of observation have to be grouped into relevant groups. Classification must be based on features extracted from the relevant groups.

6 Cyber Attack Units of Observation The units of observation are transaction records. For example, the records might be records of the secure log. The problem is to decide whether or not an IP address should be banned.
Aug 16 04:41:06 cunygrid sshd[12535]: Invalid user alexis from
Aug 16 04:41:06 cunygrid sshd[12543]: input_userauth_request: invalid user alexis
Aug 16 04:41:06 cunygrid sshd[12535]: pam_unix(sshd:auth): check pass; user unknown
Aug 16 04:41:06 cunygrid sshd[12535]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=

7 Cyber Attack Units of Observation Each observation is a record. A record is a set of attribute-value pairs. Attributes: Date, Time, Connection ID, Login ID, IP address, Login ID accepted/rejected, Password, Password accepted/rejected, Disconnection.

8 Grouping Units of Observation Thread by connection ID. Thread by login ID. Thread by IP address. Thread by time interval.

9 Extracting Features For a set of records in the given time interval with the same IP address: number of records, number of login ID records, number of login IDs accepted, number of disconnects.

10 Extracting Features For a set of records in the given time interval with the same IP address and login ID: number of records, number of passwords attempted, number of passwords accepted, number of disconnects. A sketch of this grouping and counting step follows below.
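
A minimal sketch of the grouping and feature-extraction step in Python. The parsed record fields, event names, and values below are assumptions for illustration, not the course's actual log parser or feature list.

```python
from collections import defaultdict

# Hypothetical parsed records: in practice these would come from parsing the
# secure log; the field names and values here are made up for illustration.
records = [
    {"ip": "10.0.0.5", "login_id": "alexis", "event": "invalid_user"},
    {"ip": "10.0.0.5", "login_id": "alexis", "event": "auth_failure"},
    {"ip": "10.0.0.5", "login_id": "root",   "event": "auth_failure"},
    {"ip": "10.0.0.9", "login_id": "bob",    "event": "accepted"},
    {"ip": "10.0.0.9", "login_id": "bob",    "event": "disconnect"},
]

# Group by IP address (one of the threading choices on slide 8).
groups = defaultdict(list)
for r in records:
    groups[r["ip"]].append(r)

# Extract a feature vector per group: counts of records, distinct login IDs,
# accepted logins, and disconnects.
for ip, recs in groups.items():
    features = (
        len(recs),
        len({r["login_id"] for r in recs}),
        sum(r["event"] == "accepted" for r in recs),
        sum(r["event"] == "disconnect" for r in recs),
    )
    print(ip, features)
```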

11 Grouping and Feature Extracting The process of grouping the raw observation units into the unit of observation used in the machine learning process, and the process of defining the feature vector for the grouped units, is part of the art of Machine Learning.

12 Be Rational, Use Expected Value When faced with a number of actions, each of which could give rise to more than one possible outcome with different probabilities, the rational approach is to: identify all possible outcomes; determine their values or utilities (positive or negative); determine the probabilities of each outcome under each course of action; multiply value by probability and sum to give an expected value. The best action is the one that gives rise to the highest expected value. (Blaise Pascal, Pensées, 1670)

13 Highest Expected Value a_1, a_2: two possible actions. e_1, e_2, e_3: three possible economic gains, with values e_1 = 1, e_2 = 3, e_3 = 0. The probabilities of the three gains under action a_1 are P(e_1 | a_1), P(e_2 | a_1), P(e_3 | a_1); under action a_2 they are P(e_1 | a_2), P(e_2 | a_2), P(e_3 | a_2). Weighting each gain by its probability P(e | a) and summing gives the expected value of each action: 1.75 under a_1 and 1.1 under a_2. Choose action a_1.
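
A minimal sketch of this expected-value comparison in Python. The gains come from the slide; the probability vectors are assumed values chosen only so that the expected gains come out to 1.75 and 1.1.

```python
# Expected-value action selection (Pascal's rule on slide 12).
# The gains come from the slide; the probability vectors are assumed values
# that happen to reproduce the slide's expected gains of 1.75 and 1.1.
gains = [1.0, 3.0, 0.0]                      # e_1, e_2, e_3

prob_given_action = {
    "a1": [0.25, 0.50, 0.25],                # assumed P(e_i | a_1)
    "a2": [0.50, 0.20, 0.30],                # assumed P(e_i | a_2)
}

expected = {
    a: sum(g * p for g, p in zip(gains, probs))
    for a, probs in prob_given_action.items()
}
best = max(expected, key=expected.get)

print(expected)        # {'a1': 1.75, 'a2': 1.1}  (up to float rounding)
print("choose", best)  # choose a1
```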

14 Bayes Theorem c: class. d: observed data. P(d | c): the class conditional probabilities. Current state of knowledge: P(c). New evidence: the observed data d. What is the updated probability of c, P(c | d)?
P(c \mid d) = P(c, d) / P(d)
P(d \mid c) = P(c, d) / P(c)
P(c \mid d) = P(d \mid c) P(c) / P(d)
Thomas Bayes first showed how to use new evidence to update beliefs.
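
A small numeric sketch of this update; the two classes and all probability values below are assumptions for illustration.

```python
# Bayes theorem: P(c | d) = P(d | c) P(c) / P(d), with
# P(d) = sum over classes of P(d | c) P(c).  All numbers are assumed.
prior = {"c1": 0.7, "c2": 0.3}               # P(c)
likelihood = {"c1": 0.2, "c2": 0.6}          # P(d | c) for one observed d

evidence = sum(likelihood[c] * prior[c] for c in prior)            # P(d)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}

print(posterior)   # approximately {'c1': 0.4375, 'c2': 0.5625}
```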

15 Units Observed In what follows, the unit being observed is assumed to be that which arises from the grouping process. The measurement is assumed to be the feature vector.

16 Basic Process Outline In the pattern discrimination or pattern identification process, a unit is observed or measured and a category assignment is made on the basis of the measurement. This event can be characterized by its three parts: the unit has true category identification c_j, a member of the set C; the decision rule assignment is to category c_k, a member of the set C; the measurement d is made from the set D.

17 The unit has true category identification c_j, a member of the set C; the decision rule assignment is to category c_k, a member of the set C; the measurement d is made from the set D. Denote this event by (c_j, c_k, d). It has a probability of occurring: P(True = c_j, Assigned = c_k, Meas = d).

18 Economic Consequences The act of making the assignment carries consequences, economically or in terms of utility. Assumption These consequences depend only on which category is the true category identification for the unit and which category is the assigned category identification for the unit. They do not depend on which particular unit is being assigned or on what measurement values the unit to be assigned has.

19 Economic Consequences: Gain Matrix The economic gain matrix has one row for each true category c_1, ..., c_K and one column for each assigned category c_1, ..., c_K; the entry in row c_j and column c_k is the gain e(c_j, c_k). The Economic Gain Matrix is Application Dependent.

20 Confusion Matrix The confusion matrix has one row for each true category c_1, ..., c_K and one column for each assigned category c_1, ..., c_K; the entry in row c_j and column c_k is the probability P_{TA}(c_j, c_k). The Confusion Matrix Tells How Correct the Machine Learning Is.

21 Probability of Correct Identification Definition The probability of correct identification, P_c, is the probability that the True Identification equals the Assigned Identification. It is defined by
P_c = \sum_{j=1}^{K} P_{TA}(c_j, c_j),
the sum of the diagonal entries of the confusion matrix.

22 Confusion Probabilities What is the probability of observing a unit whose true class identification is c_j and whose assigned class identification is c_k?
P_{TA}(c_j, c_k) = \sum_{d \in D} P_{TA}(c_j, c_k, d)

23 Expected Economic Gain For each true-assigned category identification pair (c_j, c_k) we know its economic gain e(c_j, c_k) and its probability P_{TA}(c_j, c_k).

24 Expected Economic Gain How do we compute the average or expected value of the consequence?
E[e] = \sum_{j=1}^{K} \sum_{k=1}^{K} e(c_j, c_k) P_{TA}(c_j, c_k)
     = \sum_{j=1}^{K} \sum_{k=1}^{K} e(c_j, c_k) \sum_{d \in D} P_{TA}(c_j, c_k, d)

25 Expected Gain The confusion matrix P_{TA}(c_j, c_k) and the economic gain matrix e(c_j, c_k) are both K \times K arrays indexed by true category (rows) and assigned category (columns). Combining them entry by entry gives
E[e] = \sum_{j=1}^{K} \sum_{k=1}^{K} e(c_j, c_k) P_{TA}(c_j, c_k)
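
A minimal sketch of this computation; the 2 x 2 confusion and gain matrices below are assumed numbers, not values from the lecture.

```python
# Expected gain E[e] = sum_j sum_k e(c_j, c_k) * P_TA(c_j, c_k).
# Rows index the true class, columns the assigned class.  All numbers assumed.
confusion = [[0.45, 0.05],     # P_TA(c_1, c_1), P_TA(c_1, c_2)
             [0.10, 0.40]]     # P_TA(c_2, c_1), P_TA(c_2, c_2)
gain = [[1.0, -1.0],           # e(c_1, c_1), e(c_1, c_2)
        [-2.0, 2.0]]           # e(c_2, c_1), e(c_2, c_2)

expected_gain = sum(
    gain[j][k] * confusion[j][k]
    for j in range(len(gain))
    for k in range(len(gain[j]))
)
print(expected_gain)   # 1.0 (up to float rounding)
```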

26 Analysis What are the processes which lead to the event (c_j, c_k, d) occurring? 1 There is a unit having measurement d whose true class identification is c_j. 2 Using only the measurement d, the decision rule makes the assignment to class c_k.

27 Fair Game Assumption Nature and the Decision Rule are not in collusion. In making the class assignment, the decision rule must use only the measurement data; it cannot use the true class identification. Nature uses only d in determining the true class identification c_j; the decision rule uses only d in determining the assigned class identification c_k.

28 Fair Game Assumption c_k: assigned class. c_j: true class. d: measurement data.
P_{AT}(c_k \mid (c_j, d)) = P_A(c_k \mid d)
Therefore,
P_{AT}(c_k, c_j \mid d) = P_{AT}(c_k, c_j, d) / P(d)
                        = P_{AT}(c_k \mid c_j, d) P_T(c_j, d) / P(d)
                        = P_A(c_k \mid d) P_T(c_j \mid d)

29 Fair Game Assumption Nature and the decision rule are not in collusion. Given that a unit has measurement d, the assigned class which the decision rule determines and the true class which nature determines are statistically independent:
P_{AT}(c_k, c_j \mid d) = P_A(c_k \mid d) P_T(c_j \mid d)

30 The Probabilistic Decision Rule Let f_d(c_k) be the conditional probability that the decision rule assigns a unit to class c_k given that the unit has measurement d. We call f the decision rule.

31 The Probabilistic Decision Rule Since, conditioned on measurement d, the true and assigned class are statistically independent,
P_{TA}(c_j, c_k \mid d) = P_T(c_j \mid d) f_d(c_k)

32 The Probabilistic Decision Rule By definition of conditional probability,
P_{TA}(c_j, c_k \mid d) = P_{TA}(c_j, c_k, d) / P(d)
P_{TA}(c_j, c_k, d) = P_{TA}(c_j, c_k \mid d) P(d)

33 The Probabilistic Decision Rule
P_{TA}(c_j, c_k, d) = P_{TA}(c_j, c_k \mid d) P(d)
                    = P_T(c_j \mid d) f_d(c_k) P(d)
                    = P_T(c_j, d) f_d(c_k)

34 Expected Gain The expected gain is then
E[e; f] = \sum_{j=1}^{K} \sum_{k=1}^{K} e(c_j, c_k) \sum_{d \in D} P_{TA}(c_j, c_k, d)
        = \sum_{j=1}^{K} \sum_{k=1}^{K} e(c_j, c_k) \sum_{d \in D} P_T(c_j, d) f_d(c_k)
        = \sum_{d \in D} \sum_{k=1}^{K} f_d(c_k) \sum_{j=1}^{K} e(c_j, c_k) P_T(c_j, d)

35 Maximizing a_1, ..., a_L is a sequence of L numbers. p_1, ..., p_L is a sequence of unknown numbers satisfying p_i \ge 0 and \sum_{i=1}^{L} p_i = 1. Find an upper bound for \sum_{i=1}^{L} a_i p_i over all possible such sequences p_1, ..., p_L satisfying the constraints.

36 Maximizing Since p_i \ge 0 and a_i \le \max_{j=1,...,L} a_j for i = 1, ..., L,
a_i p_i \le (\max_{j=1,...,L} a_j) p_i

37 Maximizing Since the inequality holds for each i, it must also hold for the sum:
\sum_{i=1}^{L} a_i p_i \le (\max_{j=1,...,L} a_j) \sum_{i=1}^{L} p_i = \max_{j=1,...,L} a_j

38 Maximizing If p_1, ..., p_L is a sequence of L numbers satisfying p_i \ge 0 and \sum_{i=1}^{L} p_i = 1, then
\sum_{i=1}^{L} a_i p_i \le \max_{j=1,...,L} a_j.
Thus \max_{j=1,...,L} a_j is an upper bound for \sum_{i=1}^{L} a_i p_i.

39 Maximizing Is this upper bound, \max_{j=1,...,L} a_j, the lowest possible upper bound for \sum_{i=1}^{L} a_i p_i? Yes, since it is achievable by an appropriate choice of the p_i: set p_i = 0 if a_i < \max_{j=1,...,L} a_j, and set the remaining p_i in any way so that p_i \ge 0 and \sum_{i=1}^{L} p_i = 1.

40 Maximizing Suppose a_{ij}, i = 1, 2, ..., L and j = 1, 2, ..., J, is a sequence of numbers and p_{ij} satisfies p_{ij} \ge 0 and \sum_{j=1}^{J} p_{ij} = 1 for i = 1, 2, ..., L. Then the lowest upper bound for \sum_{j=1}^{J} a_{ij} p_{ij} is \max_{k=1,...,J} a_{ik}, for each i = 1, 2, ..., L.

41 Maximizing Since p_{ij} \ge 0 and a_{ij} \le \max_k a_{ik} for each i and j,
p_{ij} a_{ij} \le p_{ij} \max_k a_{ik} for each i and j.
Summing over j,
\sum_{j=1}^{J} p_{ij} a_{ij} \le \sum_{j=1}^{J} p_{ij} \max_k a_{ik} = \max_k a_{ik} \sum_{j=1}^{J} p_{ij} = \max_k a_{ik}.
Thus \max_k a_{ik} is an upper bound for \sum_{j=1}^{J} p_{ij} a_{ij}.

42 Maximizing \max_k a_{ik} is the lowest upper bound for \sum_{j=1}^{J} p_{ij} a_{ij} since it is achievable: set p_{ij} = 0 if a_{ij} < \max_k a_{ik}, and set the remaining p_{ij} in any way so that p_{ij} \ge 0 and \sum_{j=1}^{J} p_{ij} = 1.
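
A small numeric check of this argument; the a values and the probability vector below are assumptions for illustration.

```python
# For any probability vector p, sum_i a_i * p_i <= max_i a_i, and the bound
# is met by putting all probability on an index achieving the max.
# The numbers below are assumed for illustration.
a = [2.0, 5.0, 3.0]
p = [0.2, 0.5, 0.3]                          # any p_i >= 0 summing to 1

weighted = sum(ai * pi for ai, pi in zip(a, p))
bound = max(a)
print(weighted, "<=", bound)                 # prints roughly: 3.8 <= 5.0

# Achieving the bound: all mass on an index attaining the maximum.
p_star = [1.0 if ai == bound else 0.0 for ai in a]
print(sum(ai * pi for ai, pi in zip(a, p_star)))   # 5.0
```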

43 Expected Gain The expected gain is then
E[e; f] = \sum_{j=1}^{K} \sum_{k=1}^{K} e(c_j, c_k) \sum_{d \in D} P_{TA}(c_j, c_k, d)
        = \sum_{j=1}^{K} \sum_{k=1}^{K} e(c_j, c_k) \sum_{d \in D} P_T(c_j, d) f_d(c_k)
        = \sum_{d \in D} \sum_{k=1}^{K} f_d(c_k) \sum_{j=1}^{K} e(c_j, c_k) P_T(c_j, d)

44 Maximizing Expected Gain But for any conditional probability f_d(c),
\sum_{d \in D} \sum_{k=1}^{K} f_d(c_k) \sum_{j=1}^{K} e(c_j, c_k) P_T(c_j, d) \le \sum_{d \in D} \max_n \sum_{j=1}^{K} e(c_j, c_n) P_T(c_j, d)

45 Maximizing Expected Gain If for each d there exists a unique m = m(d) such that
\sum_{j=1}^{K} e(c_j, c_k) P_T(c_j, d) \le \sum_{j=1}^{K} e(c_j, c_m) P_T(c_j, d) for each k,
then setting f_d(c_m) = 1 and f_d(c_k) = 0 for k \ne m achieves the upper bound.

46 Maximizing Expected Gain In general, the upper bound is achievable by using a decision rule f defined by f_d(c_k) = 0 for each c_k and d such that
\sum_{j=1}^{K} e(c_j, c_k) P_T(c_j, d) < \max_n \sum_{j=1}^{K} e(c_j, c_n) P_T(c_j, d),
and the remaining f_d(c_k) set in any way so that f_d(c_k) \ge 0 and \sum_{k=1}^{K} f_d(c_k) = 1.

47 Bayes Decision Rule Definition A decision rule f is called a Bayes decision rule if and only if E[e; f] \ge E[e; g] for any decision rule g.

48 Bayes Decision Rule A Bayes rule can always be implemented as a deterministic decision rule. For a given measurement d, take the probabilities P_T(c_1, d), ..., P_T(c_K, d) for the true classes together with the gain matrix entries e(c_j, c_k), and for each candidate assignment c_k compute
\sum_{j=1}^{K} e(c_j, c_k) P_T(c_j, d).

49 Bayes Decision Rule Given measurement d, assign to any class c_k satisfying
\sum_{j=1}^{K} e(c_j, c_k) P_T(c_j, d) \ge \sum_{j=1}^{K} e(c_j, c_n) P_T(c_j, d), n = 1, ..., K.
If the economic gain matrix e is the identity, then assign to any class c_k satisfying
P_T(c_k, d) \ge P_T(c_n, d), n = 1, ..., K,
and
E[e] = \sum_{d \in D} \max_{n=1,...,K} P_T(c_n, d) = P_correct
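
A sketch of this rule as code; the joint probabilities P_T(c, d) and the identity gain matrix below are assumed numbers, not values from the lecture.

```python
# Discrete Bayes rule: for each measurement d, assign the class c_k that
# maximizes sum_j e(c_j, c_k) * P_T(c_j, d).  All numbers are assumed.
P_T = {                      # joint probabilities P_T(c, d), summing to 1
    ("c1", "d1"): 0.30, ("c1", "d2"): 0.10, ("c1", "d3"): 0.10,
    ("c2", "d1"): 0.05, ("c2", "d2"): 0.25, ("c2", "d3"): 0.20,
}
classes = ["c1", "c2"]
measurements = ["d1", "d2", "d3"]
e = {("c1", "c1"): 1.0, ("c1", "c2"): 0.0,   # gain matrix e(true, assigned)
     ("c2", "c1"): 0.0, ("c2", "c2"): 1.0}

def bayes_rule(d):
    """Return the class assignment for measurement d."""
    return max(classes,
               key=lambda ck: sum(e[(cj, ck)] * P_T[(cj, d)] for cj in classes))

f = {d: bayes_rule(d) for d in measurements}
print(f)   # {'d1': 'c1', 'd2': 'c2', 'd3': 'c2'}
```

With the identity gain matrix, as the slide notes, this reduces to picking the class with the largest P_T(c, d) for each measurement.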

50 Bayes Decision Rule A worked example: a table of P_T(c, d) for two true classes c_1, c_2 and three measurements d_1, d_2, d_3, together with a gain matrix e, determines a decision rule f_d(c) assigning a class to each measurement; the resulting expected gain is E[e] = 1.02.

51 The Confusion Matrix
P_{TA}(c_1, c_1) = \sum_{d \in D} f_d(c_1) P_T(c_1, d)
P_{TA}(c_1, c_2) = \sum_{d \in D} f_d(c_2) P_T(c_1, d)
P_{TA}(c_2, c_1) = \sum_{d \in D} f_d(c_1) P_T(c_2, d)
P_{TA}(c_2, c_2) = \sum_{d \in D} f_d(c_2) P_T(c_2, d)
These four entries form the 2 \times 2 confusion matrix P_{TA}, with rows indexed by the true class and columns by the assigned class.

52 The Confusion Matrix An example: a table of P_T(c, d) for true classes c_1, c_2 and measurements d_1, d_2, d_3, with the decision rule f_d assigning c_2, c_2, c_1 to d_1, d_2, d_3 respectively, yields the corresponding 2 \times 2 confusion matrix P_{TA} together with the true-class marginals P_T.

53 The Confusion Matrix The same P_T(c, d) table with a different decision rule f_d, now assigning c_2, c_1, c_1 to d_1, d_2, d_3, yields a different confusion matrix P_{TA}.
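
A sketch of how the confusion matrix follows from a deterministic decision rule and the joint probabilities; it reuses the assumed P_T and rule f from the sketch after slide 49.

```python
# Confusion matrix P_TA(c_j, c_k) = sum_d f_d(c_k) * P_T(c_j, d) for a
# deterministic rule f (f_d(c_k) is 1 for the assigned class, else 0).
# The joint probabilities and the rule below are assumed values.
P_T = {
    ("c1", "d1"): 0.30, ("c1", "d2"): 0.10, ("c1", "d3"): 0.10,
    ("c2", "d1"): 0.05, ("c2", "d2"): 0.25, ("c2", "d3"): 0.20,
}
f = {"d1": "c1", "d2": "c2", "d3": "c2"}     # assigned class per measurement
classes = ["c1", "c2"]

confusion = {
    (cj, ck): sum(P_T[(cj, d)] for d, assigned in f.items() if assigned == ck)
    for cj in classes for ck in classes
}
print(confusion)
# {('c1', 'c1'): 0.3, ('c1', 'c2'): 0.2, ('c2', 'c1'): 0.05, ('c2', 'c2'): 0.45}

# Probability of correct identification = sum of the diagonal entries.
print(sum(confusion[(c, c)] for c in classes))   # 0.75 (up to float rounding)
```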

54 Bayes Decision Rule Given measurement d, assign to any class c_k satisfying
\sum_{j=1}^{K} e(c_j, c_k) P_T(c_j, d) \ge \sum_{j=1}^{K} e(c_j, c_n) P_T(c_j, d), n = 1, ..., K.
Or equivalently, dividing both sides by P(d), assign to any class c_k satisfying
\sum_{j=1}^{K} e(c_j, c_k) P_T(c_j, d) / P(d) \ge \sum_{j=1}^{K} e(c_j, c_n) P_T(c_j, d) / P(d), n = 1, ..., K.

55 Bayes Decision Rule Given measurement d, assign to any class c_k satisfying
\sum_{j=1}^{K} e(c_j, c_k) P_T(c_j \mid d) \ge \sum_{j=1}^{K} e(c_j, c_n) P_T(c_j \mid d), n = 1, ..., K.
Therefore, the only probability information we need for a Bayes rule is the conditional probabilities P_T(c_j \mid d), j = 1, ..., K.

56 Two Class Bayes Rule Assume the economic gain for a correct assignment is greater than the gain for an incorrect assignment: e(c_1, c_1) > e(c_1, c_2) and e(c_2, c_2) > e(c_2, c_1). Assume all probabilities are positive. Assign to class c_1 if
P(c_1 \mid d) e(c_1, c_1) + P(c_2 \mid d) e(c_2, c_1) \ge P(c_1 \mid d) e(c_1, c_2) + P(c_2 \mid d) e(c_2, c_2);
otherwise assign to class c_2. This inequality can be simplified:
P(c_1 \mid d) e(c_1, c_1) - P(c_1 \mid d) e(c_1, c_2) \ge P(c_2 \mid d) e(c_2, c_2) - P(c_2 \mid d) e(c_2, c_1)
P(c_1 \mid d) (e(c_1, c_1) - e(c_1, c_2)) \ge P(c_2 \mid d) (e(c_2, c_2) - e(c_2, c_1))
P(c_1 \mid d) / P(c_2 \mid d) \ge (e(c_2, c_2) - e(c_2, c_1)) / (e(c_1, c_1) - e(c_1, c_2))

57 Two Class Bayes Rule: Odds Ratio Assign to class c_1 if
P(c_1 \mid d) / P(c_2 \mid d) \ge (e(c_2, c_2) - e(c_2, c_1)) / (e(c_1, c_1) - e(c_1, c_2));
otherwise assign to class c_2. The odds ratio R in favor of class c_1 is defined by R(d) = P(c_1 \mid d) / P(c_2 \mid d). Assign to class c_1 if
R(d) \ge (e(c_2, c_2) - e(c_2, c_1)) / (e(c_1, c_1) - e(c_1, c_2)) = \Theta

58 Two Class Bayes Rule: Likelihood Ratio The likelihood ratio in favor of class c_1 is defined by
L(d) = P(d \mid c_1) / P(d \mid c_2) = [P(c_1 \mid d) P(d) / P(c_1)] / [P(c_2 \mid d) P(d) / P(c_2)].
Starting from the assignment condition P(c_1 \mid d) / P(c_2 \mid d) \ge (e(c_2, c_2) - e(c_2, c_1)) / (e(c_1, c_1) - e(c_1, c_2)), assign to class c_1 if
L(d) \ge [(e(c_2, c_2) - e(c_2, c_1)) / (e(c_1, c_1) - e(c_1, c_2))] \cdot [P(c_2) / P(c_1)] = \Theta
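
A minimal sketch of the two-class likelihood-ratio test; the gains, priors, and class-conditional probabilities below are assumed numbers for illustration.

```python
# Two-class likelihood-ratio test: assign c1 when L(d) >= Theta, where
# Theta = (e22 - e21) / (e11 - e12) * P(c2) / P(c1).  All numbers assumed.
e11, e12, e21, e22 = 1.0, 0.0, 0.0, 1.0      # gain matrix e(true, assigned)
prior_c1, prior_c2 = 0.6, 0.4                # P(c1), P(c2)

# Class-conditional probabilities P(d | c) over a discrete measurement space.
p_d_given_c1 = {"d1": 0.7, "d2": 0.2, "d3": 0.1}
p_d_given_c2 = {"d1": 0.1, "d2": 0.4, "d3": 0.5}

theta = (e22 - e21) / (e11 - e12) * prior_c2 / prior_c1

def assign(d):
    L = p_d_given_c1[d] / p_d_given_c2[d]    # likelihood ratio L(d)
    return "c1" if L >= theta else "c2"

print(theta)                                  # roughly 0.667
print({d: assign(d) for d in ["d1", "d2", "d3"]})
# {'d1': 'c1', 'd2': 'c2', 'd3': 'c2'}
```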

59 Two Class Bayes Rule: Class Conditional Probability Assign to class c_1 if
P(c_1 \mid d) e(c_1, c_1) + P(c_2 \mid d) e(c_2, c_1) \ge P(c_1 \mid d) e(c_1, c_2) + P(c_2 \mid d) e(c_2, c_2);
otherwise assign to class c_2. Then
P(c_1 \mid d) (e(c_1, c_1) - e(c_1, c_2)) \ge P(c_2 \mid d) (e(c_2, c_2) - e(c_2, c_1))
P(c_1 \mid d) (e(c_1, c_1) - e(c_1, c_2)) \ge (1 - P(c_1 \mid d)) (e(c_2, c_2) - e(c_2, c_1))
P(c_1 \mid d) (e(c_1, c_1) - e(c_1, c_2) + e(c_2, c_2) - e(c_2, c_1)) \ge e(c_2, c_2) - e(c_2, c_1)
P(c_1 \mid d) \ge (e(c_2, c_2) - e(c_2, c_1)) / (e(c_1, c_1) - e(c_1, c_2) + e(c_2, c_2) - e(c_2, c_1)) = \Theta

60 Summary Given measurement d, assign to any class c_k satisfying
\sum_{j=1}^{K} e(c_j, c_k) P_T(c_j \mid d) \ge \sum_{j=1}^{K} e(c_j, c_n) P_T(c_j \mid d), n = 1, ..., K.
Therefore, the only probability information we need for a Bayes rule is the conditional probabilities P_T(c_j \mid d), j = 1, ..., K.

61 Summary
P_T(c \mid d) = P_T(d \mid c) P(c) / P(d) = P_T(d \mid c) P(c) / \sum_{\gamma \in C} P_T(d \mid \gamma) P(\gamma)
For a given c, P_T(d \mid c) is the class conditional distribution and P_T(c) is the prior probability of class c. Estimating P_T(d \mid c) from data is a major problem in Machine Learning.

62 Two Class Bayes Rule: Odds Ratio Assign to class c_1 if
P(c_1 \mid d) / P(c_2 \mid d) \ge (e(c_2, c_2) - e(c_2, c_1)) / (e(c_1, c_1) - e(c_1, c_2));
otherwise assign to class c_2. The odds ratio R in favor of class c_1 is defined by R(d) = P(c_1 \mid d) / P(c_2 \mid d). Assign to class c_1 if
R(d) \ge (e(c_2, c_2) - e(c_2, c_1)) / (e(c_1, c_1) - e(c_1, c_2)) = \Theta

63 Two Class Bayes Rule: Likelihood Ratio The likelihood ratio in favor of class c_1 is defined by L(d) = P(d \mid c_1) / P(d \mid c_2). Assign to class c_1 if
L(d) \ge [(e(c_2, c_2) - e(c_2, c_1)) / (e(c_1, c_1) - e(c_1, c_2))] \cdot [P(c_2) / P(c_1)] = \Theta

64 Two Class Assign to class c_1 if
P(c_1 \mid d) \ge (e(c_2, c_2) - e(c_2, c_1)) / (e(c_1, c_1) - e(c_1, c_2) + e(c_2, c_2) - e(c_2, c_1)) = \Theta

65 Major Problem Estimating P_T(d \mid c) from data is a major problem in Machine Learning.

66 Mid Term Project Part 1
Measurement Space: D = L_1 \times L_2 \times \cdots \times L_N. A measurement d has N components, d = (d_1, ..., d_N).
Input |L_n| = M_n, n = 1, ..., N
Given (d_1, ..., d_N), compute the linear address
Input K, the number of classes
Input e, the K \times K economic gain matrix
Input P(d \mid c), the class conditional probabilities
Input P(c), the prior probabilities
Compute P(c \mid d)
Compute the Discrete Bayes Rule f_d(c)
Compute the K \times K Confusion Matrix
Compute the Expected Gain
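
One possible sketch of the linear-address step (row-major, mixed-radix indexing); the component cardinalities M_n and the example measurement are assumptions, and the project may specify a different addressing convention.

```python
# Linear address of d = (d_1, ..., d_N), where component n takes values
# 0 .. M_n - 1, using row-major (last index varies fastest) ordering.
# The cardinalities and the example measurement are assumed values.
def linear_address(d, M):
    addr = 0
    for d_n, M_n in zip(d, M):
        addr = addr * M_n + d_n
    return addr

M = [4, 3, 5]                  # assumed component cardinalities M_1..M_N
d = (2, 1, 3)                  # assumed measurement, zero-based components

print(linear_address(d, M))    # 2*(3*5) + 1*5 + 3 = 38
```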
