Insect ID. Antennae Length. Insect Class. Abdomen Length


1 We have seen that we can do machine learning on data that is in the nice flat file format: rows are objects, columns are features. Taking a real problem and massaging it into this format is domain dependent, but often the most fun part of machine learning. Let's see just one example. [Table: Insect ID, Abdomen Length, Antennae Length, Insect Class; ten insects labeled Grasshopper or Katydid. The numeric values did not survive transcription.]

2 Western Pipistrelle (Parastrellus hesperus). Photo by Michael Durham.

3 Western pipistrelle calls. [Figure: a spectrogram of a bat call.]

4 We can easily measure two features of bat calls: their characteristic frequency and their call duration. [Table: Bat ID, Characteristic frequency, Call duration (ms), Bat Species; the example rows are Western pipistrelle calls whose numeric values did not survive transcription.]

5-8 [image-only slides]

9 Classification. We have seen two classification techniques: the simple linear classifier and nearest neighbor. Let us see two more techniques: the decision tree and naïve Bayes. There are other techniques (neural networks, support vector machines, ...) that we will not consider.

10 I have a box of apples. If Pr(X = good) = p, then Pr(X = bad) = 1 - p. The entropy of X is given by the binary entropy function H(X) = -p log2(p) - (1 - p) log2(1 - p), which attains its maximum value when p = 1/2.

11 Decision Tree Classifier (Ross Quinlan). [Scatterplot: Antenna Length versus Abdomen Length.] The tree: Abdomen Length > 7.1? yes -> Katydid; no -> Antenna Length > 6.0? yes -> Katydid; no -> Grasshopper.

12 Decision trees predate computers. A classic dichotomous key: Antennae shorter than body? Yes -> Grasshopper; No -> 3 Tarsi? Yes -> Cricket; No -> Foretibia has ears? Yes -> Katydid; No -> Camel Cricket.

13 Decision Tree Classification. A decision tree is a flow-chart-like tree structure: an internal node denotes a test on an attribute, a branch represents an outcome of the test, and leaf nodes represent class labels or class distributions. Decision tree generation consists of two phases: tree construction (at the start, all the training examples are at the root; partition the examples recursively based on selected attributes) and tree pruning (identify and remove branches that reflect noise or outliers). Use of a decision tree: classify an unknown sample by testing its attribute values against the tree.

14 How do we construct the decision tree? The basic algorithm is greedy: the tree is constructed in a top-down, recursive, divide-and-conquer manner. At the start, all the training examples are at the root. Attributes are categorical (if continuous-valued, they can be discretized in advance). Examples are partitioned recursively based on selected attributes; test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain). Partitioning stops when: all samples for a given node belong to the same class; there are no remaining attributes for further partitioning (majority voting is employed to label the leaf); or there are no samples left. A minimal sketch follows.
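
To make the greedy recursion concrete, here is a minimal Python sketch (not the lecture's own code). It assumes each row is a dict of categorical feature values plus a class label, and it leans on a gain(rows, feature, label) helper implementing the splitting measure defined on the next slide; all names here are illustrative.

```python
def build_tree(rows, features, label):
    """Greedy top-down, divide-and-conquer tree construction (ID3-style sketch).
    Assumes a gain(rows, feature, label) helper (see the next slide)."""
    labels = [r[label] for r in rows]
    # Stop: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left -- label the leaf by majority vote.
    if not features:
        return max(set(labels), key=labels.count)
    # Greedy step: pick the attribute with the best splitting measure.
    best = max(features, key=lambda f: gain(rows, f, label))
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        rest = [f for f in features if f != best]
        tree[best][value] = build_tree(subset, rest, label)
    return tree
```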

15 Information Gain as a Splitting Criterion. Select the attribute with the highest information gain (information gain is the expected reduction in entropy). Assume there are two classes, P and N, and let the set of examples S contain p elements of class P and n elements of class N. The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as E(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)), where 0 log2(0) is defined as 0.
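
As a sanity check, this two-class entropy is a few lines of Python (a sketch, not part of the slides):

```python
import math

def entropy(p, n):
    """E(S) for a set with p examples of class P and n of class N.
    The convention 0 * log2(0) = 0 is applied."""
    e = 0.0
    for count in (p, n):
        frac = count / (p + n)
        if frac > 0:
            e -= frac * math.log2(frac)
    return e

print(entropy(4, 5))  # 0.9911..., the 4F/5M value used on the next slides
```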

16 Information Gain in Decision Tree Induction. Assume that using attribute A, the current set will be partitioned into some number of child sets. The encoding information that would be gained by branching on A is Gain(A) = E(current set) - E(all child sets), where each child set's entropy is weighted by the fraction of examples it contains. Note: entropy is at its minimum when the collection of objects is completely uniform, i.e., all of one class.
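
Using the entropy helper sketched above, the weighted gain is a one-liner (the (p, n) count-pair encoding is my assumption):

```python
def information_gain(parent, children):
    """Gain(A) = E(current set) - weighted sum of E(child set).
    parent and each child are (p, n) class-count pairs."""
    total = sum(p + n for p, n in children)
    weighted = sum(((p + n) / total) * entropy(p, n) for p, n in children)
    return entropy(*parent) - weighted
```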

17 [Table: Person, Hair Length, Weight, Age, Class. Nine labeled people: Homer (M), Marge (F), Bart (M), Lisa (F), Maggie (F), Abe (M), Selma (F), Otto (M), Krusty (M), plus one unlabeled person, Comic (?). The numeric attribute values did not survive transcription.]

18 Let us try splitting on Hair Length. Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)), so Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911. Gain(A) = E(current set) - E(all child sets): Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911.

19 Let us try splitting on Weight. Entropy(4F,5M) = 0.9911, as before. Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900.

20 Let us try splitting on Age. Entropy(4F,5M) = 0.9911, as before. Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183.
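
Continuing the sketch above, the three candidate splits can be checked numerically. The branch weights (4/9 and 5/9, 5/9 and 4/9, 6/9 and 3/9) come from the slides; the per-branch (female, male) counts below are reconstructed to be consistent with those weights and entropies, so treat them as illustrative.

```python
splits = {
    "Hair Length <= 5": [(1, 3), (3, 2)],  # (F, M) counts on the two branches
    "Weight <= 160":    [(4, 1), (0, 4)],
    "Age <= 40":        [(3, 3), (1, 2)],
}
for test, children in splits.items():
    print(test, round(information_gain((4, 5), children), 4))
# Hair Length <= 5 -> 0.0911, Weight <= 160 -> 0.59, Age <= 40 -> 0.0183
```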

21 Of the three features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified, so we simply recurse! This time we find that we can split on Hair Length, and we are done: Weight <= 160? no -> Male; yes -> Hair Length <= 2?

22 We don't need to keep the data around, just the test conditions: Weight <= 160? no -> Male; yes -> Hair Length <= 2? yes -> Male; no -> Female. How would these people be classified?

23 It is trivial to convert decision trees to rules. The tree Weight <= 160? (no -> Male; yes -> Hair Length <= 2? yes -> Male; no -> Female) becomes: Rules to classify Males/Females: if Weight greater than 160, classify as Male; else if Hair Length less than or equal to 2, classify as Male; else classify as Female.
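
As a small sketch, the converted rules are just nested conditionals (thresholds taken from the slide; the function name is mine):

```python
def classify(weight, hair_length):
    """The learned tree from the slide, written as if/elif/else rules."""
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

print(classify(weight=150, hair_length=8))  # -> Female
```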

24 Once we have learned the decision tree, we don't even need a computer! This decision tree is attached to a medical machine, and is designed to help nurses make decisions about what type of doctor to call. [Figure: decision tree for a typical shared-care setting, applying the system for the diagnosis of prostatic obstructions.]

25 PSA = serum prostate-specific antigen levels; PSAD = PSA density; TRUS = transrectal ultrasound. Garzotto M et al. JCO 2005;23:

26 The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data. When you have few datapoints, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets. [Example tree: Wears green? Yes -> Female; No -> Male.] For example, the rule "Wears green?" perfectly classifies the data, but so does "Mother's name is Jacqueline?", and so does "Has blue shoes?".

27 Avoid Overfitting in Classification. The generated tree may overfit the training data: too many branches, some of which reflect anomalies due to noise or outliers, resulting in poor accuracy for unseen samples. Two approaches to avoid overfitting: Prepruning: halt tree construction early; do not split a node if doing so would drop the goodness measure below a threshold (it is difficult to choose an appropriate threshold). Postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees, then use a set of data different from the training data to decide which is the best pruned tree.

28 Which of the Pigeon Problems can be solved by a decision tree? 1) Deep bushy tree. 2) Useless. 3) Deep bushy tree. The decision tree has a hard time with correlated attributes.

29 Advantages/Disadvantages of Decision Trees. Advantages: easy to understand (doctors love them!); easy to generate rules. Disadvantages: may suffer from overfitting; classifies by rectangular partitioning (so does not handle correlated features very well); can be quite large, so pruning is necessary; does not handle streaming data easily.

30 [image-only slide]

31 How would we go about building a classifier for projectile points?

32 The eight-attribute key: I. Location of maximum blade width: 1. Proximal quarter; 2. Secondmost proximal quarter; 3. Secondmost distal quarter; 4. Distal quarter. II. Base shape: 1. Arc-shaped; 2. Normal curve; 3. Triangular; 4. Folsomoid. III. Basal indentation ratio: 1. No basal indentation; (shallow); (deep). IV. Constriction ratio. V. Outer tang angle: <50. VI. Tang-tip shape: 1. Pointed; 2. Round; 3. Blunt. VII. Fluting: 1. Absent; 2. Present. VIII. Length/width ratio. Example measurements: length = 3.10, width = 1.45, length/width ratio = 2.13.

33 [Decision tree over the eight attributes above; its tests include Fluting = TRUE?, Base Shape = 4?, and Length/width ratio = 2, with leaves Late Archaic and Mississippian.]

34 We could also use the Nearest Neighbor Algorithm. [Examples: Late Archaic, Transitional Paleo, Transitional Paleo, Late Archaic.]

35 It might be better to use the shape directly in the decision tree. Decision Tree for Arrowheads: Lexiang Ye and Eamonn Keogh (2009), Time Series Shapelets: A New Primitive for Data Mining, SIGKDD 2009. [Figure: training data (subset) of Clovis and Avonlea arrowheads; a shapelet dictionary (I, II); the arrowhead decision tree.] The shapelet decision tree classifier achieves an accuracy of 80.0%, while the accuracy of the rotation-invariant one-nearest-neighbor classifier is 68.0%.

36 Naïve Bayes Classifier (Thomas Bayes). We will start off with a visual intuition, before looking at the math.

37 Remember this example? Let's get lots more data. [Scatterplot: Antenna Length versus Abdomen Length for Grasshoppers and Katydids.]

38 With a lot of data, we can build a histogram. Let us just build one for Antenna Length for now. [Histogram of Antenna Length for Katydids and Grasshoppers.]

39 We can leave the histograms as they are, or we can summarize them with two normal distributions. Let us use two normal distributions for ease of visualization in the following slides.

40 We want to classify an insect we have found. Its antennae are 3 units long. How can we classify it? We can just ask ourselves: given the distributions of antennae lengths we have seen, is it more probable that our insect is a Grasshopper or a Katydid? There is a formal way to discuss the most probable classification: p(c_j | d) = probability of class c_j, given that we have observed d. Here, the antennae length is 3.

41 p(c_j | d) = probability of class c_j, given that we have observed d. Antennae length is 3: P(Grasshopper | 3) = 10 / (10 + 2) = 0.833; P(Katydid | 3) = 2 / (10 + 2) = 0.167.

42 p(c_j | d) = probability of class c_j, given that we have observed d. Antennae length is 7: P(Grasshopper | 7) = 3 / (3 + 9) = 0.250; P(Katydid | 7) = 9 / (3 + 9) = 0.750.

43 p(c_j | d) = probability of class c_j, given that we have observed d. Antennae length is 5: P(Grasshopper | 5) = 6 / (6 + 6) = 0.500; P(Katydid | 5) = 6 / (6 + 6) = 0.500.

44 Bayes Classifiers. That was a visual intuition for a simple case of the Bayes classifier, also called: Idiot Bayes, Naïve Bayes, Simple Bayes. We are about to see some of the mathematical formalisms and more examples, but keep in mind the basic idea: find the probability of the previously unseen instance belonging to each class, then simply pick the most probable class.

45 Bayes Classifiers. Bayesian classifiers use Bayes' theorem, which says p(c_j | d) = p(d | c_j) p(c_j) / p(d). Here p(c_j | d) is the probability of instance d being in class c_j; this is what we are trying to compute. p(d | c_j) is the probability of generating instance d given class c_j; we can imagine that being in class c_j causes you to have feature d with some probability. p(c_j) is the probability of occurrence of class c_j; this is just how frequent the class c_j is in our database. p(d) is the probability of instance d occurring; this can actually be ignored, since it is the same for all classes.
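
A minimal sketch of the theorem in code. Since p(d) is the same for every class, we can drop it and renormalize at the end; the dictionary encoding is my assumption:

```python
def posterior(priors, likelihoods):
    """p(c | d) is proportional to p(d | c) * p(c); normalizing over the
    classes plays the role of dividing by p(d)."""
    unnorm = {c: priors[c] * likelihoods[c] for c in priors}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}
```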

46 Assume that we have two classes: c_1 = male and c_2 = female. We have a person whose sex we do not know, say "drew" or d. Classifying drew as male or female is equivalent to asking which is more probable, p(male | drew) or p(female | drew). (Note: Drew can be a male or a female name.) [Photos: Drew Barrymore, Drew Carey.] By Bayes' rule, p(male | drew) = p(drew | male) p(male) / p(drew), where p(drew | male) is the probability of being called drew given that you are a male, p(male) is the probability of being a male, and p(drew) is the probability of being named drew (actually irrelevant, since it is the same for all classes).

47 This is Officer Drew (who arrested me in 1997). Is Officer Drew a male or a female? Luckily, we have a small database with names and sex, and we can use it to apply Bayes' rule: p(c_j | d) = p(d | c_j) p(c_j) / p(d). The database: Drew (Male), Claudia (Female), Drew (Female), Drew (Female), Alberto (Male), Karin (Female), Nina (Female), Sergio (Male).

48 For Officer Drew, using the table from the previous slide and p(c_j | d) = p(d | c_j) p(c_j) / p(d): p(male | drew) = (1/3 * 3/8) / (3/8) = 0.125 / (3/8); p(female | drew) = (2/5 * 5/8) / (3/8) = 0.250 / (3/8). Officer Drew is more likely to be a Female.
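
Reusing the posterior helper sketched above, with the counts read off the name/sex table:

```python
priors = {"Male": 3 / 8, "Female": 5 / 8}        # class frequencies
likelihoods = {"Male": 1 / 3, "Female": 2 / 5}   # p(name = drew | sex)

print(posterior(priors, likelihoods))
# {'Male': 0.333..., 'Female': 0.666...} -> more likely Female
```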

49 Officer Drew IS a female! p(male | drew) = (1/3 * 3/8) / (3/8) = 0.125 / (3/8); p(female | drew) = (2/5 * 5/8) / (3/8) = 0.250 / (3/8).

50 So far we have only considered Bayes classification when we have one attribute (the antennae length, or the name). But we may have many features. How do we use all the features? p(c_j | d) = p(d | c_j) p(c_j) / p(d). The database (Name, Over 170cm, Eye, Hair length, Sex):
Drew, No, Blue, Short, Male
Claudia, Yes, Brown, Long, Female
Drew, No, Blue, Long, Female
Drew, No, Blue, Long, Female
Alberto, Yes, Brown, Short, Male
Karin, No, Blue, Long, Female
Nina, Yes, Brown, Short, Female
Sergio, Yes, Blue, Long, Male

51 To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate p(d | c_j) = p(d_1 | c_j) * p(d_2 | c_j) * ... * p(d_n | c_j). That is, the probability of class c_j generating instance d equals the probability of class c_j generating the observed value for feature 1, multiplied by the probability of class c_j generating the observed value for feature 2, and so on.

52 To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate p(d | c_j) = p(d_1 | c_j) * p(d_2 | c_j) * ... * p(d_n | c_j). Officer Drew is blue-eyed, over 170 cm tall, and has long hair, so p(officer drew | c_j) = p(over_170cm = yes | c_j) * p(eye = blue | c_j) * ... This gives p(officer drew | Female) = 2/5 * 3/5 * ... and p(officer drew | Male) = 2/3 * 2/3 * ...
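
A quick numeric check of the naïve product. The slide shows only the first two factors; the long-hair factors (4/5 for females, 1/3 for males) are computed here from the same eight-row table on slide 50:

```python
from math import prod

# p(over 170cm = yes | sex), p(eye = blue | sex), p(hair = long | sex)
female_factors = [2 / 5, 3 / 5, 4 / 5]
male_factors   = [2 / 3, 2 / 3, 1 / 3]

print("p(drew's features | Female) =", prod(female_factors))  # 0.192
print("p(drew's features | Male)   =", prod(male_factors))    # 0.148...
```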

53 The naïve Bayes classifier is often represented as this type of graph: the class node c_j at the root, with an arrow to each feature term p(d_1 | c_j), p(d_2 | c_j), ..., p(d_n | c_j). Note the direction of the arrows, which state that each class causes certain features with a certain probability.

54 Naïve Bayes is fast and space efficient. We can look up all the probabilities with a single scan of the database and store them in a (small) table, one per (feature, class) pair.
Sex vs. Over 190cm: Male: Yes 0.15, No 0.85; Female: Yes 0.01, No 0.99.
Sex vs. Long Hair: Male: Yes 0.05, No 0.95; Female: Yes 0.70, No 0.30.
(A third table holds the class priors p(Male) and p(Female); its values did not survive transcription.)
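
A sketch of that single scan in Python: one pass over the rows accumulates the class counts and every per-(feature, class) value table. Function and variable names are illustrative.

```python
from collections import Counter, defaultdict

def learn_tables(rows, features, label):
    """One scan over the data; returns class priors and p(value | class)
    lookup tables, one per (feature, class) pair."""
    class_counts = Counter()
    cond = defaultdict(Counter)
    for row in rows:
        c = row[label]
        class_counts[c] += 1
        for f in features:
            cond[(f, c)][row[f]] += 1
    n = sum(class_counts.values())
    priors = {c: k / n for c, k in class_counts.items()}
    tables = {fc: {v: k / sum(cnt.values()) for v, k in cnt.items()}
              for fc, cnt in cond.items()}
    return priors, tables
```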

55 Naïve Bayes is NOT sensitive to irrelevant features. Suppose we are trying to classify a person's sex based on several features, including eye color. (Of course, eye color is completely irrelevant to a person's gender.) p(jessica | c_j) = p(eye = brown | c_j) * p(wears_dress = yes | c_j) * ... So p(jessica | Female) = 9,000/10,000 * 9,975/10,000 * ... and p(jessica | Male) = 9,001/10,000 * 2/10,000 * ... The eye-color factors are almost the same! However, this assumes that we have good enough estimates of the probabilities, so the more data the better.

56 An obvious point: I have used a simple two-class problem, with two possible values for each feature, in my previous examples. However, we can have an arbitrary number of classes, or feature values.
Animal vs. Mass > 10kg: Cat: Yes 0.15, No 0.85; Dog: Yes 0.91, No 0.09; Pig: Yes 0.99, No 0.01.
Animal vs. Color: Cat: Black 0.33, White 0.23, Brown 0.44; Dog: Black 0.97, White 0.03, Brown 0.90; Pig: Black 0.04, White 0.01.
(A third table holds the class priors p(Cat), p(Dog), p(Pig); its values did not survive transcription.)

57 Problem! Naïve Bayes assumes independence of features, so p(d | c_j) factors as p(d_1 | c_j) * p(d_2 | c_j) * ... * p(d_n | c_j).
Sex vs. Over 6 foot: Male: Yes 0.15, No 0.85; Female: Yes 0.01, No 0.99.
Sex vs. Over 200 pounds: Male: Yes 0.11, No 0.80; Female: Yes 0.05, No 0.95.
But height and weight are clearly not independent.

58 Solution: consider the relationships between attributes.
Sex vs. Over 6 foot: Male: Yes 0.15, No 0.85; Female: Yes 0.01, No 0.99.
Sex vs. Over 200 pounds, conditioned on height: Male: Yes and over 6 foot 0.11; No and over 6 foot 0.59; Yes and NOT over 6 foot 0.05; No and NOT over 6 foot 0.35.

59 Solution: consider the relationships between attributes. But how do we find the set of connecting arcs?

60 The naïve Bayesian classifier has a piecewise quadratic decision boundary. [Figure: decision regions for Katydids, Grasshoppers, and Ants. Adapted from a slide by Ricardo Gutierrez-Osuna.]

61 Which of the Pigeon Problems can be solved by a decision tree?

62 Advantages/Disadvantages of Naïve Bayes. Advantages: fast to train (a single scan); fast to classify; not sensitive to irrelevant features; handles real and discrete data; handles streaming data well. Disadvantages: assumes independence of features.

63 Summary. We have seen the four most common algorithms used for classification. We have seen that there is no one best algorithm. We have seen that issues like normalizing, cleaning, and converting the data can make a huge difference. We have only scratched the surface! How do we learn with no class labels? (clustering) How do we learn with expensive class labels? (active learning) How do we spot outliers? (anomaly detection) How do we ...? Popular science book: The Master Algorithm by Pedro Domingos. Textbook: Data Mining by Charu C. Aggarwal.

64 [image-only slide]

65 Malaria. Malaria afflicts about 4% of all humans, killing one million of them each year.

66 [image-only slide]

67 Malaria Deaths (2003)

68 There are interventions to mitigate the problem. A recent meta-review of randomized controlled trials of Insecticide Treated Nets (ITNs) found that ITNs can reduce malaria-related deaths in children by one fifth, and episodes of malaria by half. Mosquito nets work!

69 How do we know where to do the interventions, given that we have finite resources?

70 One second of audio from our sensor. The Common Eastern Bumble Bee (Bombus impatiens) takes about one tenth of a second to pass the laser. [Waveform: background noise; bee begins to cross the laser; bee has passed through the laser.]

71 [image-only slide]

72 One second of audio from the laser sensor; only Bombus impatiens (Common Eastern Bumble Bee) is in the insectary. [Figure: the waveform, showing background noise and the bee beginning to cross the laser, and the single-sided amplitude spectrum of Y(t) versus frequency (Hz), showing an interference spike, a peak at 197 Hz, and its harmonics.]

73 [Figure: amplitude spectrum Y(f) versus frequency (Hz).]

74 [Figure: amplitude spectrum Y(f) versus frequency (Hz).]

75 [Figure: histogram of wing beat frequency (Hz).]

76 Anopheles stephensi is a primary mosquito vector of malaria. The yellow fever mosquito (Aedes aegypti) can spread the dengue fever, chikungunya, and yellow fever viruses. [Figure: wing beat frequency (Hz) distributions for the two species.]

77 Anopheles stephensi: female mean = 475, std = 30. Aedes aegypti: female mean = 567, std = 43. If I see an insect with a wingbeat frequency of 500, what is it? P(wingbeat = 500 | Anopheles) = (1 / (sqrt(2*pi) * 30)) * e^(-(500 - 475)^2 / (2 * 30^2)).
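
The Gaussian can be evaluated directly. A small sketch using the stated means and standard deviations:

```python
import math

def density(x, mean, std):
    """Normal class-conditional density p(wingbeat = x | species)."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

p_anopheles = density(500, 475, 30)  # ~0.0094
p_aedes     = density(500, 567, 43)  # ~0.0028
print("Anopheles stephensi" if p_anopheles > p_aedes else "Aedes aegypti")
```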

78 What is the error rate? 12.2% of the area under the pink curve; 8.02% of the area under the red curve. Can we get more features?

79 Circadian Features. Aedes aegypti (yellow fever mosquito). [Figure: activity level from midnight to midnight, with dawn and dusk marked.]

80 Suppose I observe an insect with a wingbeat frequency of 420 Hz. What is it?

81 Suppose I observe an insect with a wingbeat frequency of 420 Hz at 11:00am. What is it? [Figure: circadian activity from midnight to midnight.]

82 Suppose I observe an insect with a wingbeat frequency of 420 Hz at 11:00am. What is it? P(Culex | [420Hz, 11:00am]) = (6 / (6 + 6 + 0)) * (2 / (2 + 4 + 3)) = 0.111. P(Anopheles | [420Hz, 11:00am]) = (6 / (6 + 6 + 0)) * (4 / (2 + 4 + 3)) = 0.222. P(Aedes | [420Hz, 11:00am]) = (0 / (6 + 6 + 0)) * (3 / (2 + 4 + 3)) = 0.000.
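
The same computation as a sketch, with the histogram counts from the slide. Each factor is normalized by the total count of insects observed at that feature value, matching the pattern of slides 41-43:

```python
wingbeat_420 = {"Culex": 6, "Anopheles": 6, "Aedes": 0}
time_11am    = {"Culex": 2, "Anopheles": 4, "Aedes": 3}

wb_total = sum(wingbeat_420.values())  # 12
t_total = sum(time_11am.values())      # 9
for species in wingbeat_420:
    score = (wingbeat_420[species] / wb_total) * (time_11am[species] / t_total)
    print(species, round(score, 3))
# Culex 0.111, Anopheles 0.222, Aedes 0.0 -> classify as Anopheles
```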

83 Blue Sky Ideas. Once you have a classifier working, you begin to see new uses for it. Let us see some examples.

84 Capturing or killing individually targeted insects. Most efforts to capture or kill insects are shotgun approaches: many non-targeted insects (including beneficial ones) are killed or captured. In some cases the ratios are 1,000 to 1 (i.e., 1,000 non-targeted insects are affected for each one that was targeted). We believe our sensors allow an ultra-precise approach, with a ratio approaching 1 to 1. This has obvious implications for SIT/metagenomics.

85 Kill. It seems obvious you could kill a mosquito with a powerful enough laser and with enough time, but we need to do it fast, with as little power as possible. We have gotten this down to 1/20th of a second, and just 1 watt (and falling). The mosquitoes may survive the laser strike, but they cannot fly away (as was the case in the photo shown at right). We are building a SIT "Hotel California" for female mosquitoes (you can check out anytime you like, but you can never leave). [Photos: Culex tarsalis; zoom-in after removing the wing.] Collaboration with UCR mechanical engineers Amir Rose and Dr. Guillermo Aguilar.

86 Capture. We envision building robotic traps that can be left in the field and programmed with different sampling missions. Such traps could be placed and retrieved by drones. Capturing live insects is important if you want to do metagenomics. Some examples of sampling missions: Capture examples of gravid{aedes aegypti}. Capture insects marked{Cripple(left-C right-s)}. Capture examples of insects that are NOT Anopheles AND have a wingbeat frequency > 400 (to exclude bees, etc.). Capture examples of any insects with a wingbeat frequency > 500, encountered between 4:00am and 4:10am. Capture examples of fed{anopheles gambiae} OR fed{anopheles quadriannulatus} OR fed{anopheles melas}.

87 Capture. About 10% of the insects captured by Venus fly traps are flying insects. We believe that we can build inexpensive mechanical traps that can capture sex- and species-targeted insects, using the same example sampling missions as on the previous slide.

88 Classification Problem: Fourth Amendment Cases before the Supreme Court II. The Supreme Court's search and seizure decisions, by term. Keogh vs. State of California = {0,1,1,0,0,0,1,0}. U = Unreasonable, R = Reasonable.

89 We can also learn decision trees for individual Supreme Court members. Using similar decision trees for the other eight justices, these models correctly predicted the majority opinion in 75 percent of the cases, substantially outperforming the experts' 59 percent. [Figure: decision tree for Supreme Court Justice Sandra Day O'Connor.]
