Data Mining Chapter 1. What's it all about?
DM & ML
  Ubiquitous computing environment
    Excessive amount of data (data flooding)
    Gap between the generation of data and the understanding of it
  Looking for structural patterns in data, i.e., intelligently analyzed data
  Data mining
    The process of discovering patterns in data
    Pattern: makes useful predictions on new data
    Structural! (capturing the decision structure)
DM & ML
  Structural patterns
    Table 1.1 Contact Lens Data
      Recommendation: soft, hard, none
      e.g.) If tear production rate = reduced, then recommendation = none.
    Rule: generalizes to rows missing from the table
    No null values (vs. real-life data sets)
Simple examples
  Attributes: the values of features
    Measuring different aspects of the instance
  The weather problem (Table 1.2)
    Attributes, Outcome
    Possible combinations: 3 x 3 x 2 x 2 = 36
    A rule learned from the information
      e.g.) If outlook = overcast then play = yes
  Decision list
    A set of rules interpreted in sequence
  Numeric values in Table 1.3
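The combination count and the idea of a decision list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attribute values are read off Table 1.2, the overcast rule is the one quoted on the slide, and the other rule and the default outcome are hypothetical fillers chosen to show that rules are tried in order.

```python
from itertools import product

# Attribute values of the weather problem, as they appear in Table 1.2.
ATTRIBUTE_VALUES = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": ["true", "false"],
}

# Enumerate every possible instance: 3 x 3 x 2 x 2 combinations.
ALL_INSTANCES = [
    dict(zip(ATTRIBUTE_VALUES, combo))
    for combo in product(*ATTRIBUTE_VALUES.values())
]

# A decision list is an ordered list of (conditions, outcome) pairs.
DECISION_LIST = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),  # hypothetical rule
    ({"outlook": "overcast"}, "yes"),                  # rule from the slide
    ({}, "yes"),                                       # hypothetical default
]


def classify(instance, decision_list=DECISION_LIST):
    """Return the outcome of the first rule whose conditions all hold."""
    for conditions, outcome in decision_list:
        if all(instance[attr] == value for attr, value in conditions.items()):
            return outcome
    return None  # only reached if the list has no default rule


print(len(ALL_INSTANCES))
print(classify({"outlook": "overcast", "temperature": "hot",
                "humidity": "high", "windy": "false"}))
```

Because the rules are interpreted in sequence, a later rule only applies to instances that no earlier rule has already claimed; reordering the list can change the prediction.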
Simple examples
  Table 1.2 The weather data

  Outlook   Temperature  Humidity  Windy  Play
  Sunny     Hot          High      False  No
  Sunny     Hot          High      True   No
  Overcast  Hot          High      False  Yes
  Rainy     Mild         High      False  Yes
  Rainy     Cool         Normal    False  Yes
  Rainy     Cool         Normal    True   No
  Overcast  Cool         Normal    True   Yes
  Sunny     Mild         High      False  No
  Sunny     Cool         Normal    False  Yes
  Rainy     Mild         Normal    False  Yes
  Sunny     Mild         Normal    True   Yes
  Overcast  Mild         High      True   Yes
  Overcast  Hot          Normal    False  Yes
  Rainy     Mild         High      True   No
Simple examples
  Table 1.3 The weather data with some numeric attributes

  Outlook   Temperature  Humidity  Windy  Play
  Sunny     85           85        False  No
  Sunny     80           90        True   No
  Overcast  83           86        False  Yes
  Rainy     70           96        False  Yes
  Rainy     68           80        False  Yes
  Rainy     65           70        True   No
  Overcast  64           65        True   Yes
  Sunny     72           95        False  No
  Sunny     69           70        False  Yes
  Rainy     75           80        False  Yes
  Sunny     75           70        True   Yes
  Overcast  72           90        True   Yes
  Overcast  81           75        False  Yes
  Rainy     71           91        True   No
Simple examples
  Classification rules vs. association rules
    Association rules: strongly associate different attribute values
      e.g.) IF humidity = normal and windy = false, THEN play = yes
            IF outlook = sunny and play = no, THEN humidity = high
    Can predict any of the attributes
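The two association rules above can be checked against the weather data of Table 1.2. The sketch below is a minimal illustration, not a rule-mining algorithm: the helper name `evaluate_rule` and the lower-cased string encoding of the table are assumptions made for this example. It reports how many rows satisfy both sides of a rule and the fraction of covered rows for which the rule is correct.

```python
# The 14 rows of Table 1.2, encoded as lower-case strings.
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]


def matches(row, conditions):
    """True if the row has the required value for every listed attribute."""
    return all(row[attr] == value for attr, value in conditions.items())


def evaluate_rule(data, antecedent, consequent):
    """Return (rows where the rule holds, fraction of covered rows that are correct)."""
    covered = [row for row in data if matches(row, antecedent)]
    correct = [row for row in covered if matches(row, consequent)]
    accuracy = len(correct) / len(covered) if covered else 0.0
    return len(correct), accuracy


# IF humidity = normal and windy = false, THEN play = yes
print(evaluate_rule(DATA, {"humidity": "normal", "windy": "false"}, {"play": "yes"}))
# IF outlook = sunny and play = no, THEN humidity = high
print(evaluate_rule(DATA, {"outlook": "sunny", "play": "no"}, {"humidity": "high"}))
```

Note that the second rule predicts humidity rather than play, which is what distinguishes an association rule from a classification rule: any attribute may appear on the right-hand side.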
Simple examples
  More examples
    Contact lenses, Weather problem
    Irises: a classic numeric dataset
      Attributes: numeric
      Outcome: category
    Computer configurations
      Outcome: CPU performance
      Regression (numeric prediction)
    Labor negotiations (realistic)
      Outcome: whether the contract is acceptable or not (to both labor & management)
      ?: unknown or missing values
Simple examples
  Soybean classification
    35 attributes, 19 disease categories
    Domain knowledge
      IF leaf condition = normal and,
      IF leaf malformation = absent and,
      THEN diagnosis is rhizoctonia-root-rot
    The computer-generated rules outperformed the expert-derived rules (97.5% vs. 72%).
Fielded applications
  Web mining
    How to rank web pages
  Decisions involving judgment
    Whether to lend you money
    1,000 training examples of borderline cases
    20 attributes: age, years with current employer, ...
    Solution: reject all borderline cases?
      No! Borderline cases are the most active customers
    Learned rules: correct on 70% of cases, whereas human experts were correct on only 50%
    Improved the success rate of the loan decisions and explained the reasons behind each decision
Fielded applications
  Screening images
    Detecting oil slicks
      Oil slicks appear as dark regions of changing size and shape
      Few training examples
      Expensive process requiring highly trained personnel
    Input: a set of raw pixel images from a radar satellite
    Output: a set of images with putative oil slicks
    Attributes: size of region, shape, area, intensity, ...
Fielded applications
  Load forecasting
    An automated load forecasting assistant that determines future demand for power, for a utility supplier in the electricity industry
    Given: manually constructed load model that assumes normal climatic conditions
    Problem: adjust for weather conditions
    Attributes: temperature, humidity, wind speed, ...
    Collected 15 years of data
    Far quicker (seconds) than trained human forecasters (hours)
Fielded applications
  Diagnosis
    Principal application area of expert systems
    Preventative maintenance of electromechanical devices (e.g. 600 faults)
    Learned rules were slightly superior to the handcrafted ones.
    The system was put into use because the domain expert approved of the rules.
Fielded applications
  Marketing and sales
    Customer loyalty: identifying customers who are likely to defect by detecting changes in their behavior (e.g. banks/phone companies)
    Special offers: identifying profitable customers (e.g. reliable owners of credit cards who need extra money during the holiday season)
    Market basket analysis: finding groups of items that tend to occur together in transactions (supermarket checkout data)
  Manufacturing, customer support & service, scientific applications, monitoring, etc.
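Market basket analysis, mentioned above, comes down to counting which items occur together in the same transaction. The following sketch shows the simplest version of that idea for pairs of items. The transactions, the function name `frequent_pairs`, and the `min_support` threshold are all hypothetical; real checkout data would be far larger and would call for a more efficient algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical checkout transactions, one set of items per basket.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
]


def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives each pair a single canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


print(frequent_pairs(TRANSACTIONS, min_support=3))
print(frequent_pairs(TRANSACTIONS, min_support=2))
```

Raising `min_support` keeps only the pairs that co-occur most often, which is the usual way to keep the result small enough to act on.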
The data mining process
http://cis.catholic.ac.kr/sunoh