Data Mining with Linear Discriminants Exercise: Business Intelligence (Part 6) Summer Term 2014 Stefan Feuerriegel
Today's Lecture Objectives
1 Recognizing the ideas of artificial neural networks and their use in R
2 Understanding the concept and the usage of support vector machines
3 Being able to evaluate the predictive performance in terms of both metrics and the receiver operating characteristic curve
4 Distinguishing predictive and explanatory power
Outline
1 Recap
2 Linear Discriminants
3 Artificial Neural Networks
4 Support Vector Machines
5 Prediction Performance
6 Wrap-Up
Supervised vs. Unsupervised Learning
Supervised learning
Machine learning task of inferring a function from labeled training data
Training data includes both the input and the desired results: the correct results (target values) are given
Unsupervised learning
Methods try to find hidden structure in unlabeled data
The model is not provided with the correct results during training
No error or reward signal to evaluate a potential solution
Examples: hidden Markov models, dimension reduction (e.g. by principal component analysis), clustering (e.g. by the k-means algorithm), which groups data into classes on the basis of their statistical properties only
Taxonomy of Machine Learning
Machine learning estimates function and parameters in y = f(x, w)
The type of method varies depending on the nature of what is predicted
Regression: the predicted value refers to a real number (continuous y)
Classification: the predicted value refers to a class label (discrete y, e.g. class membership)
Clustering: group points into clusters based on how "near" they are to one another; identify structure in data
(Figure: scatter plots over Feature A and Feature B illustrating regression, classification and clustering)
k-Nearest Neighbor Classification
Input: training examples as vectors in a multidimensional feature space, each with a class label
No training phase to calculate internal parameters
Testing: assign to a class according to the k nearest neighbors; classification as majority vote
Problematic: skewed data, i.e. unequal frequency of classes (what label to assign to the circle in the illustration?)
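The majority-vote rule above fits in a few lines of base R. The following is a minimal sketch; the function and variable names are illustrative and not from the lecture code:

```r
# Minimal k-NN classifier sketch in base R
knn_predict <- function(train, labels, query, k = 3) {
  # Euclidean distance from the query point to every training example
  diffs <- sweep(train, 2, query)
  d <- sqrt(rowSums(diffs^2))
  # Majority vote among the k nearest neighbors
  nearest <- labels[order(d)[1:k]]
  names(which.max(table(nearest)))
}

train  <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
labels <- c("A", "A", "A", "B", "B", "B")
knn_predict(train, labels, query = c(0.5, 0.5), k = 3)  # "A"
```

For real data sets, the class package (shipped with standard R installations) offers the same idea in its optimized knn() function.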
Decision Trees
Flowchart-like structure in which nodes represent tests on attributes
End nodes (leaves) of each branch represent class labels
Example: decision tree for playing tennis
Outlook = Sunny: test Humidity (High: No; Normal: Yes)
Outlook = Overcast: Yes
Outlook = Rain: test Wind (Strong: No; Weak: Yes)
Decision Trees
Issues
How deep to grow?
How to handle continuous attributes?
How to choose an appropriate attribute selection measure?
How to handle data with missing attribute values?
Advantages
Simple to understand and interpret
Requires only few observations
Worst, best and expected values can be determined for different scenarios
Disadvantages
The Information Gain criterion is biased in favor of attributes with more levels
Calculations become complex if values are uncertain and/or outcomes are linked
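As a quick illustration (not part of the original slides), such a tree can be grown in R with the rpart package, a recommended package included with standard R installations; the built-in iris data serves as toy input:

```r
library(rpart)  # recommended package, ships with standard R installations

# Grow a classification tree on the built-in iris data
fit <- rpart(Species ~ ., data = iris, method = "class")

# Inspect the learned tests on attributes
print(fit)

# Training accuracy of the tree
pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)
```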
k-means Clustering
Partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster
Finding an exact solution is computationally expensive; instead, we use efficient heuristics
Default: Euclidean distance as metric and variance as a measure of cluster scatter
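The heuristic is available as the built-in stats::kmeans. A minimal sketch on toy data (not from the slides):

```r
set.seed(0)
# Two well-separated point clouds in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

# Heuristic k-means with k = 2 clusters
km <- kmeans(x, centers = 2)

km$centers        # one prototype (cluster mean) per cluster
table(km$cluster) # cluster sizes
```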
Classification Problems
General classification problem
Goal: take a new input x and assign it to one of K classes C_k
Given training set X = [x_1 ... x_n]^T with target values T = [t_1, ..., t_n]^T
Number of dimensions D, i.e. x_i ∈ R^D
Learn a discriminant function y(x) to perform the classification
2-class problem with binary target values t_i ∈ {0, 1}: decide for class C_1 if y(x) > 0, else for class C_2
K-class problem with 1-of-K coding scheme, i.e. t_i = [0, 1, 0, 0, 0]^T
Linear Separability
If a data set can be perfectly classified by a linear discriminant, then we call it linearly separable
(Figure: two scatter plots over Feature A and Feature B, one linearly separable, one not linearly separable)
Linear Discriminant Functions
Decision boundary given by y(x) = w^T x + b = 0 defines a hyperplane
Classes labeled according to sign(w^T x + b)
Normal vector w and offset b
(Figure: hyperplane y(x) = 0 in the x_1-x_2 plane with normal vector w, separating the regions y(x) > 0 and y(x) < 0)
Learning Discriminant Functions
Linear discriminant functions are given by
y(x) = w^T x + b = Σ_{i=1}^{D} w_i x_i + b = Σ_{i=0}^{D} w_i x_i with x_0 = 1 and w_0 = b
Weight vector w
"Bias" b, i.e. threshold
Goal: choose w and b such that the stacked residuals w^T x_i + b − t_i for i = 1, ..., n are minimal, e.g. in the least-squares sense
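For the least-squares choice, the weights have a closed form via the normal equations. A base-R sketch on toy data (all names are illustrative; targets coded as ±1):

```r
set.seed(1)
# Toy 2-class data in two dimensions, targets tgt in {-1, +1}
X   <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 3), ncol = 2))
tgt <- c(rep(-1, 20), rep(1, 20))

# Augment with x0 = 1 so that w0 takes the role of the bias b
Xa <- cbind(1, X)

# Normal equations: w = (Xa^T Xa)^{-1} Xa^T tgt
w <- solve(t(Xa) %*% Xa, t(Xa) %*% tgt)

# Classify by the sign of y(x) = w^T x
pred <- sign(Xa %*% w)
mean(pred == tgt)  # training accuracy on this separable toy data
```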
Choosing a Discriminant Function
Solving for the weights by least-squares has drawbacks:
Least-squares is very sensitive to outliers
The error function penalizes predictions that are "too correct"
It works only for linearly separable problems
Least-squares assumes a Gaussian distribution
(Figure: two scatter plots showing how a few outliers shift the least-squares decision boundary)
Alternative solutions (e.g. in blue): generalized linear models (→ neural networks), support vector machines, etc.
From Leibe (2010).
Generalized Linear Model
Linear model: y(x) = w^T x + b
Generalized linear model with activation function A: y(x) = A(w^T x + b)
Other than with least-squares, the choice of the activation function should limit the influence of outliers, e.g. by using a threshold function as A
(Figure: plot of a threshold-like activation function A(z) for z ∈ [−10, 10])
Relationship to Neural Networks
In the 2-class case:
y(x) = A(Σ_{i=0}^{D} w_i x_i) with x_0 = 1
This is a single-layer perceptron: inputs x_0 = 1, x_1, ..., x_D, weights w_0, ..., w_D, one thresholded output y(x)
In the multi-class case:
y_k(x) = A(Σ_{i=0}^{D} w_{k,i} x_i) with x_0 = 1
This is a multi-class perceptron: shared inputs, weights w_{k,i}, thresholded outputs y_1(x), ..., y_K(x)
Artificial Neural Networks
Artificial neural networks (ANN) are computational models to compute f : X → Y, inspired by the central nervous system
Compute f by feeding information through the network
Represented as a system of connected neurons
ANNs are universal approximators among continuous functions (under certain mild assumptions)
(Figure: network with input neurons x_1, ..., x_N, hidden neurons z_1, ..., z_M and output neurons y_1, y_2)
Layers in Neural Networks
Neurons are arranged in three (or more) layers
First layer: input neurons receive the input vector x ∈ X
Hidden layer(s): connect input and output neurons
Final layer: output neurons compute a response ỹ ∈ Y
When neurons are connected as a directed graph without cycles, this is called a feed-forward ANN
Feeding Information through Neural Networks
The input z_j of each hidden neuron j = 1, ..., M is a weighted sum of all previous neurons, calculated as
z_j = A(w_{0,j} + Σ_{i=1}^{N} w_{i,j} x_i) = A(w_{0,j} + w_j^T x)
where
x_i are the values from the input layer
w_{i,j} for i = 1, ..., N and j = 1, ..., M are suitable coefficients
The predefined non-linear function A is referred to as the activation function
Frequent choice: the logistic function A(z) = 1 / (1 + e^{−z})
The coefficients w_{i,j} are learned from e.g. a back-propagation algorithm
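A single forward step of this weighted sum can be written out directly in base R; the weights below are made up purely for illustration:

```r
# Logistic activation function
A <- function(z) 1 / (1 + exp(-z))

# One hidden layer with M = 2 neurons and N = 3 inputs
x  <- c(0.5, -1.0, 2.0)                 # input vector
W  <- matrix(c( 0.2, -0.4,  0.1,
                0.7,  0.3, -0.5),
             nrow = 2, byrow = TRUE)    # weights w_{i,j}, one row per neuron
w0 <- c(0.1, -0.2)                      # biases w_{0,j}

# z_j = A(w_{0,j} + w_j^T x), computed for all j at once
z <- A(w0 + as.vector(W %*% x))
z
```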
Logistic Function
(Figure: plot of the logistic function A(z) for z ∈ [−10, 10])
It resembles a threshold function:
A(z) ≈ 0 for z ≪ 0 and A(z) ≈ 1 for z ≫ 0
Neural Networks in R
Loading the required library nnet
library(nnet)
Accessing credit scores
library(caret)
data(GermanCredit)
Split data into an index subset for training (20%) and testing (80%) instances
inTrain <- runif(nrow(GermanCredit)) < 0.2
Neural Networks in R
Train a neural network with n nodes in the hidden layer via nnet(formula, data=d, size=n, ...)
nn <- nnet(Class ~ ., data=GermanCredit[inTrain,], size=15, maxit=100, rang=0.1, decay=5e-4)
## # weights: 946
## initial value 139.233471
## iter 10 value 120.009852
## iter 20 value 111.986450
## iter 30 value 92.182560
## iter 40 value 88.672309
## iter 50 value 85.914152
## iter 60 value 85.220372
## iter 70 value 85.121310
## iter 80 value 85.061444
## iter 90 value 84.634114
## iter 100 value 81.262052
## final value 81.262052
## stopped after 100 iterations
Neural Networks in R
Predict credit scores via predict(nn, test, type="class")
pred <- predict(nn, GermanCredit[!inTrain,], type="class")
Confusion matrix via table(pred=pred_classes, true=true_classes)
cm <- table(pred=pred, true=GermanCredit$Class[!inTrain])
Calculate accuracy
sum(diag(cm))/sum(cm)
## [1] 0.7387
Support Vector Machine (SVM)
Which of these linear separators is optimal?
Idea: maximize the separating margin (here: A)
Data points on the margin are called support vectors
When calculating the decision boundary, only the support vectors matter; the other training data is ignored
Formulation as a convex optimization problem with a global solution
(Figure: two candidate separating lines A and B between the classes)
SVM in R
Loading the required library e1071
library(e1071)
Accessing credit scores
library(caret)
data(GermanCredit)
Split data into an index subset for training (20%) and testing (80%) instances
inTrain <- runif(nrow(GermanCredit)) < 0.2
SVM in R
Train a support vector machine for classification via svm(formula, data=d, type="C-classification")
model <- svm(Class ~ ., data=GermanCredit[inTrain,], type="C-classification")
Predict credit scores for the testing instances test via predict(svm, test)
pred <- predict(model, GermanCredit[!inTrain, ])
head(cbind(pred, GermanCredit$Class[!inTrain]))
## pred
## 2 2 1
## 3 2 2
## 4 2 2
## 5 2 1
## 6 2 2
## 7 2 2
The first column gives the predicted outcomes, the second the actual (true) values
SVM in R
Confusion matrix via table(pred=pred_classes, true=true_classes)
cm <- table(pred=pred, true=GermanCredit$Class[!inTrain])
cm
## true
## pred Bad Good
## Bad 57 1
## Good 243 698
Calculate accuracy
sum(diag(cm))/sum(cm)
## [1] 0.7558
Assessment of Models
1 Predictive performance (measured by accuracy, recall, F1, ROC, ...)
2 Computation time for both model building and predicting
3 Robustness to noise in predictor values
4 Scalability
5 Interpretability: transparency, ease of understanding
Number of tuning parameters per model: Linear Regression 0, ANN 1–2, SVM/SVR 1–3, k-NN 1, Single Tree 1, Random Forest 0–1, Boosted Trees 3 (the original table additionally rates each model on whether it allows n < k, interpretability, robustness to predictor noise and computation time)
From Kuhn & Johnson (2013), Applied Predictive Modeling.
Accuracy vs. Precision
Accuracy: closeness to the actual (true) value
Precision: similarity of repeated measurements under unchanged conditions
(Figure: two target diagrams, one with high accuracy but low precision, one with high precision but low accuracy)
Confusion Matrix
The confusion matrix (also named contingency table or error matrix) displays the predictive performance
Condition (as determined by gold standard): True | False
Positive outcome: True Positive (TP) | False Positive (FP), Type I error / false alarm
Negative outcome: False Negative (FN), Type II error / miss | True Negative (TN)
Precision (positive predictive value) = TP / (TP + FP)
Sensitivity = TP rate = TP / (TP + FN); equivalent to hit rate and recall
Specificity = TN rate = TN / (FP + TN)
Accuracy = (TP + TN) / Total
Confusion Matrix
Example: blood probe to test for cancer
Positive blood test outcome: TP, cancer correctly diagnosed | FP, healthy person diagnosed with cancer
Negative blood test outcome: FN, cancer not diagnosed | TN, healthy person diagnosed as healthy
Precision = TP / (TP + FP), Sensitivity = TP / (TP + FN), Specificity = TN / (FP + TN), Accuracy = (TP + TN) / Total
Different loss functions: a redundant €1000 check in the FP case, compared to a lethal outcome in the FN case
Assessing Prediction Performance
Imagine the following confusion matrix with an accuracy of 65%:
TP = 60, FP = 5, FN = 30, TN = 5
Precision = TP / (TP + FP) ≈ 0.92
Sensitivity = TP / (TP + FN) ≈ 0.67
Specificity = TN / (FP + TN) = 0.50
Accuracy = (TP + TN) / Total = 0.65
Is this a "good" result? No: because the data is unevenly distributed (90 of the 100 cases are positive), a model which always guesses positive would already score an accuracy of 90%
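These figures are easy to verify in R (a quick sketch, not part of the original slides):

```r
# Confusion-matrix entries from the example above
TP <- 60; FP <- 5; FN <- 30; TN <- 5
total <- TP + FP + FN + TN

precision   <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)
specificity <- TN / (FP + TN)
accuracy    <- (TP + TN) / total

round(c(precision = precision, sensitivity = sensitivity,
        specificity = specificity, accuracy = accuracy), 2)
#   precision sensitivity specificity    accuracy
#        0.92        0.67        0.50        0.65
```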
Prediction Performance in R
Confusion matrix in variable cm
## True False
## Pos 30 10
## Neg 20 40
Calculating accuracy
(cm[1,1]+cm[2,2])/(cm[1,1]+cm[1,2]+cm[2,1]+cm[2,2])
## [1] 0.7
# Alternative that also works for multi-class data
sum(diag(cm))/sum(cm)
## [1] 0.7
Calculating precision
cm[1,1]/(cm[1,1] + cm[1,2])
## [1] 0.75
Trade-Off: Sensitivity vs. Specificity/Precision
Performance goals frequently place more emphasis on either sensitivity or specificity/precision
Example: airport scanners are triggered by low-risk items like belts (low precision), but reduce the risk of missing risky objects (high sensitivity)
Trade-off: the F1 score is the harmonic mean of precision and sensitivity
F1 = 2TP / (2TP + FP + FN)
Visualized by the receiver operating characteristic (ROC) curve
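Applied to the cancer example above (TP = 60, FP = 5, FN = 30), the F1 score works out as follows (a sketch, not from the slides):

```r
# F1 score from the confusion-matrix entries
TP <- 60; FP <- 5; FN <- 30

f1 <- 2 * TP / (2 * TP + FP + FN)
round(f1, 3)  # 0.774

# Same value via the harmonic mean of precision and sensitivity
precision   <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)
2 / (1 / precision + 1 / sensitivity)
```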
Receiver Operating Characteristic (ROC)
The ROC curve illustrates the performance of a binary classifier as its discrimination threshold on y(x) is varied
Interpretation:
Curve A is random guessing (50% correct guesses)
The curve from model B performs better than A, but worse than C
Curve C comes from perfect prediction
The area south-east of the curve is named the area under the curve (AUC) and should be maximized
(Figure: sensitivity plotted against 1 − specificity with curves A, B and C)
ROC in R
Load the required library pROC
library(pROC)
Perform the prediction and retrieve the decision values dv
pred <- predict(model, GermanCredit[!inTrain,], decision.values=TRUE)
dv <- attributes(pred)$decision.values
ROC in R
Plot the ROC curve via plot.roc(classes, dv)
plot.roc(as.numeric(GermanCredit$Class[!inTrain]), dv, xlab = "1 - Specificity")
(Figure: resulting ROC curve)
##
## Call:
## plot.roc.default(x = as.numeric(GermanCredit$Class[!inTrain]), predictor = dv, xlab =
##
## Data: dv in 300 controls (as.numeric(GermanCredit$Class[!inTrain]) 1) > 699 cases (as.num
## Area under the curve: 0.692
Predictive vs. Explanatory Power
There is a significant difference between predicting and explaining:
1 Empirical models for prediction
Empirical predictive models (e.g. statistical models, methods from data mining) are designed to predict new/future observations
Predictive analytics describes the evaluation of the predictive power, such as accuracy or precision
2 Empirical models for explanation
Any type of statistical model used for testing a causal hypothesis
Use methods for evaluating the explanatory power, such as statistical tests or measures like R²
Predictive vs. Explanatory Power
(Figure: fitted models for N = 15 and N = 100 observations)
Explanatory power does not imply predictive power
Red is the best explanatory model; gray the best predictive
In particular, dummies do not translate well to predictive models
Do not write something like "the regression proves the predictive power of regressor x_i"
Overfitting
When the learning algorithm is run for too long, the learner may adjust to very specific random features not related to the target function
Overfitting: the performance on the training data (in gray) still increases, while the performance on unseen data (in red) becomes worse
(Figure: fits of increasing flexibility and the corresponding mean squared error on training vs. unseen data)
Overfitting in R
Given data x to predict y
plot(x, y, pch = 20)
(Figure: scatter plot of x against y)
Estimate polynomials of order P = 1, ..., 6 and compare the errors on training and testing data
inTrain <- runif(length(x)) < 0.2
d.train <- data.frame(x = x[inTrain], y = y[inTrain])
d.test <- data.frame(x = x[!inTrain], y = y[!inTrain])
err.fit <- c()
err.pred <- c()
for (i in 1:6) {
  # Explanatory model
  m <- lm(y ~ poly(x, i, raw=TRUE), data=d.train)
  err.fit[i] <- sqrt(mean(m$residuals^2))
  # Predictive model
  err <- predict(m, newdata=d.test) - d.test$y
  err.pred[i] <- sqrt(mean(err^2))
}
Overfitting in R
The performance on the training data (in gray) still increases, while the performance on unseen data (in red) becomes worse
plot(1:6, err.fit, type='l', ylim=c(3.5, 31), col="gray",
     xlab="Order of polynomial", ylab="Error on training/testing data")
lines(1:6, err.pred, col="firebrick3")
(Figure: training error falling with polynomial order while the testing error rises)
Support Vector Regression in R
Support vector regression (SVR) predicts continuous values
Accessing weekly stock market returns for 21 years
library(ISLR)
data(Weekly)
inTrain <- runif(nrow(Weekly)) < 0.2
When training, use type="eps-regression" instead
model <- svm(Today ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data=Weekly[inTrain,], type="eps-regression")
Compute the root mean square error (RMSE)
pred <- predict(model, Weekly[!inTrain, ])
sqrt(mean((pred - Weekly$Today[!inTrain])^2))
given by RMSE = sqrt((1/N) Σ_{i=1}^{N} (pred_i − true_i)²)
and compare it to the standard deviation σ = 2.3536 of the returns (N in the denominator, not N − 1)
Summary: Data Mining with Linear Discriminants
Linearly separable: data can be perfectly classified by a linear discriminant
Artificial neural network: feeding information through the network; nnet(formula, data=d, size=n, ...)
Support vector machine: maximize the separating margin; svm(formula, data=d, type=...)
Confusion matrix: tabular outline of correct/incorrect guesses
Predictive performance: measured by accuracy, precision, ...
ROC curve: visualizes the trade-off between sensitivity and specificity
Overfitting: performance on training data increases, different from unseen observations