Data Mining with Linear Discriminants. Exercise: Business Intelligence (Part 6) Summer Term 2014 Stefan Feuerriegel


Today's Lecture Objectives
1 Recognizing the ideas of artificial neural networks and their use in R
2 Understanding the concept and the usage of support vector machines
3 Being able to evaluate the predictive performance in terms of both metrics and the receiver operating characteristic curve
4 Distinguishing predictive and explanatory power

Outline
1 Recap
2 Linear Discriminants
3 Artificial Neural Networks
4 Support Vector Machines
5 Prediction Performance
6 Wrap-Up


Supervised vs. Unsupervised Learning
Supervised learning: the machine learning task of inferring a function from labeled training data. The training data includes both the input and the desired results, i.e. the correct results (target values) are given.
Unsupervised learning: methods try to find hidden structure in unlabeled data. The model is not provided with the correct results during training; there is no error or reward signal to evaluate a potential solution.
Examples: hidden Markov models, dimension reduction (e.g. by principal component analysis), clustering (e.g. by the k-means algorithm), which groups points into classes on the basis of their statistical properties only.

Taxonomy of Machine Learning
Machine learning estimates function and parameters in y = f(x, w). The type of method varies depending on the nature of what is predicted:
Regression: the predicted value refers to a real number (continuous y)
Classification: the predicted value refers to a class label (discrete y, e.g. class membership)
Clustering: group points into clusters based on how "near" they are to one another; identify structure in data
(Slide figures: scatter plots over features A and B illustrating the tasks.)

k-Nearest Neighbor Classification
Input: training examples as vectors in a multidimensional feature space, each with a class label
No training phase to calculate internal parameters
Testing: assign to a class according to the k nearest neighbors; classification as majority vote
Problematic: skewed data, i.e. unequal frequency of classes (what label to assign to the circle in the slide's example?)
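The majority vote described above can be sketched in R with the knn() function from the class package. This is illustrative only and not part of the exercise; the point coordinates and the choice k = 5 are made up.

```r
# Sketch: k-nearest-neighbor classification on synthetic 2-D data
# (illustrative only; coordinates and k are made up)
library(class)  # recommended package shipped with R; provides knn()

set.seed(1)
train <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # class "A" cluster
               matrix(rnorm(40, mean = 3), ncol = 2))   # class "B" cluster
labels <- factor(rep(c("A", "B"), each = 20))

test <- matrix(c(0.1, 0.2,   # close to the "A" cluster
                 3.1, 2.9),  # close to the "B" cluster
               ncol = 2, byrow = TRUE)

# Majority vote among the k = 5 nearest training points
pred <- knn(train, test, cl = labels, k = 5)
pred
```

Note that there is no fitted model object: knn() takes the raw training data at prediction time, matching the "no training phase" property on the slide.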

Decision Trees
Flowchart-like structure in which nodes represent tests on attributes; end nodes (leaves) of each branch represent class labels
Example: decision tree for playing tennis. The root tests Outlook: for Sunny, a Humidity node follows (High → No, Normal → Yes); for Overcast, the answer is Yes; for Rain, a Wind node follows (Strong → No, Weak → Yes).

Decision Trees
Issues: How deep to grow? How to handle continuous attributes? How to choose an appropriate attribute selection measure? How to handle data with missing attribute values?
Advantages: simple to understand and interpret; requires only few observations; worst, best and expected values can be determined for different scenarios
Disadvantages: the information gain criterion is biased in favor of attributes with more levels; calculations become complex if values are uncertain and/or outcomes are linked
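The tennis tree from the slide can be reproduced as a sketch with the rpart package. rpart is not used in this exercise; the toy data set below is typed in by hand from the classic play-tennis example, and the control parameters are relaxed so that such a tiny data set can be split at all.

```r
# Sketch: a small classification tree on the play-tennis toy data
# (rpart is an assumption here; it is not part of the exercise itself)
library(rpart)  # recommended package shipped with R

tennis <- data.frame(
  Outlook  = c("Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
               "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"),
  Humidity = c("High","High","High","High","Normal","Normal","Normal",
               "High","Normal","Normal","Normal","High","Normal","High"),
  Wind     = c("Weak","Strong","Weak","Weak","Weak","Strong","Strong",
               "Weak","Weak","Weak","Strong","Strong","Weak","Strong"),
  Play     = c("No","No","Yes","Yes","Yes","No","Yes",
               "No","Yes","Yes","Yes","Yes","Yes","No")
)

# minsplit/cp relaxed so the 14-row data set can be split at all
fit <- rpart(Play ~ ., data = tennis, method = "class",
             control = rpart.control(minsplit = 2, cp = 0))
print(fit)  # text rendering of the flowchart-like structure
```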

k-means Clustering
Partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster
Computationally expensive; instead, we use efficient heuristics
Default: Euclidean distance as metric and variance as a measure of cluster scatter
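A minimal sketch of this with the base-R kmeans() function (which implements such an efficient heuristic); the data and the choice k = 2 are made up for illustration.

```r
# Sketch: k-means with base-R kmeans() on synthetic 2-D data
# (coordinates and k = 2 are made up for illustration)
set.seed(1)
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))

km <- kmeans(pts, centers = 2)  # Euclidean distance by default
table(km$cluster)               # cluster sizes
km$centers                      # the two prototypes (cluster means)
```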


Classification Problems
General classification problem: take a new input x and assign it to one of K classes C_k
Given a training set X = [x_1 ... x_n]^T with target values T = [t_1, ..., t_n]^T and number of dimensions D, i.e. x_i ∈ R^D, learn a discriminant function y(x) to perform the classification
2-class problem with binary target values t_i ∈ {0, 1}: decide for class C_1 if y(x) > 0, else for class C_2
K-class problem with 1-of-K coding scheme, i.e. t_i = [0, 1, 0, 0, 0]^T

Linear Separability
If a data set can be perfectly classified by a linear discriminant, then we call it linearly separable
(Slide figures: one scatter plot that is linearly separable, one that is not.)

Linear Discriminant Functions
The decision boundary given by y(x) = w^T x + b = 0 defines a hyperplane
Classes are labeled according to sign(w^T x + b)
Normal vector w and offset b; points with y(x) > 0 lie on one side of the hyperplane, points with y(x) < 0 on the other

Learning Discriminant Functions
Linear discriminant functions are given by
y(x) = w^T x + b = Σ_{i=1}^{D} w_i x_i + b = Σ_{i=0}^{D} w_i x_i with x_0 = 1
Weight vector w; "bias" b, i.e. threshold
Goal: choose w and b (or the combined weight vector including b, respectively) such that the residuals
w^T x_1 + b − t_1, ..., w^T x_n + b − t_n
become minimal, e.g. in the least-squares sense
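The least-squares choice of w and b can be sketched with lm() on ±1-coded targets; lm() minimizes exactly the squared sum of the residuals above. The data here are synthetic and the coding is an illustration, not part of the exercise.

```r
# Sketch: fitting w and b by least squares, as in the minimization above
# (targets coded as +1/-1; data are synthetic)
set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 2), ncol = 2))
t <- rep(c(-1, 1), each = 20)

# lm() minimizes the squared residuals (w^T x_i + b - t_i)^2
fit <- lm(t ~ X)
b <- coef(fit)[1]   # offset b
w <- coef(fit)[-1]  # weight vector w

# Classify by the sign of y(x) = w^T x + b
pred <- ifelse(X %*% w + b > 0, 1, -1)
mean(pred == t)  # training accuracy
```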

Choosing the Discriminant Function
Solving for the weights by least squares has drawbacks:
Least squares is very sensitive to outliers
The error function penalizes predictions that are "too correct"
It works only for linearly separable problems
Least squares assumes a Gaussian distribution
(Slide figures, from Leibe (2010): two scatter plots where outliers tilt the least-squares boundary.)
Alternative solutions (e.g. in blue): generalized linear models (→ neural networks), support vector machines, etc.


Generalized Linear Model
Linear model: y(x) = w^T x + b
Generalized linear model with activation function A: y(x) = A(w^T x + b)
Other than least squares, the choice of activation function should limit the influence of outliers, e.g. using a threshold function as A
(Slide figure: plot of a threshold-like activation A(z) over z.)

Relationship to Neural Networks
In the 2-class case:
y(x) = A( Σ_{i=0}^{D} w_i x_i ) with x_0 = 1
→ a single-layer perceptron: inputs x_0 = 1, x_1, ..., x_D; weights w_0, ..., w_D; thresholded output y(x)
In the multi-class case:
y_k(x) = A( Σ_{i=0}^{D} w_{k,i} x_i ) with x_0 = 1
→ a multi-class perceptron: weights w_{k,i} connect the inputs to the thresholded outputs y_1(x), ..., y_K(x)

Artificial Neural Networks
Artificial neural networks (ANN) are computational models to compute f : X → Y, inspired by the central nervous system
Compute f by feeding information through the network, represented as a system of connected neurons
ANNs are universal approximators among continuous functions (under certain mild assumptions)
(Slide figure: network with inputs x_1, ..., x_N, hidden neurons z_1, ..., z_M and outputs y_1, y_2.)

Layers in Neural Networks
Neurons are arranged in three (or more) layers:
First layer: input neurons receive the input vector x ∈ X
Hidden layer(s): connect input and output neurons
Final layer: output neurons compute a response ỹ ∈ Y
When neurons are connected as a directed graph without cycles, this is called a feed-forward ANN

Feeding Information through Neural Networks
The input z_j of each neuron j = 1, ..., M is a weighted sum of all previous neurons, calculated as
z_j = A( w_{0,j} + Σ_{i=1}^{N} w_{i,j} x_i ) = A( w_{0,j} + w_j^T x )
where the x_i are the values from the input layer, with suitable coefficients w_{i,j} for i = 1, ..., N and j = 1, ..., M
The predefined non-linear function A is referred to as the activation function; a frequent choice is the logistic function
A(z) = 1 / (1 + e^{-z})
The coefficients w_{i,j} are learned from e.g. a back-propagation algorithm
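A one-neuron forward pass with the logistic activation can be sketched in a few lines; all weights and input values below are made-up numbers, chosen only to show the formula in action.

```r
# Sketch: one forward pass through a single hidden neuron
# with the logistic activation (all numbers are made up)
A <- function(z) 1 / (1 + exp(-z))   # logistic activation

x  <- c(0.5, -1.2, 2.0)              # input vector (N = 3)
w  <- c(0.1,  0.4, -0.3)             # weights w_{i,j} for one hidden neuron j
w0 <- 0.2                            # bias weight w_{0,j}

z <- A(w0 + sum(w * x))              # z_j = A(w_{0,j} + w_j^T x)
z                                    # a value strictly between 0 and 1
```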

Logistic Function
(Slide figure: plot of the logistic function A(z) over z.)
It resembles a threshold function: A(z) ≈ 0 for z < 0 and A(z) ≈ 1 for z > 0

Neural Networks in R
Loading the required library nnet
library(nnet)
Accessing credit scores
library(caret)
data(GermanCredit)
Split data into an index subset for training (20%) and testing (80%) instances
inTrain <- runif(nrow(GermanCredit)) < 0.2

Neural Networks in R
Train a neural network with n nodes in the hidden layer via nnet(formula, data=d, size=n, ...)
nn <- nnet(Class ~ ., data=GermanCredit[inTrain,], size=15, maxit=100, rang=0.1, decay=5e-4)
## # weights: 946
## initial value 139.233471
## iter 10 value 120.009852
## iter 20 value 111.986450
## iter 30 value 92.182560
## iter 40 value 88.672309
## iter 50 value 85.914152
## iter 60 value 85.220372
## iter 70 value 85.121310
## iter 80 value 85.061444
## iter 90 value 84.634114
## iter 100 value 81.262052
## final value 81.262052
## stopped after 100 iterations

Neural Networks in R
Predict credit scores via predict(nn, test, type="class")
pred <- predict(nn, GermanCredit[-inTrain,], type="class")
Confusion matrix via table(pred=pred_classes, true=true_classes)
cm <- table(pred=pred, true=GermanCredit$Class[-inTrain])
Calculate accuracy
sum(diag(cm))/sum(cm)
## [1] 0.7387


Support Vector Machine (SVM)
Which of these linear separators is optimal?
Idea: maximize the separating margin (in the slide's figure: separator A rather than B)
Data points on the margin are called support vectors
When calculating the decision boundary, only the support vectors matter; the other training data is ignored
Formulation as a convex optimization problem with a global solution

SVM in R
Loading the required library e1071
library(e1071)
Accessing credit scores
library(caret)
data(GermanCredit)
Split data into an index subset for training (20%) and testing (80%) instances
inTrain <- runif(nrow(GermanCredit)) < 0.2

SVM in R
Train a support vector machine for classification via svm(formula, data=d, type="C-classification")
model <- svm(Class ~ ., data=GermanCredit[inTrain,], type="C-classification")
Predict credit scores for the testing instances test via predict(svm, test)
pred <- predict(model, GermanCredit[-inTrain, ])
head(cbind(pred, GermanCredit$Class[-inTrain]))
## pred
## 2 2 1
## 3 2 2
## 4 2 2
## 5 2 1
## 6 2 2
## 7 2 2
The first column gives the predicted outcomes, the second the actual (true) values

SVM in R
Confusion matrix via table(pred=pred_classes, true=true_classes)
cm <- table(pred=pred, true=GermanCredit$Class[-inTrain])
cm
## true
## pred Bad Good
## Bad 57 1
## Good 243 698
Calculate accuracy
sum(diag(cm))/sum(cm)
## [1] 0.7558


Assessment of Models
1 Predictive performance (measured by accuracy, recall, F1, ROC, ...)
2 Computation time for both model building and predicting
3 Robustness to noise in predictor values
4 Scalability
5 Interpretability → transparency, ease of understanding
The slide compares models along these criteria (from Kuhn & Johnson (2013), Applied Predictive Modeling); among others, the number of tuning parameters is 0 for linear regression, 1–2 for an ANN, 1–3 for SVM/SVR, 1 for k-NN, 1 for a single tree, 0–1 for a random forest and 3 for boosted trees.

Accuracy vs. Precision
Accuracy: closeness to the actual (true) value
Precision: similarity of repeated measurements under unchanged conditions
(Slide figures: one target with high accuracy but low precision, one with high precision but low accuracy.)

Confusion Matrix
A confusion matrix (also named contingency table or error matrix) displays the predictive performance:

                     Condition true (gold standard)   Condition false
Positive outcome     True positive (TP)               False positive (FP), type I error / false alarm
Negative outcome     False negative (FN),             True negative (TN)
                     type II error / miss

Derived measures:
Precision or positive predictive value = TP / (TP + FP)
Sensitivity = TP rate = TP / (TP + FN), equivalent with hit rate and recall
Specificity = TN rate = TN / (FP + TN)
Accuracy = (TP + TN) / total

Confusion Matrix
Example: blood probe to test for cancer
Positive blood test outcome: TP = cancer correctly diagnosed; FP = healthy person diagnosed with cancer
Negative blood test outcome: FN = cancer not diagnosed; TN = healthy person diagnosed as healthy
Precision = TP / (TP + FP); Sensitivity = TP rate = TP / (TP + FN); Specificity = TN rate = TN / (FP + TN); Accuracy = (TP + TN) / total
The two error types carry different loss functions: a redundant €1000 check-up in the FP case, compared to a lethal outcome in the FN case

Assessing Prediction Performance
Imagine the following confusion matrix with an accuracy of 65%:
TP = 60, FP = 5, FN = 30, TN = 5
Precision = TP / (TP + FP) ≈ 0.92
Sensitivity = TP / (TP + FN) ≈ 0.67
Specificity = TN / (FP + TN) = 0.50
Accuracy = (TP + TN) / total = 0.65
Is this a "good" result? No: because of the unevenly distributed data (90 of 100 cases are positive), a model which always guesses positive will score an accuracy of 90%

Prediction Performance in R
Confusion matrix in variable cm
## True False
## Pos 30 10
## Neg 20 40
Calculating accuracy
(cm[1,1]+cm[2,2]) / (cm[1,1]+cm[1,2]+cm[2,1]+cm[2,2])
## [1] 0.7
# Alternative that works also for multi-class data
sum(diag(cm))/sum(cm)
## [1] 0.7
Calculating precision
cm[1,1] / (cm[1,1] + cm[1,2])
## [1] 0.75

Trade-Off: Sensitivity vs. Specificity/Precision
Performance goals frequently place more emphasis on either sensitivity or specificity/precision
Example: airport scanners are triggered on low-risk items like belts (low precision), but reduce the risk of missing risky objects (high sensitivity)
Trade-off: the F1 score is the harmonic mean of precision and sensitivity
F1 = 2TP / (2TP + FP + FN)
The trade-off is visualized by the receiver operating characteristic (ROC) curve
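A quick sketch computing the F1 score from the confusion-matrix counts of the earlier cancer example (TP = 60, FP = 5, FN = 30), showing that both formulations of F1 agree:

```r
# Sketch: precision, sensitivity and F1 from confusion-matrix counts
# (counts taken from the cancer example on the earlier slide)
TP <- 60; FP <- 5; FN <- 30

precision   <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)

# F1 as the harmonic mean of precision and sensitivity;
# equivalently F1 = 2*TP / (2*TP + FP + FN)
F1 <- 2 * precision * sensitivity / (precision + sensitivity)
round(c(precision = precision, sensitivity = sensitivity, F1 = F1), 2)
```

Despite the high precision of 0.92, the low sensitivity pulls the harmonic mean down, which is exactly why F1 is preferred over a plain average for imbalanced errors.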

Receiver Operating Characteristic (ROC)
The ROC curve illustrates the performance of a binary classifier as its discrimination threshold on y(x) is varied; it plots sensitivity against 1 − specificity
Interpretation (slide figure with curves A, B, C):
Curve A is random guessing (50% correct guesses)
The curve from model B performs better than A, but worse than C
Curve C corresponds to perfect prediction
The area south-east of the curve is named the area under the curve (AUC) and should be maximized

ROC in R
Load the required library pROC
library(pROC)
Perform the prediction and retrieve the decision values dv
pred <- predict(model, GermanCredit[-inTrain,], decision.values=TRUE)
dv <- attributes(pred)$decision.values

ROC in R
Plot the ROC curve via plot.roc(classes, dv)
plot.roc(as.numeric(GermanCredit$Class[-inTrain]), dv, xlab = "1 - Specificity")
(Slide figure: the resulting ROC curve, sensitivity over 1 − specificity.)
## Call: plot.roc.default(x = as.numeric(GermanCredit$Class[-inTrain]), predictor = dv, xlab = "1 - Specificity")
## Data: dv in 300 controls (as.numeric(GermanCredit$Class[-inTrain]) 1) > 699 cases (as.numeric(GermanCredit$Class[-inTrain]) 2)
## Area under the curve: 0.692

Predictive vs. Explanatory Power
There is a significant difference between predicting and explaining:
1 Empirical models for prediction: empirical predictive models (e.g. statistical models, methods from data mining) designed to predict new/future observations. Predictive analytics describes the evaluation of the predictive power, such as accuracy or precision.
2 Empirical models for explanation: any type of statistical model used for testing a causal hypothesis. Use methods for evaluating the explanatory power, such as statistical tests or measures like R^2.

Predictive vs. Explanatory Power
(Slide figures: model fits for N = 15 and N = 100 observations.)
Explanatory power does not imply predictive power
Red is the best explanatory model; gray the best predictive
In particular, dummies do not translate well to predictive models
Do not write something like "the regression proves the predictive power of regressor x_i"

Overfitting
When the learning algorithm is run for too long, the learner may adjust to very specific random features not related to the target function
Overfitting: performance on the training data (in gray) still increases, while the performance on unseen data (in red) becomes worse
(Slide figures: a fitted curve and the corresponding mean squared error over model flexibility.)

Overfitting in R
Given data x to predict y
plot(x, y, pch = 20)
(Slide figure: scatter plot of x and y.)
Estimate polynomials of order P = 1, ..., 6 and compare the errors on training and testing data
inTrain <- runif(length(x)) < 0.2
err.fit <- c()
err.pred <- c()
for (i in 1:6) {
  # Explanatory model, fit on the training subset
  d <- data.frame(xx = x[inTrain], yy = y[inTrain])
  m <- lm(yy ~ poly(xx, i, raw=TRUE), data=d)
  err.fit[i] <- sqrt(mean(m$residuals^2))
  # Predictive model: evaluate on the held-out data
  err <- predict(m, newdata=data.frame(xx = x[-inTrain])) - y[-inTrain]
  err.pred[i] <- sqrt(mean(err^2))
}

Overfitting in R
Performance on the training data (in gray) still increases, while the performance on unseen data (in red) becomes worse
plot(1:6, err.fit, type='l', ylim=c(3.5, 31), col="gray", xlab="Order of polynomial", ylab="Error on training/testing data")
lines(1:6, err.pred, col="firebrick3")
(Slide figure: the resulting error curves over polynomial order.)

Support Vector Regression in R
Support vector regression (SVR) predicts continuous values
Accessing weekly stock market returns for 21 years
library(ISLR)
data(Weekly)
inTrain <- runif(nrow(Weekly)) < 0.2
When training, use type="eps-regression" instead
model <- svm(Today ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data=Weekly[inTrain,], type="eps-regression")
Compute the root mean square error (RMSE)
pred <- predict(model, Weekly[-inTrain, ])
sqrt(mean((pred - Weekly$Today[-inTrain])^2))
given by RMSE = sqrt( (1/N) Σ_{i=1}^{N} (pred_i − true_i)^2 )
and compare to the standard deviation of σ = 2.3536 (N in the denominator, not N − 1)


Summary: Data Mining with Linear Discriminants
Linearly separable: data can be perfectly classified by a linear discriminant
Artificial neural network: feeding information through the network; nnet(formula, data=d, size=n, ...)
Support vector machine: maximize the separating margin; svm(formula, data=d, type=...)
Confusion matrix: tabular outline of correct/incorrect guesses
Predictive performance: measured by accuracy, precision, ...
ROC curve: visualizes the trade-off between sensitivity and specificity
Overfitting: performance on training data increases, different from unseen observations