Machine Learning: Evaluation



1 Machine Learning: Evaluation
Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Winter Semester 2007/2008

2 Comparison of Algorithms

3 Comparison of Algorithms
Is algorithm A better than algorithm B?
- The answer depends on the task and the dataset.
- An observed difference might be due to the limited size of the dataset.
- The same quality measure and the same dataset(s) should be used to evaluate A and B.

4 Comparison Schemes: Paired Test
Both algorithms are trained on the same n training datasets and evaluated on the same n test datasets. E.g. 10-fold CV:
1. Train each of A and B on folds $f_2, \ldots, f_n$ and evaluate each on $f_1$. We get quality measures $a_1$ and $b_1$.
2. Train each of A and B on folds $f_1, f_3, \ldots, f_n$ and evaluate each on $f_2$. We get quality measures $a_2$ and $b_2$.
3. ...
We get results $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$. The results $a_i$ and $b_i$ are paired. Pairing reduces the variance of the quality estimates.
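The paired scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`k_fold_indices`, `paired_cv`), the two toy learners, and the synthetic data are all hypothetical and not part of the lecture. The point it shows is that both algorithms see exactly the same training and test folds, so the i-th results belong together.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k disjoint folds after a seeded shuffle."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def paired_cv(data, labels, algo_a, algo_b, k=10, seed=0):
    """Return paired accuracy lists (a_1..a_k) and (b_1..b_k) from the same folds."""
    folds = k_fold_indices(len(data), k, seed)
    a_scores, b_scores = [], []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(data)) if i not in held_out]
        x_train = [data[i] for i in train_idx]
        y_train = [labels[i] for i in train_idx]
        # Both algorithms are trained and evaluated on the identical split.
        for algo, scores in ((algo_a, a_scores), (algo_b, b_scores)):
            predict = algo(x_train, y_train)  # training returns a predictor
            hits = sum(predict(data[i]) == labels[i] for i in test_idx)
            scores.append(hits / len(test_idx))
    return a_scores, b_scores

# Two hypothetical toy learners, used only to make the sketch runnable.
def majority_class(x_train, y_train):
    most_common = max(set(y_train), key=y_train.count)
    return lambda x: most_common

def nearest_neighbour(x_train, y_train):
    def predict(x):
        j = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
        return y_train[j]
    return predict

# Made-up 1-D data: the label is 1 exactly when the feature exceeds 0.5.
rng = random.Random(1)
xs = [rng.random() for _ in range(100)]
ys = [int(x > 0.5) for x in xs]
a, b = paired_cv(xs, ys, majority_class, nearest_neighbour, k=10)
```

The lists `a` and `b` play the role of $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$ and can be passed directly to a paired test.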

5 Comparison Schemes: Unpaired Test
Given are n results for method A and m results for method B.
E.g. one researcher has implemented logistic regression and evaluated it on the iris dataset with 10-fold CV. Another researcher has implemented a decision tree and evaluated it on iris with 5-fold CV. They can compare their results with an unpaired test.

6 Hypothesis Testing
Hypothesis testing is a statistical method for comparing two test series.
- Hypothesis $H_0$: there is no difference. $H_0$ is also called the null hypothesis.
- The hypothesis is tested at a significance level $\alpha$, e.g. 5%.
- A hypothesis test tries to falsify (reject) the null hypothesis.
- If we can reject the null hypothesis, the results are significantly different.

7 Hypothesis Testing
We deal with two testing schemes:
- Wald test: for normally distributed data; a general test.
- t-test: for testing means of normally distributed data; good for small sample sizes.

8 Paired Wald-Test
Data: n samples
- $X_1, \ldots, X_n$ for algorithm A
- $Y_1, \ldots, Y_n$ for algorithm B
$X_i$ and $Y_i$ are paired!
Hypothesis: $\delta = 0$

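9 Paired Wald-Test
- $Z_i := X_i - Y_i$ (because $X_i$ and $Y_i$ are paired!)
- $\delta = E(Z) = E(X) - E(Y)$
- $\delta$ is estimated by $\hat{\delta} = \bar{X} - \bar{Y} = \bar{Z}$
- $\hat{\delta}$ is a random variable
- The estimated standard error $\widehat{se}(\hat{\delta})$ of $\hat{\delta}$ is
$$\widehat{se}(\hat{\delta}) = \sqrt{\frac{s_Z^2}{n}} \quad \text{with} \quad s_Z^2 := \frac{1}{n-1} \sum_{i=1}^{n} (Z_i - \bar{Z})^2$$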

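10 Paired Wald-Test
The normalized random variable W describing the error is:
$$W := \frac{\hat{\delta}}{\widehat{se}(\hat{\delta})} = \frac{\bar{Z}}{\sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n} (Z_i - \bar{Z})^2}}$$
As $W \sim N(0, 1)$ approximately, we can conclude that the difference $\delta$ lies with probability of approximately $1 - \alpha$ in:
$$C_n = \bar{Z} \pm z_{\alpha/2} \sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n} (Z_i - \bar{Z})^2}$$
We can test our hypothesis $\delta = 0$:
- Method 1: If $0 \notin C_n$, then reject $H_0$, i.e. there should be a difference between the two algorithms.
- Method 2: If $|W| > z_{\alpha/2}$, then reject $H_0$.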
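A minimal sketch of the paired Wald test using only the Python standard library. The function name `paired_wald_test` and the fold accuracies are hypothetical, chosen for illustration; $z_{\alpha/2}$ is taken as the $1 - \alpha/2$ quantile of the standard normal via `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def paired_wald_test(x, y, alpha=0.05):
    """Paired Wald test of H0: delta = 0 for paired results x_i, y_i."""
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    z = [xi - yi for xi, yi in zip(x, y)]         # Z_i = X_i - Y_i
    n = len(z)
    delta_hat = mean(z)                           # estimate of delta
    se = sqrt(variance(z) / n)                    # variance() divides by n - 1
    w = delta_hat / se                            # normalized statistic W
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    ci = (delta_hat - z_crit * se, delta_hat + z_crit * se)
    return w, ci, abs(w) > z_crit

# Hypothetical accuracies on 5 folds (illustrative numbers only).
a = [0.81, 0.79, 0.84, 0.80, 0.83]
b = [0.78, 0.77, 0.80, 0.79, 0.80]
w, ci, reject = paired_wald_test(a, b)
```

Method 1 and Method 2 always agree: 0 lies outside the interval `ci` exactly when `abs(w)` exceeds the critical value.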

11 Unpaired Wald-Test
Data: n samples of A, m samples of B
- $X_1, \ldots, X_n$ for algorithm A
- $Y_1, \ldots, Y_m$ for algorithm B
The $X_i$ and $Y_j$ are independent!
Hypothesis: $\delta = 0$

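12 Unpaired Wald-Test
- $\delta = E(X) - E(Y)$
- $\delta$ is estimated by $\hat{\delta} = \bar{X} - \bar{Y}$
- $\hat{\delta}$ is a random variable
- The estimated standard error $\widehat{se}(\hat{\delta})$ of $\hat{\delta}$ is
$$\widehat{se}(\hat{\delta}) = \sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}} \quad \text{with} \quad s_U^2 := \frac{1}{k-1} \sum_{i=1}^{k} (U_i - \bar{U})^2$$
for a sample $U_1, \ldots, U_k$.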

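13 Unpaired Wald-Test
The normalized random variable W describing the error is:
$$W := \frac{\hat{\delta}}{\widehat{se}(\hat{\delta})} = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}}}$$
As $W \sim N(0, 1)$ approximately, we can conclude that the difference $\delta$ lies with probability of approximately $1 - \alpha$ in:
$$C_n = \bar{X} - \bar{Y} \pm z_{\alpha/2} \sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}}$$
We can test our hypothesis $\delta = 0$:
- Method 1: If $0 \notin C_n$, then reject $H_0$, i.e. there should be a difference between the two algorithms.
- Method 2: If $|W| > z_{\alpha/2}$, then reject $H_0$.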
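The unpaired case differs only in the standard error, which adds the two per-sample variance terms. The following sketch assumes hypothetical results (10 folds for A, 5 folds for B, echoing the two-researcher scenario); the function name `unpaired_wald_test` and all numbers are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def unpaired_wald_test(x, y, alpha=0.05):
    """Unpaired Wald test of H0: delta = 0 for independent samples x and y."""
    n, m = len(x), len(y)
    delta_hat = mean(x) - mean(y)                 # estimate of delta
    se = sqrt(variance(x) / n + variance(y) / m)  # s_X^2 / n + s_Y^2 / m
    w = delta_hat / se                            # normalized statistic W
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    ci = (delta_hat - z_crit * se, delta_hat + z_crit * se)
    return w, ci, abs(w) > z_crit

# Hypothetical accuracies: 10-fold CV for A, 5-fold CV for B.
a = [0.95, 0.93, 0.96, 0.94, 0.97, 0.92, 0.95, 0.96, 0.94, 0.93]
b = [0.94, 0.92, 0.95, 0.93, 0.96]
w, ci, reject = unpaired_wald_test(a, b)
```

For these made-up numbers the interval contains 0, so $H_0$ is not rejected.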

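14 t-tests
The Wald test is based on $W \sim N(0, 1)$. For small sample sizes, the approximation with a normal distribution is inaccurate. In fact, for small sample sizes W is t-distributed:
$$W \sim t_{n-1}$$
The density of the t-distribution with n degrees of freedom is
$$f(x) = \alpha_n \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}} \quad \text{with} \quad \alpha_n = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\, \Gamma\left(\frac{n}{2}\right)}$$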
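To see how the t-distribution relates to the normal, the density can be evaluated directly. This sketch assumes the standard form of the Student t density with n degrees of freedom, with normalizing constant $\Gamma(\frac{n+1}{2}) / (\sqrt{n\pi}\,\Gamma(\frac{n}{2}))$; the function name `t_pdf` is hypothetical. Log-gamma is used so that large n does not overflow.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x, n):
    """Density of Student's t-distribution with n degrees of freedom."""
    # Normalizing constant, computed via log-gamma for numerical stability.
    alpha_n = exp(lgamma((n + 1) / 2) - lgamma(n / 2)) / sqrt(n * pi)
    return alpha_n * (1 + x * x / n) ** (-(n + 1) / 2)

def normal_pdf(x):
    """Density of the standard normal N(0, 1), for comparison."""
    return exp(-x * x / 2) / sqrt(2 * pi)

# Tail mass is heavier for few degrees of freedom and approaches the
# normal density as n grows.
tail_small_n = t_pdf(3.0, 4)
tail_large_n = t_pdf(3.0, 1000)
tail_normal = normal_pdf(3.0)
```

The heavier tails for small n are the reason the critical values $t_{k,\alpha/2}$ are larger than $z_{\alpha/2}$.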

15 t-tests
t-tests can be performed by using $t_{k,\alpha}$ instead of $z_\alpha$, where k is the number of degrees of freedom.
Paired t-test:
- Degrees of freedom: $k = n - 1$
- Method: If $|W| > t_{k,\alpha/2}$, then reject $H_0$.
Unpaired t-test:
- Degrees of freedom: $k = \min\{n, m\} - 1$
- Method: If $|W| > t_{k,\alpha/2}$, then reject $H_0$.
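A small sketch of both t-tests at level $\alpha = 0.05$. Because the Python standard library has no t quantile function, the critical values $t_{k,0.025}$ are hard-coded from a standard t-table for $k \le 10$. The function names and the fold results are hypothetical and are not the blackboard example from the lecture.

```python
from math import sqrt
from statistics import mean, variance

# Two-sided critical values t_{k, alpha/2} for alpha = 0.05 (standard t-table).
T_CRIT_05 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}

def paired_t_test(x, y):
    """Paired t-test at alpha = 0.05; returns (W, k, reject H0?)."""
    z = [xi - yi for xi, yi in zip(x, y)]  # Z_i = X_i - Y_i
    n = len(z)
    w = mean(z) / sqrt(variance(z) / n)
    k = n - 1                              # degrees of freedom
    return w, k, abs(w) > T_CRIT_05[k]

def unpaired_t_test(x, y):
    """Unpaired t-test at alpha = 0.05 with k = min(n, m) - 1."""
    n, m = len(x), len(y)
    w = (mean(x) - mean(y)) / sqrt(variance(x) / n + variance(y) / m)
    k = min(n, m) - 1                      # degrees of freedom
    return w, k, abs(w) > T_CRIT_05[k]

# Hypothetical accuracies on 5 folds (illustrative numbers only).
a = [0.80, 0.82, 0.79, 0.85, 0.81]
b = [0.78, 0.83, 0.76, 0.82, 0.80]
w, k, reject = paired_t_test(a, b)
```

For these made-up numbers W is about 2.14 with k = 4. That exceeds the normal critical value of about 1.96 but not $t_{4,0.025} = 2.776$, so a Wald test would reject $H_0$ while the t-test does not.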

16 Example
Paired t-test with level $\alpha = 0.05$ for the results of Method A and Method B on fold 1, ..., fold 5 (see blackboard).
