Statistics for classification


1 AstroInformatics

2 Statistics for classification
A useful representation is the confusion matrix. The element on row i and column j is the absolute number, or the percentage, of cases of true class i that the classifier assigned to class j. The main diagonal contains the correctly classified cases; all the other entries are errors.

True \ Predicted     A     B     C   Total   Accuracy
A                   60    14    13      87      69.0%
B                    —    34     —      60      56.7%
C                    —     —    42      53      79.2%
Total                                  200      68.0%

In the training set there are 200 cases. Class A contains 87 cases: 60 correctly classified as A, and 27 misclassified, of which 14 as B and 13 as C. On class A the accuracy is 60 / 87 = 69.0%; on class B it is 34 / 60 = 56.7%, and on class C it is 42 / 53 = 79.2%. The mean accuracy is (60 + 34 + 42) / 200 = 136 / 200 = 68.0%. The errors amount to 32%, i.e. 64 cases out of 200. The value of this classification depends not only on the percentages, but also on the cost of each type of error: for example, if C is the class that is most important to classify well, the result can be considered positive.
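As a quick check of the slide's arithmetic, here is a minimal sketch in Python/numpy; it uses only the entries reported above, since the off-diagonal counts for rows B and C are not given:

```python
import numpy as np

# Known entries of the confusion matrix above (rows = true classes A, B, C).
diagonal = np.array([60, 34, 42])     # correctly classified per class
row_totals = np.array([87, 60, 53])   # number of true cases per class

per_class_accuracy = diagonal / row_totals
print(per_class_accuracy)                 # [0.690 0.567 0.792]
print(diagonal.sum() / row_totals.sum())  # 0.68 -> mean accuracy
```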

3 Confusion Matrix
A binary classifier has two possible output classes. The response is also known as:
o Output variable;
o Label;
o Target;
o Dependent variable.
Let's now define the most basic terms:
true positives (TP): we predicted yes, and the cases are actually yes.
true negatives (TN): we predicted no, and the cases are actually no.
false positives (FP): we predicted yes, but the cases are actually no.
false negatives (FN): we predicted no, but the cases are actually yes.
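A minimal sketch of how these four counts can be obtained from binary label vectors (the labels and predictions here are invented for illustration):

```python
import numpy as np

# Hypothetical binary labels and predictions; 1 = "yes", 0 = "no".
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))  # predicted yes, actually yes
TN = np.sum((y_pred == 0) & (y_true == 0))  # predicted no, actually no
FP = np.sum((y_pred == 1) & (y_true == 0))  # predicted yes, actually no
FN = np.sum((y_pred == 0) & (y_true == 1))  # predicted no, actually yes
print(TP, TN, FP, FN)  # 3 3 1 1
```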

4 Classification estimators
There is a list of basic rates often computed from a confusion matrix for a binary classifier:
Classification accuracy: fraction of patterns (objects) correctly classified, with respect to the total number of objects in the sample;
Purity/Completeness: fraction of objects correctly classified for each class (with respect to the objects assigned to the class, or to the true members of the class, respectively);
Contamination: fraction of objects erroneously classified, for each class;
DICE: the Sørensen–Dice index, also known as F1-score, is a statistic used for comparing the similarity of two samples:

$$\mathrm{DICE} = \frac{2\,|X \cap Y|}{|X| + |Y|} = \frac{2AB}{A^2 + B^2} = \frac{2TP}{2TP + FP + FN}$$

These five basic quality evaluation criteria all exploit the output representation given by the confusion matrix.
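For instance, a one-line implementation of the DICE/F1 formula from the confusion-matrix counts (shown here with the counts used on the next slide):

```python
def dice(tp, fp, fn):
    """Sørensen-Dice index (F1-score) from confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)

print(dice(tp=100, fp=10, fn=5))  # 0.930...
```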

5 Classification estimators
More in general:
Accuracy: overall, how often is the classifier correct? (TP+TN)/total = (100+50)/165 = 0.91
Misclassification Rate: overall, how often is it wrong? (FP+FN)/total = (10+5)/165 = 0.09; equivalent to 1 - Accuracy; also known as "Error Rate"
True Positive Rate: when it's actually yes, how often does it predict yes? TP/actual yes = 100/105 = 0.95; also known as "Sensitivity", "Recall" or "Completeness"
False Positive Rate: when it's actually no, how often does it predict yes? FP/actual no = 10/60 = 0.17
Specificity: when it's actually no, how often does it predict no? TN/actual no = 50/60 = 0.83; equivalent to 1 - FPR
Precision (Purity): when it predicts yes, how often is it correct? TP/predicted yes = 100/110 = 0.91
Prevalence: how often does the yes condition actually occur in our sample? actual yes/total = 105/165 = 0.64
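All of these rates follow directly from the four counts; a small sketch verifying the slide's numbers (TP=100, TN=50, FP=10, FN=5):

```python
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN   # 165

print((TP + TN) / total)    # 0.909 -> Accuracy
print((FP + FN) / total)    # 0.091 -> Misclassification (error) rate
print(TP / (TP + FN))       # 0.952 -> True Positive Rate (recall)
print(FP / (FP + TN))       # 0.167 -> False Positive Rate
print(TN / (FP + TN))       # 0.833 -> Specificity = 1 - FPR
print(TP / (TP + FP))       # 0.909 -> Precision (purity)
print((TP + FN) / total)    # 0.636 -> Prevalence
```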

6 ROC curve
Together with the DICE estimator, another useful operator is the ROC curve. The ROC (Receiver Operating Characteristic) is a statistical estimator used to assess the predictive power of a binary classifier (e.g. a logistic regression model). It comes from signal theory but is heavily used in analytics: it is a graph that summarizes the performance of a classifier over all possible thresholds. It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as the threshold for assigning observations to a given class is varied.

7 ROC curve
To draw a ROC curve, only TPR and FPR are needed. TPR defines how many correct positive results occur among all positive samples available during the test. FPR, on the other hand, defines how many incorrect positive results occur among all negative samples available during the test. A ROC space is defined by FPR and TPR as the x and y axes respectively, and shows the relative trade-offs between true positives (benefits) and false positives (costs). Since TPR is equivalent to sensitivity and FPR is equal to 1 − specificity, the ROC graph is sometimes called the sensitivity vs (1 − specificity) plot. Each prediction result, or instance of a confusion matrix, represents one point in the ROC space.

8 ROC curve
The best possible prediction method would yield a point in the upper left corner, at coordinate (0,1) of the ROC space, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives). The (0,1) point is also called a perfect classification. A completely random guess would give a point along the diagonal line (the so-called line of no-discrimination) from the bottom-left to the top-right corner, regardless of the positive and negative base rates. An intuitive example of random guessing is a decision made by flipping coins. As the sample size increases, a random classifier's ROC point migrates towards (0.5, 0.5).

9 ROC classifier estimation
In order to compare arbitrary classifiers, the Receiver Operating Characteristic (ROC) curve plots may give a quick evaluation of their behavior. The overall effectiveness of the algorithm is measured by the area under the ROC curve, where an area of 1 represents a perfect classification, while an area of 0.5 indicates a useless result (i.e. like a flipped coin). It is obtained by varying the threshold used to discriminate among classes. If the target labels lie in the [0,1] range, the ROC plot is built by calculating the pair (FPR, TPR) for each value of a threshold (e.g. 0, 0.1, 0.15, 0.2, ..., 0.95, 1) and plotting all these results, thus describing a curve. The ROC value is therefore the area under that curve.
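A sketch of the threshold sweep described above, with invented labels and scores (scikit-learn's roc_curve and roc_auc_score provide the same functionality ready-made):

```python
import numpy as np

def roc_points(y_true, scores, thresholds):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold."""
    pts = []
    for thr in thresholds:
        y_pred = (scores >= thr).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        pts.append((fp / (fp + tn), tp / (tp + fn)))
    return np.array(pts)

# Invented binary labels and scores in [0, 1] for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5])

pts = roc_points(y_true, scores, np.linspace(0, 1, 21))
order = np.argsort(pts[:, 0])                  # sort by FPR before integrating
auc = np.trapz(pts[order, 1], pts[order, 0])   # area under the curve
print(auc)  # 1 = perfect classification, 0.5 = random guessing
```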

10 Probability Density Function
In regression experiments, where the goal is to predict a distribution based on a restricted sample of a true population (the KB, or Knowledge Base), the usual mechanism is to infer the knowledge acquired on the true sample through a model able to learn the hidden and unknown correlation between the data parameter space and the expected output. A typical example in astrophysics is the prediction of the photometric redshift for millions of sky objects, by learning the hidden correlation between the multi-band photometric fluxes and the spectroscopic redshift (almost precise, thus considered the true redshift). The true redshift is usually known only for a very limited sample of objects (those spectroscopically observed). The advantage of predicting the photo-z is that in real cases photometric catalogues are much easier to obtain than very complex and expensive spectroscopic observation runs and their reduction. Forcing a model to predict a single-point estimate y = f(x) may yield largely inaccurate predictions (outliers), while a prediction based on a Probability Density Function PDF(x) may reduce or minimize physical bias (systematic errors) as well as the occurrence of outliers. In other words, a PDF improves performance, at the price of a more complex mechanism to infer and calculate the prediction.

11 Probability Density Function
The importance of the PDF is that in most real-life problems it is impossible to answer questions like: given the distribution function of a random variable X, what is the probability that X is exactly equal to a certain value n? We can better answer questions like: what is the probability that X lies between n−a and n+a? This corresponds to calculating the area under the density over the interval [n−a, n+a]: the probability is an integral value, not a single point! There exists a plethora of statistical methods which can produce a PDF for analytical problems approached by traditional (deterministic/probabilistic) models. But for models for which an analytical expression y = f(x) does not exist (such as machine learning models), it is extremely difficult to find a well-posed PDF(x), since it is intrinsically complex to separate the error due to the model itself from the error embedded in the data. And, importantly, a PDF is well-posed only for known (continuous) probability distributions.
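A small sketch of this point for a standard Gaussian (assuming scipy is available): the probability of an interval is the integral of the density, while any single point has probability zero. The centre n and half-width a are arbitrary illustrative values:

```python
from scipy.integrate import quad
from scipy.stats import norm

n, a = 0.5, 0.2  # arbitrary centre and half-width of the interval

# P(n - a <= X <= n + a) via the cumulative distribution function...
prob_cdf = norm.cdf(n + a) - norm.cdf(n - a)
# ...and as an explicit integral of the PDF over [n - a, n + a].
prob_int, _ = quad(norm.pdf, n - a, n + a)
print(prob_cdf, prob_int)  # both ~0.140; P(X == n) is exactly 0
```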

12 Confidence Statistics
As said before, p(z) cannot be verified on a single-point basis. A large sample, however, does support p(z) verification. Let's assume we have a population of N observed objects. For a limited number of them we know their real nature. To the others we applied an estimation model which predicted their nature with a residual uncertainty, i.e. it provides a probability p(z). We want to verify the reliability and accuracy of such estimation. What do we expect? About 1% of the predictions should be nearly perfect, i.e. their p(z) extremely close to the real value with at least a 99% confidence level. What happens if such an amount occurs for, let's say, about 40%? The model estimation suffers from overconfidence, i.e. predictions more precise than the supported evidence! What happens in the opposite case (i.e. less than 0.5%)? The model is underconfident! In many astrophysical cases, astronomers spend much time calibrating their measurements, keeping the error budget of the observation quality under strict control. They remove most of the bias (systematic effects) sources, tune the physical models, increase the statistics of the used samples, compare with past experience, use empirical ("magic") rules of thumb, etc. This means that in most cases the results are overconfident.

13 Confidence statistics
The key idea is the concept of the confidence interval. Let's suppose we have a statistical variable X distributed over a population with mean µ and variance σ². We want to build a confidence interval for µ at level 1−α, based on a simple random sample (x₁, ..., xₙ) of size n. The quantity 1−α is called the confidence level. In practice, we want to find an interval of values which contains the true value of the parameter µ. First, we have to distinguish the case in which the variance is known from the one in which it is unknown.

14 Confidence Statistics
Variance known (rare case in the real world)
The sampling mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1)$$

is a random variable distributed approximately like a Gaussian $N(\mu, \sigma^2/n)$, and this approximation improves by extending the sample size n. The quantity $\sigma^2/n$ measures the precision of the estimator (1): as n increases, $\sigma^2/n$ decreases and (1) becomes more precise. Hence $\bar{x} \sim N(\mu, \sigma^2/n)$ implies

$$Z = \frac{\bar{x} - \mu}{\sqrt{\sigma^2/n}} \sim N(0, 1)$$

so we can use the z-score of a standard normal.

15 Confidence Statistics
Therefore, for each probability value 1−α, we can write:

$$P\left(-z_{\alpha/2} \le \frac{\bar{x} - \mu}{\sqrt{\sigma^2/n}} \le z_{\alpha/2}\right) = 1 - \alpha \qquad (2)$$

where $z_{\alpha/2}$ is the quantile of the Gaussian distribution of order 1−α/2, i.e. the point leaving a left area under the Gaussian equal to 1−α/2. The values of the quantiles are usually tabulated for each distribution.

16 Confidence Statistics
Quantiles of a standard Gaussian distribution (table not reproduced here). The table reports the quantiles of order p₀ + p₁ of a distribution N(0,1), with p₀ on the rows and p₁ on the columns. Remember that a standard Gaussian is symmetric around zero, so the quantiles with p < 0.5 can be obtained from those of order 1−p by a change of sign (see the example below).
Example: to obtain the quantile of order 0.975 of an N(0,1) means to find x such that $P(N(0,1) \le x) = 0.975$; the table gives x = 1.96. The symmetric value corresponds to $P(N(0,1) \le x) = 0.025$, giving x = −1.96. Therefore x = ±1.96 are the quantiles of order 0.975 and 0.025 of the distribution N(0,1).
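In practice the table lookup can be replaced by the inverse CDF; a sketch with scipy:

```python
from scipy.stats import norm

print(norm.ppf(0.975))  #  1.96  -> quantile of order 0.975
print(norm.ppf(0.025))  # -1.96  -> symmetric quantile of order 0.025
print(norm.cdf(1.96))   #  0.975 -> inverse check
```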

17 Confidence Statistics
The confidence interval can then be built by expanding the formula

$$P\left(-z_{\alpha/2} \le \frac{\bar{x} - \mu}{\sqrt{\sigma^2/n}} \le z_{\alpha/2}\right) = 1 - \alpha \qquad (2)$$

which yields the confidence interval(s)

$$\left[\bar{x} - z_{\alpha/2}\sqrt{\frac{\sigma^2}{n}},\; \bar{x} + z_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right] \qquad (3)$$

In other words, the probability that the intervals (3) contain the true value of the mean µ of the population is approximately equal to the confidence level 1−α.

18 Confidence Statistics
The confidence level 1−α indicates the «level» of the coverage given by the confidence intervals (3). In other words, there always exists a residual probability α that the sampled data come from a population whose mean lies outside those intervals. Consider that (3) is centered on the mean estimate x̄, with a radius equal to $z_{\alpha/2}\sqrt{\sigma^2/n}$, whose length depends on the desired level of coverage (i.e. on the chosen quantile) and on the precision of the estimator, measured by $\sqrt{\sigma^2/n}$, which is called the standard error of the estimate. We speak about multiple confidence interval(s) because any choice of the quantile determines a different confidence interval. Let's see an example.

19 Confidence Statistics - Example
From an observed image, after reduction, we calculated the absolute magnitudes of the brightest stars present in that sky region. We know that these magnitudes are distributed with a variance of σ² = 16 squared magnitudes. We want to calculate a confidence interval with a confidence level of 95% (~2σ) for the mean of the magnitudes. Let's consider 10 stars with absolute magnitudes: 7.36, 11.91, 12.91, 9.77, 5.99, 10.91, 9.57, 11.01, 6.11, … We start from the sampling mean and its standard error:

$$\bar{m} = \frac{1}{10}\sum_{i=1}^{10} m_i = 9.7658 \qquad \sqrt{\frac{\sigma^2}{n}} = \sqrt{\frac{16}{10}} = 1.2649$$

Since we fixed a confidence level of 95%, 1−α = 0.95 and consequently α = 0.05. Therefore the desired quantile is $z_{\alpha/2} = z_{0.05/2} = z_{0.025} = 1.96$. The radius of the confidence interval is then given by $z_{\alpha/2}\sqrt{\sigma^2/n} = 1.96 \times 1.2649 = 2.4792$. Therefore the confidence interval is [(9.7658 − 2.4792), (9.7658 + 2.4792)] = [7.2866, 12.2450]. We have 95% confidence that the true value of the mean magnitude of the bright stars in that sky region is between 7.29 and 12.25.

20 Confidence Statistics - Example
What happens if we increase the confidence level to about 3σ (99.7%)? We start from the same sampling mean and standard error:

$$\bar{m} = 9.7658 \qquad \sqrt{\frac{\sigma^2}{n}} = \sqrt{\frac{16}{10}} = 1.2649$$

Since now the confidence level is 99.7%, 1−α = 0.997 and consequently α = 0.003. Therefore the desired quantile is $z_{\alpha/2} = z_{0.003/2} = z_{0.0015} \approx 2.9677$. The radius of the confidence interval is then given by $z_{\alpha/2}\sqrt{\sigma^2/n} = 2.9677 \times 1.2649 = 3.7539$. Therefore the confidence interval is [(9.7658 − 3.7539), (9.7658 + 3.7539)] = [6.0119, 13.5197]. We have 99.7% confidence that the true value of the mean magnitude of the bright stars in that sky region is between 6.01 and 13.52. In practice, by increasing the confidence level the radius of the confidence interval is increased. This is obvious, since a higher confidence implies enlarging the interval for the true value of the mean estimator.
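Both intervals can be reproduced programmatically; a minimal sketch using scipy (the tenth magnitude is missing from the list above, so the sample mean is taken as quoted in the example rather than recomputed from the data):

```python
import numpy as np
from scipy.stats import norm

m_bar = 9.7658             # sampling mean quoted in the example
se = np.sqrt(16.0 / 10)    # standard error sqrt(sigma^2 / n) = 1.2649

for level in (0.95, 0.997):
    z = norm.ppf(1 - (1 - level) / 2)      # quantile z_{alpha/2}
    print(level, (m_bar - z * se, m_bar + z * se))
# 0.95  -> (7.2866, 12.2450)
# 0.997 -> (6.0119, 13.5197)
```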

21 Confidence Statistics
What changes if the variance is unknown? In the most frequent cases of the real world, a precise estimate of the variance of a population is difficult (if not impossible) to obtain.
Variance unknown (real world)
The formula

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2\right) \qquad (4)$$

is the sampling variance corrected by the factor n/(n−1), due to the fact that for small samples the sampling variance is a distorted (biased) estimate, whose precision increases with the sample size n. For large samples n/(n−1) → 1 and (4) becomes the standard expression of the variance. In such cases, to obtain a correct confidence interval for the mean µ of the population, we must consider that the random variable

$$\frac{\bar{x} - \mu}{\sqrt{s^2/n}}$$

follows the Student's t distribution with n−1 Degrees of Freedom (DoF), where n is the size of the extracted sample.

22 Student t-distribution
The Student t-distribution describes small samples drawn from a full population that is normally distributed. It is useful to evaluate the difference between two sample means and to assess their statistical significance. It occurs whenever the following random variable is considered:

$$t = \frac{\bar{X} - \mu}{\sqrt{S^2/N}} \qquad (5)$$

with the sample variance $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})^2$. This statistic, if the sample is Gaussian, is the ratio between a standard normal N(0,1) and the square root of a Chi-square divided by its n−1 degrees of freedom. The Student's t is symmetric around zero like a Gaussian, but has «heavier tails» than a normal distribution, i.e. values far from 0 have a higher probability of being extracted than in the case of a standard Gaussian distribution. Such differences decrease as the sample size n increases: in the figure (not reproduced here), as the degrees of freedom ν grow, the Student's t approximates the Gaussian N(0,1).

23 Student t-distribution
The construction of a confidence interval for the mean estimator is similar to the previous case, and here too the quantiles play a key role. Taking the quantile of order 1−α/2 of the Student's t distribution with n−1 DoF, denoted $t_{n-1,\alpha/2}$, the confidence interval is derived by the usual chain of inequalities:

$$\left[\bar{x} - t_{n-1,\alpha/2}\sqrt{\frac{s^2}{n}},\; \bar{x} + t_{n-1,\alpha/2}\sqrt{\frac{s^2}{n}}\right]$$

We obtain that the probability that this interval contains the true value of the mean µ of the population is approximately equal to 1−α. Note that the t quantile is larger than the corresponding Gaussian quantile (the t has heavier tails), so for a given sample the radius is typically larger than the one obtained in the known-variance case.
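A sketch of the unknown-variance interval with scipy's Student t, reusing the nine legible magnitudes from the earlier example as a stand-in sample:

```python
import numpy as np
from scipy.stats import t

x = np.array([7.36, 11.91, 12.91, 9.77, 5.99, 10.91, 9.57, 11.01, 6.11])
n, x_bar = len(x), x.mean()
s2 = x.var(ddof=1)   # corrected sample variance, eq. (4)

radius = t.ppf(1 - 0.05 / 2, df=n - 1) * np.sqrt(s2 / n)  # 95% level
print(x_bar - radius, x_bar + radius)
```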

24 Recap of Confidence Statistics
By summarising (under the hypothesis that the population is normally distributed):
Known variance: $\bar{x} \pm z_{\alpha/2}\sqrt{\sigma^2/n}$
Unknown variance: $\bar{x} \pm t_{n-1,\alpha/2}\sqrt{s^2/n}$
In many real situations it is preferable to infer an interval for any parameter estimate, rather than a single value; such an interval should also indicate the error associated with the estimate. A confidence interval for any parameter θ (such as the mean or the variance) of a population is an interval, bounded by two limits $L_{inf}$ and $L_{sup}$, with a defined probability (1−α) of containing the true parameter of the population:

$$P(L_{inf} < \theta < L_{sup}) = 1 - \alpha$$

where 1−α is the confidence level and α the error probability.

25 PDF statistics
As underlined before, p(z) cannot be verified on a single-point basis. A large sample, however, does support p(z) verification. Let's assume we have a population of N observed objects. For a limited number of them we know their real nature. To the others we applied an estimation model which predicted their nature with a residual uncertainty, i.e. it provides a probability p(z). We want to verify the reliability and accuracy of such estimation. The key concept is the confidence interval (CI). We can analyze the over/under-confidence of our model prediction by checking whether x% of the samples have their true value within their x% CI, y% have the true value within their y% CI, etc. This can be done by calculating and plotting the Empirical Cumulative Distribution Function (ECDF) after having obtained the p(z) for our model (known as the posterior probability).

26 Empirical Cumulative Distribution Function
An empirical cumulative distribution function (ECDF) is a non-parametric estimator of the underlying CDF of a random variable. It assigns a probability to each datum, orders the data from smallest to largest in value, and calculates the sum of the assigned probabilities up to and including each datum. The result is a step function that increases at each datum. The ECDF is usually denoted by $F_n(x)$ or $P(X \le x)$ and is defined as

$$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i \le x)$$

where $I(\cdot)$ is the indicator function:

$$I(x_i \le x) = \begin{cases} 1, & x_i \le x \\ 0, & x_i > x \end{cases}$$

Essentially, to calculate the value of $F_n(x)$ at x, simply (1) count the number of data less than or equal to x, and (2) divide the number found by the total number n of data in the sample.
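A direct numpy implementation of this two-step recipe (a sketch; with tied values, the ECDF at the tie is the last, largest step):

```python
import numpy as np

def ecdf(data):
    """Sorted data and F_n evaluated at each datum: count(x_i <= x) / n."""
    x = np.sort(data)
    return x, np.arange(1, len(x) + 1) / len(x)

x, f = ecdf(np.array([3.1, 1.2, 4.8, 2.5, 2.5]))
print(x)  # [1.2 2.5 2.5 3.1 4.8]
print(f)  # [0.2 0.4 0.6 0.8 1. ]
```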

27 ECDF Empirical Cumulative Distribution Function
The ECDF is useful because:
it approximates the true CDF well if the sample size (the number of data) is large, and knowing the distribution is helpful for statistical inference;
a plot of the ECDF can be visually compared to the known CDFs of frequently used distributions, to check whether the data came from one of those common distributions;
it can visually display how fast the CDF increases to 1; hence, it can be useful to get a feel for the data, for example to check for over- or under-confidence of any prediction (Wittman et al. 2016).
