Precision, Recall and F1 measure
This is a brief explanation of the implementation of the Precision, Recall and F1-measures in GERBIL with a focus on its corner cases.
In some rare cases, the calculation of precision or recall can cause a division by 0. For the precision, this can happen if the answer of an annotator contains no results and, thus, both the true positives and the false positives are 0. For these special cases, we have defined that if the true positives, false positives and false negatives are all 0, the precision, recall and F1-measure are 1. This can occur when the gold standard contains a document without any annotations and the annotator (correctly) returns no annotations. If the true positives are 0 and at least one of the two other counters is larger than 0, the precision, recall and F1-measure are 0.
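The following sketch illustrates this rule. The class and method names are only illustrative and do not reflect GERBIL's actual implementation; they merely show how the three counters can be turned into the three measures, including the corner cases described above.

```java
/**
 * Illustrative sketch (not GERBIL's actual code) of deriving precision,
 * recall and F1-measure from the three counters of a single document.
 */
public class PRFSketch {

    /** Returns {precision, recall, f1} for the given counters. */
    static double[] calculate(int truePositives, int falsePositives, int falseNegatives) {
        if (truePositives == 0) {
            if (falsePositives == 0 && falseNegatives == 0) {
                // Empty gold standard and empty annotator answer: defined as a perfect result.
                return new double[] { 1.0, 1.0, 1.0 };
            }
            // No true positives but at least one false positive or false negative.
            return new double[] { 0.0, 0.0, 0.0 };
        }
        double precision = (double) truePositives / (truePositives + falsePositives);
        double recall = (double) truePositives / (truePositives + falseNegatives);
        double f1 = 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }
}
```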
Since every dataset consists of many single documents, we implemented Micro and Macro versions of Precision, Recall and F1-measure. Here, we will briefly explain the difference between micro and macro precision. For the complete equations, take a look at [1].
For computing the micro precision, the true positives and false positives of all documents are summed up. These sums are used to calculate a single micro precision value.
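Written out, using $\mathrm{tp}_d$ and $\mathrm{fp}_d$ as (our own) notation for the counters of a single document $d$ in the dataset $D$, this corresponds to the usual definition (see [1] for the exact equations used in GERBIL):

$$P_{micro} = \frac{\sum_{d \in D} \mathrm{tp}_d}{\sum_{d \in D} \mathrm{tp}_d + \sum_{d \in D} \mathrm{fp}_d}$$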
In contrast, a single precision value can be calculated for every single document. The macro precision is the average of these per-document precision values.
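Using the same notation, the macro precision is the mean of the per-document precisions, where each $P_d$ is computed with the corner-case rules described above:

$$P_{macro} = \frac{1}{|D|} \sum_{d \in D} P_d, \qquad P_d = \frac{\mathrm{tp}_d}{\mathrm{tp}_d + \mathrm{fp}_d}$$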
We can summarize that micro measures show the performance over the set of all annotations inside the dataset, while macro measures show the average performance per document. Thus, in some cases these measures can differ considerably. Let's assume that a dataset comprises three documents. Two of these documents have exactly 1 annotation that should be found by an annotator, while the third document does not have any annotations. Let's further assume that the annotator we are evaluating does not work well and always returns an empty result. The following table contains the counts for this example (tp = true positive, fp = false positive, fn = false negative, p = precision, r = recall, f1 = F1-score).
| | annotations | tp | fp | fn | p | r | f1 |
|---|---|---|---|---|---|---|---|
| doc 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| doc 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| doc 3 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| sums (micro) | 2 | 0 | 0 | 2 | 0 | 0 | 0 |
| avg (macro) | - | - | - | - | 1/3 | 1/3 | 1/3 |
It can be seen that while the micro measures are all 0, the macro measures are 1/3.
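The small example below reproduces these numbers using the hypothetical `PRFSketch.calculate` helper from the sketch above; the aggregation code is again only illustrative, not GERBIL's implementation.

```java
public class MicroMacroExample {
    public static void main(String[] args) {
        // Per-document counters {tp, fp, fn} taken from the table above.
        int[][] documents = { { 0, 0, 1 }, { 0, 0, 1 }, { 0, 0, 0 } };

        int tpSum = 0, fpSum = 0, fnSum = 0;
        double[] macroSums = new double[3];
        for (int[] doc : documents) {
            tpSum += doc[0];
            fpSum += doc[1];
            fnSum += doc[2];
            double[] prf = PRFSketch.calculate(doc[0], doc[1], doc[2]);
            for (int i = 0; i < 3; i++) {
                macroSums[i] += prf[i];
            }
        }

        // Micro measures: computed once from the summed counters.
        double[] micro = PRFSketch.calculate(tpSum, fpSum, fnSum);
        System.out.printf("micro: p=%.3f r=%.3f f1=%.3f%n", micro[0], micro[1], micro[2]);
        // prints: micro: p=0.000 r=0.000 f1=0.000

        // Macro measures: averages of the per-document values.
        System.out.printf("macro: p=%.3f r=%.3f f1=%.3f%n",
                macroSums[0] / documents.length,
                macroSums[1] / documents.length,
                macroSums[2] / documents.length);
        // prints: macro: p=0.333 r=0.333 f1=0.333
    }
}
```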
[1] Ricardo Usbeck, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin Brümmer, Diego Ceccarelli, Marco Cornolti, Didier Cherix, Bernd Eickmann, Paolo Ferragina, Christiane Lemke, Andrea Moro, Roberto Navigli, Francesco Piccinno, Giuseppe Rizzo, Harald Sack, René Speck, Raphaël Troncy, Jörg Waitelonis and Lars Wesemann. GERBIL -- General Entity Annotator Benchmarking Framework. In Proceedings of the International World Wide Web Conference (WWW) (Practice & Experience Track), ACM (2015).