MeshMetrics is as sweet as M&M's—what better way to measure the precision of your segmentation model's performance? 🪅
The evaluation of segmentation performance is a common task in biomedical image analysis, with its importance emphasized in the recently released metrics selection guidelines and computing frameworks. To quantitatively evaluate the alignment of two segmentations, researchers commonly resort to counting metrics, such as the Dice similarity coefficient, or distance-based metrics, such as the Hausdorff distance, which are usually computed by publicly available open-source tools under the inherent assumption that these tools provide consistent results. In this study, we questioned this assumption and performed a systematic implementation analysis, along with quantitative experiments on real-world clinical data, to compare 11 open-source tools for distance-based metrics computation against our highly accurate mesh-based reference implementation. The results revealed statistically significant differences among all open-source tools, which is both surprising and concerning, as it calls the validity of existing studies into question. Besides identifying the main sources of variation, we also provide recommendations for distance-based metrics computation.
Overview of our study, highlighting the key steps and methods used in the analysis.
Below is a simple usage example of MeshMetrics for 3D segmentation masks. For more examples, check out the examples.ipynb notebook.
```bash
sudo apt update && sudo apt install -y libxrender1
git clone https://github.com/gasperpodobnik/MeshMetrics.git
pip install MeshMetrics/
```
```python
from pathlib import Path

import SimpleITK as sitk

from MeshMetrics import DistanceMetrics

data_dir = Path("data")

# initialize DistanceMetrics object
dist_metrics = DistanceMetrics()

# read binary segmentation masks
ref_sitk = sitk.ReadImage(data_dir / "example_3d_ref_mask.nii.gz")
pred_sitk = sitk.ReadImage(data_dir / "example_3d_pred_mask.nii.gz")

# set input masks; spacing must be passed explicitly only if
# both inputs are numpy arrays or vtk meshes
dist_metrics.set_input(ref=ref_sitk, pred=pred_sitk)

# Hausdorff Distance (HD); by default, the percentile is set to 100 (equivalent to HD)
hd100 = dist_metrics.hd()
# 95th percentile HD
hd95 = dist_metrics.hd(percentile=95)

# Mean Average Surface Distance (MASD)
masd = dist_metrics.masd()
# Average Symmetric Surface Distance (ASSD)
assd = dist_metrics.assd()

# Normalized Surface Distance (NSD) with tau=2
nsd2 = dist_metrics.nsd(tau=2)
# Boundary Intersection over Union (BIoU) with tau=2
biou2 = dist_metrics.biou(tau=2)
```
Overview of the 11 open-source tools analyzed in this study, indicating the supported distance-based metrics. For HD$_p$, a checkmark (✓) denotes support for any percentile, whereas a specific number indicates the implementation of a predefined percentile (e.g., the 95th).
In their straightforward application, all distance-based metrics compare two segmentations, i.e., the reference and the predicted segmentation, represented by their boundaries $\partial\mathcal{A}$ and $\partial\mathcal{B}$.

Assuming $d(a, \partial\mathcal{B}) = \min_{b \in \partial\mathcal{B}} \lVert a - b \rVert$ denotes the distance from a point $a \in \partial\mathcal{A}$ to the closest point on $\partial\mathcal{B}$, HD is defined as the maximum of the two directed Hausdorff distances:

$$\mathrm{HD} = \max \Big\{ \max_{a \in \partial\mathcal{A}} d(a, \partial\mathcal{B}),\; \max_{b \in \partial\mathcal{B}} d(b, \partial\mathcal{A}) \Big\},$$

while its more outlier-robust variant HD$_p$ replaces the maxima over the directed distances with their $p$-th percentile.
Often referred to as the average surface distance, MASD computes the mean of the two average distances of sets $\partial\mathcal{A}$ and $\partial\mathcal{B}$:

$$\mathrm{MASD} = \frac{1}{2} \left( \frac{1}{|\partial\mathcal{A}|} \sum_{a \in \partial\mathcal{A}} d(a, \partial\mathcal{B}) + \frac{1}{|\partial\mathcal{B}|} \sum_{b \in \partial\mathcal{B}} d(b, \partial\mathcal{A}) \right),$$

whereas the closely related ASSD averages all distances over the union of both distance sets:

$$\mathrm{ASSD} = \frac{\sum_{a \in \partial\mathcal{A}} d(a, \partial\mathcal{B}) + \sum_{b \in \partial\mathcal{B}} d(b, \partial\mathcal{A})}{|\partial\mathcal{A}| + |\partial\mathcal{B}|}.$$
On the other hand, NSD and BIoU are defined using the boundary representation of the segmentation masks. Here, the term "boundary" is agnostic to dimensionality, referring to a contour line in 2D and a surface in 3D. Also referred to as the normalized surface Dice, NSD quantifies the overlap of the two segmentation boundaries, counting boundary points as matched if they lie within an acceptable tolerance $\tau$ of the opposing boundary:

$$\mathrm{NSD} = \frac{|\{a \in \partial\mathcal{A} : d(a, \partial\mathcal{B}) \le \tau\}| + |\{b \in \partial\mathcal{B} : d(b, \partial\mathcal{A}) \le \tau\}|}{|\partial\mathcal{A}| + |\partial\mathcal{B}|}.$$
Finally, BIoU enhances sensitivity to boundary segmentation errors compared to plain IoU or DSC, which often saturate for bulk overlaps between two large segmentations. A region $\mathcal{A}_\tau$ is obtained by retaining only the part of mask $\mathcal{A}$ that lies within distance $\tau$ of its boundary $\partial\mathcal{A}$ (and analogously $\mathcal{B}_\tau$ for mask $\mathcal{B}$), so that BIoU is the IoU of these two boundary bands:

$$\mathrm{BIoU} = \frac{|\mathcal{A}_\tau \cap \mathcal{B}_\tau|}{|\mathcal{A}_\tau \cup \mathcal{B}_\tau|}.$$
Mathematical definitions of distance-based metrics as adopted by the point-based definition PointDef, different open-source tools, and our reference implementation MeshMetrics.
The mathematical definitions of HD, HD$_p$, MASD, and ASSD rely on distances between the point sets describing the boundaries of the two segmentations, which is why we refer to them as point-based definitions (PointDef).
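To make PointDef concrete, below is a minimal NumPy/SciPy sketch (ours, not taken from any of the analyzed tools) that computes the point-based metrics from two boundary point sets, assuming every query point carries equal weight:

```python
import numpy as np
from scipy.spatial import cKDTree

def point_based_metrics(pts_a, pts_b, p=95, tau=2.0):
    """PointDef metrics between two boundary point sets (N x 2 or N x 3 arrays)."""
    # directed distances: each point's distance to the nearest point of the other set
    d_a2b = cKDTree(pts_b).query(pts_a)[0]
    d_b2a = cKDTree(pts_a).query(pts_b)[0]

    hd = max(d_a2b.max(), d_b2a.max())
    # HD_p as the maximum of the two directed p-th percentiles
    hdp = max(np.percentile(d_a2b, p), np.percentile(d_b2a, p))
    # MASD: mean of the two directed averages; ASSD: average over the union
    masd = 0.5 * (d_a2b.mean() + d_b2a.mean())
    assd = (d_a2b.sum() + d_b2a.sum()) / (d_a2b.size + d_b2a.size)
    # NSD: fraction of boundary points within tolerance tau of the other boundary
    nsd = ((d_a2b <= tau).sum() + (d_b2a <= tau).sum()) / (d_a2b.size + d_b2a.size)
    return hd, hdp, masd, assd, nsd
```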
The point-based definitions of distance-based metrics inherently neglect the spatial distribution of points across the boundaries, assuming they are uniformly distributed. Ensuring that each point used for distance calculation (referred to as the query point) corresponds to a boundary region of uniform size is practically challenging, and failing to do so introduces bias into the metrics computation.
For unbiased computation, the size of the boundary element, i.e., the length of the line segment in 2D or the area of the (triangular) surface element in 3D, must be taken into account at each query point. The distances calculated between query points and the opposing boundary must then be weighted by the corresponding boundary element sizes, as supported by both theoretical perspectives and experimental results.
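As an illustration, here is a small NumPy sketch (again ours, not from any analyzed tool) of element-size-weighted aggregation; `d_a2b` and `size_a` are hypothetical directed distances and the sizes of the boundary elements they represent:

```python
import numpy as np

def weighted_percentile(distances, weights, p):
    """p-th percentile of distances, each weighted by its boundary element size."""
    order = np.argsort(distances)
    d, w = distances[order], weights[order]
    # fraction of the total boundary size covered up to each sorted distance
    cum = np.cumsum(w) / np.sum(w)
    idx = min(np.searchsorted(cum, p / 100.0), d.size - 1)
    return d[idx]

# hypothetical query-point distances and element sizes (e.g. segment lengths in 2D)
d_a2b = np.array([0.4, 1.2, 0.8, 2.5])
size_a = np.array([0.5, 2.0, 1.0, 0.25])

naive_mean = d_a2b.mean()                          # assumes uniformly sized elements
weighted_mean = np.average(d_a2b, weights=size_a)  # unbiased w.r.t. element sizes
weighted_p95 = weighted_percentile(d_a2b, size_a, 95)
```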
A detailed analysis of open-source tools revealed notable differences in metrics calculation strategies (see figure with equations above) and the applied boundary extraction methods (see figure below).
A critical pitfall in the implementation of distance-based metrics is boundary extraction, where the first issue is the foreground- vs. boundary-based calculation dilemma. Among the 11 open-source tools, EvaluateSegmentation and SimpleITK are the only two that omit the boundary extraction step and instead calculate distances between all mask foreground elements, using the pixel/voxel centers as query points. The authors of EvaluateSegmentation even proposed several optimization strategies to improve computational efficiency, such as excluding intersecting elements, since their distances are always zero.
Plastimatch is the only tool that returns both foreground- and boundary-based calculations, while the other tools support only boundary-based calculations. The most frequently employed boundary extraction method, used by Anima, MedPy, MetricsReloaded, MISeval, MONAI, and Plastimatch, involves morphological erosion using an 8-square-connectivity structuring element. In contrast, seg-metrics uses a full-connectivity structuring element.
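To illustrate how the structuring element affects erosion-based boundary extraction, here is a minimal 2D sketch with scipy.ndimage (an illustrative reconstruction, not code from the listed tools):

```python
import numpy as np
from scipy import ndimage

# small hypothetical binary mask
mask = np.zeros((7, 9), dtype=bool)
mask[2:5, 2:7] = True

# cross-shaped (4-connectivity) and full 3x3 (8-connectivity) structuring elements
se_cross = ndimage.generate_binary_structure(2, 1)
se_full = ndimage.generate_binary_structure(2, 2)

# boundary = foreground pixels removed by one erosion step;
# a different structuring element can yield a different boundary set
boundary_cross = mask & ~ndimage.binary_erosion(mask, structure=se_cross)
boundary_full = mask & ~ndimage.binary_erosion(mask, structure=se_full)
```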
Conversely, Google DeepMind and pymia employ a strategy in which the image grid is shifted by half a pixel/voxel size, and they explicitly calculate boundary element sizes, i.e., the lengths of line segments in 2D and the areas of surface elements in 3D. As previously explained, EvaluateSegmentation computes distances solely for non-overlapping elements, while SimpleITK computes distances for all foreground elements.
In contrast to all 11 open-source tools, our reference implementation MeshMetrics adopts a mesh-based boundary extraction strategy, using discrete flying edges in 2D and discrete marching cubes in 3D.
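For intuition, the following sketch shows how a boundary mesh can be extracted from a binary mask with VTK's discrete marching cubes; this is a simplified stand-in for MeshMetrics' internals, using a synthetic mask and unit spacing:

```python
import numpy as np
import vtk
from vtk.util import numpy_support

# hypothetical 3D binary mask (z, y, x) containing a foreground cube
mask = np.zeros((20, 20, 20), dtype=np.uint8)
mask[5:15, 5:15, 5:15] = 1

# wrap the numpy array as vtkImageData (VTK expects x to vary fastest)
img = vtk.vtkImageData()
img.SetDimensions(mask.shape[2], mask.shape[1], mask.shape[0])
img.SetSpacing(1.0, 1.0, 1.0)  # physical voxel size, e.g. in mm
img.GetPointData().SetScalars(numpy_support.numpy_to_vtk(mask.ravel(), deep=True))

# extract the boundary surface of label 1 as a triangle mesh
dmc = vtk.vtkDiscreteMarchingCubes()
dmc.SetInputData(img)
dmc.GenerateValues(1, 1, 1)  # one contour, for label value 1
dmc.Update()
surface = dmc.GetOutput()  # vtkPolyData
print(surface.GetNumberOfCells(), "boundary triangles")
```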
Overview of the boundary extraction methods used by the 11 open-source tools and our reference implementation, MeshMetrics. All methods are demonstrated using 2D examples but can be seamlessly extended to 3D.
To quantitatively evaluate the combined effect of variations in mathematical definitions and boundary extraction methods on the resulting metric scores, we designed a series of experiments. These experiments assess the accuracy of the 11 open-source tools for distance-based metrics computation and include an analysis of edge-case handling and computational efficiency.
In contrast to counting metrics, distance-based metrics rely on distance calculations and thus require the image grid to be defined in physical distance units, i.e., with pixel size in 2D or voxel size in 3D expressed in, e.g., millimeters.
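In practice, this means the pixel/voxel spacing must be read from the image header and respected during computation; a small SimpleITK example (reusing the file from the usage example above):

```python
import SimpleITK as sitk

img = sitk.ReadImage("data/example_3d_ref_mask.nii.gz")
spacing = img.GetSpacing()  # physical voxel size, e.g. (0.5, 0.5, 2.0) mm

# a gap of 3 voxels along the z-axis corresponds to 3 * spacing[2] mm;
# a tool that ignores spacing implicitly reports distances in voxel units,
# which is wrong for any anisotropic grid
print(f"3 voxels along z = {3 * spacing[2]} mm")
```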
The experiments followed this procedure (see diagram below): highly accurate meshes were first generated from the original segmentation masks and rasterized (voxelized) to three different voxel sizes; the rasterized masks were then used as inputs to the 11 open-source tools, while the corresponding meshes were used as inputs to MeshMetrics.
For metrics that depend on user-defined parameters (e.g., percentile $p$ for HD$_p$, or tolerance $\tau$ for NSD and BIoU), identical parameter values were used across all tools to ensure a fair comparison.
Experimental design for distance-based metrics computation using the 11 open-source tools and our reference implementation, MeshMetrics. Meshes are first generated from the original segmentation masks, followed by rasterization (voxelization) to three different voxel sizes. The rasterized (voxelized) masks are then used as inputs to the open-source tools, while corresponding highly accurate meshes are generated and used as inputs to MeshMetrics.
Comparing different implementations for individual organs at risk (OARs) would be challenging due to the disparity of the results. Therefore, we focus on analyzing the deviations between each open-source tool and MeshMetrics, defined as:

$$\Delta = m_{\text{tool}} - m_{\text{MeshMetrics}},$$

where $m$ denotes the score of a given distance-based metric.
Distance-based metrics can be divided into two groups:
- Absolute metrics (HD, HD$_p$, MASD, and ASSD): these are measured in metric units (millimeters in our case) and have no upper bound; a lower score reflects greater similarity between two masks.
- Relative metrics (NSD and BIoU): these are unitless and bounded between 0 and 1; a higher score reflects greater similarity between two masks.
In this analysis, positive $\Delta$ values therefore indicate over-pessimistic estimates for absolute metrics but over-optimistic estimates for relative metrics, and vice versa for negative $\Delta$ values.
Although both over- and under-estimation are undesirable, over-optimistic estimates are particularly concerning, as they may lead to incorrect conclusions, especially when compared to metric scores reported in existing studies.
The differences ($\Delta$) in distance-based metric scores between each of the 11 open-source tools and our reference implementation MeshMetrics, shown as boxplots.
The results presented in the boxplot figure above reveal that for HD, the mean deviations are close to zero across all tools except MISeval, which highlights the general agreement on its mathematical definition and is supported by mostly non-significant statistical comparisons (see our full paper for details). For HD95, the results are more scattered due to larger differences in its mathematical definition. Particularly for large and tubular OARs, Plastimatch produced the greatest outliers of up to −115 mm, because it averages the directed percentile calculations rather than taking the maximum of both directed percentiles, which generates over-optimistic results. Similarly, over-optimistic performance is observed for MedPy and seg-metrics, both of which compute the percentile on the union of both distance sets, which generally produces smaller metric values when coupled with their boundary extraction methods.
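To see why these definitional choices matter, here is a small NumPy sketch with synthetic directed distance sets; the three aggregation strategies named above generally produce three different HD95 values from identical inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical directed distance sets between two boundaries
d_a2b = rng.gamma(2.0, 1.0, 500)  # distances from boundary A to B
d_b2a = rng.gamma(2.0, 2.0, 800)  # distances from boundary B to A

p = 95
# maximum of the two directed percentiles (the most common definition)
hd95_max = max(np.percentile(d_a2b, p), np.percentile(d_b2a, p))
# percentile of the union of both distance sets (e.g. MedPy-style)
hd95_union = np.percentile(np.concatenate([d_a2b, d_b2a]), p)
# average of the two directed percentiles (e.g. Plastimatch-style)
hd95_avg = 0.5 * (np.percentile(d_a2b, p) + np.percentile(d_b2a, p))
print(hd95_max, hd95_union, hd95_avg)  # three generally different values
```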
Although EvaluateSegmentation also computes the percentile on the union of both distance sets, it uses a different boundary extraction method that shifts the distribution toward higher distances and results in over-pessimistic performance. While MONAI and MetricsReloaded appear to be the most accurate for isotropic grids, they exhibit several outliers and perform worse for the anisotropic grid. Notably, Google DeepMind is not only very close to these tools in terms of mean deviation for isotropic grids but also exhibits greater precision, outperforming all other tools in both accuracy and precision for the anisotropic grid.
These findings align well with our conceptual analysis, as tools relying on the point-based definition (i.e., assuming a uniform distribution of query points) performed significantly worse than Google DeepMind and pymia, which account for boundary element sizes. Interestingly, low mean deviations are observed for MASD and ASSD across all open-source tools, as averaging over large sets of distances is statistically more stable and tends to conceal variations in metrics implementation. However, it is crucial to also consider the range (min/max) of deviations, which further emphasizes the need for unified mathematical definitions.
For example, the comparison of Google DeepMind and MetricsReloaded clearly demonstrates that the former is more consistent. Particularly for ASSD, these discrepancies are even more pronounced for the anisotropic grid. Although NSD is supported by four open-source tools, only two distinct calculations are employed: a point-based method by MetricsReloaded and MONAI, and a mesh-based method by Google DeepMind and pymia, with identical metric scores within each pair due to the same boundary extraction method. However, concerning outliers ranging from −16.1%pt to 34.8%pt are consistently observed across OAR categories for both calculations, causing statistically significant differences against MeshMetrics and among the tools.
As Google DeepMind, MetricsReloaded, and MONAI are popular tools judging by their GitHub star counts, questions arise about the accuracy of existing studies that apply them. For MetricsReloaded and MONAI, the outliers can partly be attributed to not accounting for boundary element sizes, leading to the unjustified assumption of uniformly distributed query points. Surprisingly, even Google DeepMind, the original NSD implementation, shows signs of inaccuracy and imprecision, likely due to quantized distances coupled with the choice of $\tau$. As MeshMetrics uses implicit distance calculations, it is considerably less susceptible to quantization.
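The effect of quantized distances is easy to see on a toy example: when distances snap to grid steps, points can flip across the tolerance $\tau$ (the values below are synthetic):

```python
import numpy as np

tau = 2.0
# hypothetical exact point-to-boundary distances (mm)
d_exact = np.array([1.7, 1.9, 2.1, 2.3, 2.6])
# the same distances quantized to a 1 mm grid, as with a discrete distance map
d_quant = np.round(d_exact)

nsd_exact = (d_exact <= tau).mean()  # 0.4
nsd_quant = (d_quant <= tau).mean()  # 0.8 -- two points flipped across tau
```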
These empirical findings highlight that, among all distance-based metrics, NSD is the most sensitive to distance and boundary calculations. Since BIoU is a recently proposed metric, it is currently implemented only by MetricsReloaded, which provides valid results only for an isotropic grid of unit size due to a flaw in its code. However, even in this case, there are large and statistically significant deviations from MeshMetrics, ranging from −15.5%pt to 35.5%pt, which can be attributed to quantized distances coupled with the choice of $\tau$. We therefore suggest exercising caution when interpreting NSD and BIoU scores.
In conclusion, we would first like to acknowledge the authors of the 11 open-source tools for their valuable contributions. The outcomes of our study should not be regarded as criticism, but rather as a constructive step towards proper metrics implementation and usage. Based on our detailed conceptual and quantitative analyses, we propose several recommendations for distance-based metrics computation, detailed in our full paper.
We therefore hope that Metrics Revolutions, with its groundbreaking insights, will raise community awareness and understanding of the computational principles behind distance-based metrics, and encourage a more careful interpretation of segmentation results in both existing and future studies.
As our findings suggest, studies that evaluated segmentation using any of the 11 open-source tools may need to be revisited for implementation-related errors, and we are eager to contribute to this effort by offering MeshMetrics as a reference implementation of distance-based metrics for biomedical image segmentation.
This study was supported by the Slovenian Research and Innovation Agency (ARIS) under projects No. J2-1732, J2-4453, J2-50067 and P2-0232, and by the European Union Horizon project ARTILLERY under grant agreement No. 101080983.
This is a summarized version of our full paper, which includes a more in-depth analysis of computational efficiency and statistical testing. The full paper is available here.