MeshMetrics is as sweet as M&M's—what better way to measure the precision of your segmentation model's performance? 🪅
The evaluation of segmentation performance is a common task in biomedical image analysis, with its importance emphasized in the recently released metrics selection guidelines and computing frameworks. To quantitatively evaluate the alignment of two segmentations, researchers commonly resort to counting metrics, such as the Dice similarity coefficient, or distance-based metrics, such as the Hausdorff distance, which are usually computed by publicly available open-source tools under the inherent assumption that these tools provide consistent results. In this study, we questioned this assumption and performed a systematic implementation analysis, along with quantitative experiments on real-world clinical data, to compare 11 open-source tools for distance-based metrics computation against our highly accurate mesh-based reference implementation. The results revealed statistically significant differences among all open-source tools, which is both surprising and concerning, as it calls the validity of existing studies into question. Besides identifying the main sources of variation, we also provide recommendations for distance-based metrics computation.
Overview of our study, highlighting the key steps and methods used in the analysis.
Below is a simple usage example of MeshMetrics for 3D segmentation masks. For more examples, check out the examples.ipynb notebook.
```bash
sudo apt update && sudo apt install -y libxrender1
git clone https://github.com/gasperpodobnik/MeshMetrics.git
pip install MeshMetrics/
```
```python
from pathlib import Path

import SimpleITK as sitk

from MeshMetrics import DistanceMetrics

data_dir = Path("data")

# initialize DistanceMetrics object
dist_metrics = DistanceMetrics()

# read binary segmentation masks
ref_sitk = sitk.ReadImage(data_dir / "example_3d_ref_mask.nii.gz")
pred_sitk = sitk.ReadImage(data_dir / "example_3d_pred_mask.nii.gz")

# set input masks; spacing must be passed explicitly only if
# both inputs are numpy arrays or vtk meshes
dist_metrics.set_input(ref=ref_sitk, pred=pred_sitk)

# Hausdorff Distance (HD); by default, the percentile is set to 100 (equivalent to HD)
hd100 = dist_metrics.hd()
# 95th percentile HD
hd95 = dist_metrics.hd(percentile=95)

# Mean Average Surface Distance (MASD)
masd = dist_metrics.masd()
# Average Symmetric Surface Distance (ASSD)
assd = dist_metrics.assd()

# Normalized Surface Distance (NSD) with tau=2
nsd2 = dist_metrics.nsd(tau=2)
# Boundary Intersection over Union (BIoU) with tau=2
biou2 = dist_metrics.biou(tau=2)
```
Overview of the 11 open-source tools analyzed in this study, indicating the supported distance-based metrics. For HD$_p$, a checkmark (✓) denotes support for any percentile, whereas a specific number indicates the implementation of a predefined percentile (e.g., the 95th).
In their straightforward application, all distance-based metrics compare two segmentations, i.e., the reference and the predicted segmentation, represented by their boundaries $\partial\mathcal{A}$ and $\partial\mathcal{B}$.

Assuming $d(a, \partial\mathcal{B}) = \min_{b \in \partial\mathcal{B}} \lVert a - b \rVert$ denotes the distance from a point $a \in \partial\mathcal{A}$ to the closest point on $\partial\mathcal{B}$, HD is defined as the maximum of the two directed Hausdorff distances:

$$\mathrm{HD} = \max \Big\{ \max_{a \in \partial\mathcal{A}} d(a, \partial\mathcal{B}),\; \max_{b \in \partial\mathcal{B}} d(b, \partial\mathcal{A}) \Big\},$$

while its more outlier-robust variant HD$_p$ replaces the maxima over the directed distances with their $p$-th percentile.
Often referred to as the average surface distance, MASD computes the mean of the two average distances of sets $\partial\mathcal{A}$ and $\partial\mathcal{B}$:

$$\mathrm{MASD} = \frac{1}{2} \left( \frac{1}{|\partial\mathcal{A}|} \sum_{a \in \partial\mathcal{A}} d(a, \partial\mathcal{B}) + \frac{1}{|\partial\mathcal{B}|} \sum_{b \in \partial\mathcal{B}} d(b, \partial\mathcal{A}) \right),$$

whereas the closely related ASSD averages all distances over the union of both distance sets:

$$\mathrm{ASSD} = \frac{\sum_{a \in \partial\mathcal{A}} d(a, \partial\mathcal{B}) + \sum_{b \in \partial\mathcal{B}} d(b, \partial\mathcal{A})}{|\partial\mathcal{A}| + |\partial\mathcal{B}|}.$$
On the other hand, NSD and BIoU are defined using the boundary representation of the segmentation masks. Here, the term "boundary" is agnostic to dimensionality, referring to a contour line in 2D and a surface in 3D. Also referred to as the normalized surface Dice, NSD quantifies the overlap of the two segmentation boundaries, counting boundary points as matched if they lie within an acceptable tolerance $\tau$ of the opposing boundary:

$$\mathrm{NSD} = \frac{|\{a \in \partial\mathcal{A} : d(a, \partial\mathcal{B}) \le \tau\}| + |\{b \in \partial\mathcal{B} : d(b, \partial\mathcal{A}) \le \tau\}|}{|\partial\mathcal{A}| + |\partial\mathcal{B}|}.$$
Finally, BIoU enhances sensitivity to boundary segmentation errors compared to plain IoU or DSC, which often saturate for bulk overlaps between two large segmentations. A region $\mathcal{A}_\tau$ is obtained by retaining only the part of mask $\mathcal{A}$ that lies within distance $\tau$ of its boundary $\partial\mathcal{A}$ (and analogously $\mathcal{B}_\tau$ for mask $\mathcal{B}$), so that BIoU is the IoU of these two boundary bands:

$$\mathrm{BIoU} = \frac{|\mathcal{A}_\tau \cap \mathcal{B}_\tau|}{|\mathcal{A}_\tau \cup \mathcal{B}_\tau|}.$$
Mathematical definitions of distance-based metrics as adopted by the point-based definition PointDef, different open-source tools, and our reference implementation MeshMetrics.
The mathematical definitions of HD, HD$_p$, MASD, and ASSD rely on distances between the point sets describing the boundaries of the two segmentations, which is why we refer to them as point-based definitions (PointDef).
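To make PointDef concrete, below is a minimal NumPy/SciPy sketch (ours, not taken from any of the analyzed tools) that computes the point-based metrics from two boundary point sets, assuming every query point carries equal weight:

```python
import numpy as np
from scipy.spatial import cKDTree

def point_based_metrics(pts_a, pts_b, p=95, tau=2.0):
    """PointDef metrics between two boundary point sets (N x 2 or N x 3 arrays)."""
    # directed distances: each point's distance to the nearest point of the other set
    d_a2b = cKDTree(pts_b).query(pts_a)[0]
    d_b2a = cKDTree(pts_a).query(pts_b)[0]

    hd = max(d_a2b.max(), d_b2a.max())
    # HD_p as the maximum of the two directed p-th percentiles
    hdp = max(np.percentile(d_a2b, p), np.percentile(d_b2a, p))
    # MASD: mean of the two directed averages; ASSD: average over the union
    masd = 0.5 * (d_a2b.mean() + d_b2a.mean())
    assd = (d_a2b.sum() + d_b2a.sum()) / (d_a2b.size + d_b2a.size)
    # NSD: fraction of boundary points within tolerance tau of the other boundary
    nsd = ((d_a2b <= tau).sum() + (d_b2a <= tau).sum()) / (d_a2b.size + d_b2a.size)
    return hd, hdp, masd, assd, nsd
```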
The point-based definitions of distance-based metrics inherently neglect the spatial distribution of points across the boundaries, assuming they are uniformly distributed. Ensuring that each point used for distance calculation (referred to as the query point) corresponds to a boundary region of uniform size is practically challenging, and failing to do so introduces bias into the metrics computation.
For unbiased computation, the size of the boundary element, i.e., the length of the line segment in 2D or the area of the (triangular) surface element in 3D, must be taken into account at each query point. The distances calculated between query points and the opposing boundary must then be weighted by the corresponding boundary element sizes, as supported by both theoretical perspectives and experimental results.
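As an illustration, here is a small NumPy sketch (again ours, not from any analyzed tool) of element-size-weighted aggregation; `d_a2b` and `size_a` are hypothetical directed distances and the sizes of the boundary elements they represent:

```python
import numpy as np

def weighted_percentile(distances, weights, p):
    """p-th percentile of distances, each weighted by its boundary element size."""
    order = np.argsort(distances)
    d, w = distances[order], weights[order]
    # fraction of the total boundary size covered up to each sorted distance
    cum = np.cumsum(w) / np.sum(w)
    idx = min(np.searchsorted(cum, p / 100.0), d.size - 1)
    return d[idx]

# hypothetical query-point distances and element sizes (e.g. segment lengths in 2D)
d_a2b = np.array([0.4, 1.2, 0.8, 2.5])
size_a = np.array([0.5, 2.0, 1.0, 0.25])

naive_mean = d_a2b.mean()                          # assumes uniformly sized elements
weighted_mean = np.average(d_a2b, weights=size_a)  # unbiased w.r.t. element sizes
weighted_p95 = weighted_percentile(d_a2b, size_a, 95)
```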
A detailed analysis of open-source tools revealed notable differences in metrics calculation strategies (see figure with equations above) and the applied boundary extraction methods (see figure below).
A critical pitfall in the implementation of distance-based metrics is boundary extraction, where the first issue is the foreground- vs. boundary-based calculation dilemma. Among the 11 open-source tools, EvaluateSegmentation and SimpleITK are the only two that omit the boundary extraction step and instead calculate distances between all mask foreground elements, using the pixel/voxel centers as query points. The authors of EvaluateSegmentation even proposed several optimization strategies to improve computational efficiency, such as excluding intersecting elements, since their distances are always zero.
Plastimatch is the only tool that returns both foreground- and boundary-based calculations, while the other tools support only boundary-based calculations. The most frequently employed boundary extraction method, used by Anima, MedPy, MetricsReloaded, MISeval, MONAI, and Plastimatch, involves morphological erosion using an 8-square-connectivity structuring element. In contrast, seg-metrics uses a full-connectivity structuring element.
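To illustrate how the structuring element affects erosion-based boundary extraction, here is a minimal 2D sketch with scipy.ndimage (an illustrative reconstruction, not code from the listed tools):

```python
import numpy as np
from scipy import ndimage

# small hypothetical binary mask
mask = np.zeros((7, 9), dtype=bool)
mask[2:5, 2:7] = True

# cross-shaped (4-connectivity) and full 3x3 (8-connectivity) structuring elements
se_cross = ndimage.generate_binary_structure(2, 1)
se_full = ndimage.generate_binary_structure(2, 2)

# boundary = foreground pixels removed by one erosion step;
# a different structuring element can yield a different boundary set
boundary_cross = mask & ~ndimage.binary_erosion(mask, structure=se_cross)
boundary_full = mask & ~ndimage.binary_erosion(mask, structure=se_full)
```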
Conversely, Google DeepMind and pymia employ a strategy in which the image grid is shifted by half a pixel/voxel size, and they explicitly calculate boundary element sizes, i.e., the lengths of line segments in 2D and the areas of surface elements in 3D. As previously explained, EvaluateSegmentation computes distances solely for non-overlapping elements, while SimpleITK computes distances for all foreground elements.
In contrast to all 11 open-source tools, our reference implementation MeshMetrics adopts a mesh-based boundary extraction strategy, using discrete flying edges in 2D and discrete marching cubes in 3D.
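For intuition, the following sketch shows how a boundary mesh can be extracted from a binary mask with VTK's discrete marching cubes; this is a simplified stand-in for MeshMetrics' internals, using a synthetic mask and unit spacing:

```python
import numpy as np
import vtk
from vtk.util import numpy_support

# hypothetical 3D binary mask (z, y, x) containing a foreground cube
mask = np.zeros((20, 20, 20), dtype=np.uint8)
mask[5:15, 5:15, 5:15] = 1

# wrap the numpy array as vtkImageData (VTK expects x to vary fastest)
img = vtk.vtkImageData()
img.SetDimensions(mask.shape[2], mask.shape[1], mask.shape[0])
img.SetSpacing(1.0, 1.0, 1.0)  # physical voxel size, e.g. in mm
img.GetPointData().SetScalars(numpy_support.numpy_to_vtk(mask.ravel(), deep=True))

# extract the boundary surface of label 1 as a triangle mesh
dmc = vtk.vtkDiscreteMarchingCubes()
dmc.SetInputData(img)
dmc.GenerateValues(1, 1, 1)  # one contour, for label value 1
dmc.Update()
surface = dmc.GetOutput()  # vtkPolyData
print(surface.GetNumberOfCells(), "boundary triangles")
```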
Overview of the boundary extraction methods used by the 11 open-source tools and our reference implementation, MeshMetrics. All methods are demonstrated using 2D examples but can be seamlessly extended to 3D.
To quantitatively evaluate the combined effect of variations in mathematical definitions and boundary extraction methods on the resulting metric scores, we designed a series of experiments. These experiments assess the accuracy of the 11 open-source tools for distance-based metrics computation and include an analysis of edge-case handling and computational efficiency.
In contrast to counting metrics, distance-based metrics rely on distance calculations and thus require the image grid to be defined in physical distance units, i.e., with pixel size in 2D or voxel size in 3D expressed in, e.g., millimeters.
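In practice, this means the pixel/voxel spacing must be read from the image header and respected during computation; a small SimpleITK example (reusing the file from the usage example above):

```python
import SimpleITK as sitk

img = sitk.ReadImage("data/example_3d_ref_mask.nii.gz")
spacing = img.GetSpacing()  # physical voxel size, e.g. (0.5, 0.5, 2.0) mm

# a gap of 3 voxels along the z-axis corresponds to 3 * spacing[2] mm;
# a tool that ignores spacing implicitly reports distances in voxel units,
# which is wrong for any anisotropic grid
print(f"3 voxels along z = {3 * spacing[2]} mm")
```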
The experiments followed this procedure (see diagram below): highly accurate meshes were first generated from the original segmentation masks and rasterized (voxelized) to three different voxel sizes; the rasterized masks were then used as inputs to the 11 open-source tools, while the corresponding meshes were used as inputs to MeshMetrics.
For metrics that depend on user-defined parameters (e.g., percentile $p$ for HD$_p$, or tolerance $\tau$ for NSD and BIoU), identical parameter values were used across all tools to ensure a fair comparison.
Experimental design for distance-based metrics computation using the 11 open-source tools and our reference implementation, MeshMetrics. Meshes are first generated from the original segmentation masks, followed by rasterization (voxelization) to three different voxel sizes. The rasterized (voxelized) masks are then used as inputs to the open-source tools, while corresponding highly accurate meshes are generated and used as inputs to MeshMetrics.
Comparing different implementations for individual organs at risk (OARs) would be challenging due to the disparity of the results. Therefore, we focus on analyzing the deviations between each open-source tool and MeshMetrics, defined as:

$$\Delta = m_{\text{tool}} - m_{\text{MeshMetrics}},$$

where $m$ denotes the score of a given distance-based metric.
Distance-based metrics can be divided into two groups:
- Absolute metrics (HD, HD$_p$, MASD, and ASSD): these are measured in metric units (millimeters in our case) and have no upper bound; a lower score reflects greater similarity between two masks.
- Relative metrics (NSD and BIoU): these are unitless and bounded between 0 and 1; a higher score reflects greater similarity between two masks.
In this analysis, positive $\Delta$ values therefore indicate over-pessimistic estimates for absolute metrics but over-optimistic estimates for relative metrics, and vice versa for negative $\Delta$ values.
Although both over- and under-estimation are undesirable, over-optimistic estimates are particularly concerning, as they may lead to incorrect conclusions, especially when compared to metric scores reported in existing studies.
The differences ($\Delta$) in distance-based metric scores between each of the 11 open-source tools and our reference implementation MeshMetrics, shown as boxplots.
The results presented in the boxplot figure above reveal that for HD, the mean deviations are close to zero across all tools except MISeval, which highlights the general agreement on its mathematical definition and is supported by mostly non-significant statistical comparisons (see our full paper for details). For HD95, the results are more scattered due to larger differences in its mathematical definition. Particularly for large and tubular OARs, Plastimatch produced the greatest outliers of up to −115 mm, because it averages the directed percentile calculations rather than taking the maximum of both directed percentiles, which generates over-optimistic results. Similarly, over-optimistic performance is observed for MedPy and seg-metrics, both of which compute the percentile on the union of both distance sets, which generally produces smaller metric values when coupled with their boundary extraction methods.
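To see why these definitional choices matter, here is a small NumPy sketch with synthetic directed distance sets; the three aggregation strategies named above generally produce three different HD95 values from identical inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical directed distance sets between two boundaries
d_a2b = rng.gamma(2.0, 1.0, 500)  # distances from boundary A to B
d_b2a = rng.gamma(2.0, 2.0, 800)  # distances from boundary B to A

p = 95
# maximum of the two directed percentiles (the most common definition)
hd95_max = max(np.percentile(d_a2b, p), np.percentile(d_b2a, p))
# percentile of the union of both distance sets (e.g. MedPy-style)
hd95_union = np.percentile(np.concatenate([d_a2b, d_b2a]), p)
# average of the two directed percentiles (e.g. Plastimatch-style)
hd95_avg = 0.5 * (np.percentile(d_a2b, p) + np.percentile(d_b2a, p))
print(hd95_max, hd95_union, hd95_avg)  # three generally different values
```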
Although EvaluateSegmentation also computes the percentile on the union of both distance sets, it uses a different boundary extraction method that shifts the distribution toward higher distances and results in over-pessimistic performance. While MONAI and MetricsReloaded appear to be the most accurate for isotropic grids, they exhibit several outliers and perform worse for the anisotropic grid. Notably, Google DeepMind is not only very close to these tools in terms of mean deviation for isotropic grids but also exhibits greater precision, outperforming all other tools in both accuracy and precision for the anisotropic grid.
These findings align well with our conceptual analysis, as tools relying on the point-based definition (i.e., assuming a uniform distribution of query points) performed significantly worse than Google DeepMind and pymia, which account for boundary element sizes. Interestingly, low mean deviations are observed for MASD and ASSD across all open-source tools, as averaging over large sets of distances is statistically more stable and tends to conceal variations in metrics implementation. However, it is crucial to also consider the range (min/max) of deviations, which further emphasizes the need for unified mathematical definitions.
For example, the comparison of Google DeepMind and MetricsReloaded clearly demonstrates that the former is more consistent. Particularly for ASSD, these discrepancies are even more pronounced for the anisotropic grid. Although NSD is supported by four open-source tools, only two distinct calculations are employed: a point-based method by MetricsReloaded and MONAI, and a mesh-based method by Google DeepMind and pymia, with identical metric scores within each pair due to the same boundary extraction method. However, concerning outliers ranging from −16.1%pt to 34.8%pt are consistently observed across OAR categories for both calculations, causing statistically significant differences against MeshMetrics and among the tools.
As Google DeepMind, MetricsReloaded, and MONAI are popular tools judging by their GitHub star counts, questions arise about the accuracy of existing studies that apply them. For MetricsReloaded and MONAI, the outliers can partly be attributed to not accounting for boundary element sizes, leading to the unjustified assumption of uniformly distributed query points. Surprisingly, even Google DeepMind, the original NSD implementation, shows signs of inaccuracy and imprecision, likely due to quantized distances coupled with the choice of $\tau$. As MeshMetrics uses implicit distance calculations, it is considerably less susceptible to quantization.
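The effect of quantized distances is easy to see on a toy example: when distances snap to grid steps, points can flip across the tolerance $\tau$ (the values below are synthetic):

```python
import numpy as np

tau = 2.0
# hypothetical exact point-to-boundary distances (mm)
d_exact = np.array([1.7, 1.9, 2.1, 2.3, 2.6])
# the same distances quantized to a 1 mm grid, as with a discrete distance map
d_quant = np.round(d_exact)

nsd_exact = (d_exact <= tau).mean()  # 0.4
nsd_quant = (d_quant <= tau).mean()  # 0.8 -- two points flipped across tau
```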
These empirical findings highlight that, among all distance-based metrics, NSD is the most sensitive to distance and boundary calculations. Since BIoU is a recently proposed metric, it is currently implemented only by MetricsReloaded, which provides valid results only for an isotropic grid of unit size due to a flaw in its code. However, even in this case, there are large and statistically significant deviations from MeshMetrics, ranging from −15.5%pt to 35.5%pt, which can be attributed to quantized distances coupled with the choice of $\tau$. We therefore suggest exercising caution when interpreting NSD and BIoU scores.
In conclusion, we would first like to acknowledge the authors of the 11 open-source tools for their valuable contributions. The outcomes of our study should not be regarded as criticism, but rather as a constructive step towards proper metrics implementation and usage. Based on our detailed conceptual and quantitative analyses, we propose several recommendations for distance-based metrics computation, detailed in our full paper.
We therefore hope that Metrics Revolutions, with its groundbreaking insights, will raise community awareness and understanding of the computational principles behind distance-based metrics, and encourage a more careful interpretation of segmentation results in both existing and future studies.
As our findings suggest, studies that evaluated segmentation using any of the 11 open-source tools may need to be revisited for implementation-related errors, and we are eager to contribute to this effort by offering MeshMetrics as a reference implementation of distance-based metrics for biomedical image segmentation.
This study was supported by the Slovenian Research and Innovation Agency (ARIS) under projects No. J2-1732, J2-4453, J2-50067 and P2-0232, and by the European Union Horizon project ARTILLERY under grant agreement No. 101080983.
This is a summarized version of our full paper, which includes a more in-depth analysis of computational efficiency and statistical testing. The full paper is available here.