Probabilistic spatial analysis in quantitative microscopy with uncertainty-aware cell detection using deep Bayesian regression

The investigation of biological systems with three-dimensional microscopy demands automatic cell identification methods that not only are accurate but also can imply the uncertainty in their predictions. The use of deep learning to regress density maps is a popular successful approach for extracting cell coordinates from local peaks in a postprocessing step, which then, however, hinders any meaningful probabilistic output. We propose a framework that can operate on large microscopy images and output probabilistic predictions (i) by integrating deep Bayesian learning for the regression of uncertainty-aware density maps, where peak detection algorithms generate cell proposals, and (ii) by learning a mapping from prediction proposals to a probabilistic space that accurately represents the chances of a successful prediction. Using these calibrated predictions, we propose a probabilistic spatial analysis with Monte Carlo sampling. We demonstrate this in a bone marrow dataset, where our proposed methods reveal spatial patterns that are otherwise undetectable.


Introduction
Advances in microscopy imaging currently enable the inspection of cells within biological tissues with astonishing resolution.In particular, fluorescence microscopy (FM) offers the possibility to separately examine distinct cellular structures stained with carefully selected markers.Visual inspection of such FM datasets in the context of large 3D images of intact tissues has revealed new information on the structures and mechanisms of different biological tissues [1].Effective quantitative description of cellular distributions in such large datasets relies on the accuracy of cell detection, which is the task of identifying the presence and location of cells in the images.Despite the advances in bioimage analysis suites [2,3,4], this task remains challenging [5,6] and is often addressed using burdensome and time-consuming manual annotations [1].
Cell detection has been a fundamental challenge in the spatial characterization of hematopoiesis, the process by which blood cells are created, which mostly occurs within bone marrow (BM) tissues [7].Hematopoiesis takes place within a structural framework or microenvironment provided by non-hematopoietic, so-called stromal cells, which critically modulate the behaviour of hematopoietic progenitors through specific interactions in restricted anatomical spaces denominated niches [8,9,10].A constantly increasing number of stromal cell subsets has been described to date and their accurate identification and detection within microscopy images is key in the investigation of their functional features and dynamics in health and disease.As for all complex multicellular tissues, attempts for understanding spatial patterns and cellular interactions that underlie organ function in the bone marrow rely critically on the quantification of distances between cellular coordinates as well as to larger supracellular anatomical structures of interest, such as blood vessels or bone surfaces [1,11,12,13].
A framework for quantification of spatial distributions was proposed in [14], with the use of point processes [15,16,17,18].This was demonstrated for the spatial analysis of two key stromal components within bone marrow sections: the sinusoids that form the microvasculature of the BM, and a pool of mesenchymal cells of fibroblastic morphology termed CXCL12-abundant reticular cells (CAR cells).The proposed spatial analysis was used to confirm the presumed preferential localization of CAR cells near sinusoids, for which no quantitative evidence had existed thus far.A number of descriptive statistics were reported in [14] as reference spatial descriptors of this biological system.However, although the analysis in that report provided well-established statistical methods for hypothesis testing, its strength was limited in that only the variability of the results across samples was taken into account, but the analysis did not reflect any potential mistakes in the cell detection method employed.In addition, the simplistic image processing techniques employed were time-consuming and inherently limited in accuracy.
Automatic cell detection is widely studied topic.A group of methods utilize instance segmentation algorithms (e.g.watershed [19,20,21,22]) that assign each pixel of the image a label (cell type or background) and an identifier unique to each of the cell instances.Although this can allow for describing cell morphology, most spatial analyses only require localization and hence center coordinates alone, thereby rendering detection algorithms more suitable for this purpose.While conventional blob detection methods [23,24] are still employed in biological studies due to their simplicity [6,25,26,27], supervised deep learning (DL) approaches are slowly establishing themselves as the state of the art.The advantage of detection over instance segmentation becomes even more prominent for DL methods, where the effectiveness is determined primarily by the availability of high-quality manual annotations, since cell coordinate annotations requires far less effort than dense annotation of individual pixels contained within cells.
Unlike other discriminative tasks, e.g.segmentation or classification, where end-to-end DL approaches have been widely adopted [28,29], object detection implies an underlying structure among output elements, making it a set prediction problem, for which the application of typical convolutional neural networks (CNNs) solutions is hampered [30].As a result, several alternatives have recently been proposed [31,32,33,34,35,36,37,38,39], where CNNs are employed for extracting informative features, which are then subsequently processed to infer object coordinates or bounding boxes.Although these approaches have achieved remarkable results in natural images such as ImageNet [40], they are highly dependent on post-processing steps that hinder their implementation and usability [41].
If the objects to be detected are of relatively similar size and/or their exact extents are irrelevant for the subsequent analysis to be conducted (e.g.cell counting or distances to each other or to surrounding structures), then bounding boxes are not required.In these cases, the shortcomings described above can be circumvented by using CNNs to regress density maps (DMs) generated from ground truth (GT) annotations of cell locations [42].Cell coordinates are then detected within these DMs as peaks, for instance with non-maximum suppression (NMS) algorithms.Such a framework then enables the adoption of typical CNN architectures commonly employed in segmentation (e.g.UNet [28]) for the detection of cells and other landmarks, as utilized in recent works [43,44,45,46,47,48,49].Nevertheless, DM regression with CNNs poses two major caveats: (i) detection of local peaks introduces additional hyperparameters that must be tuned for different applications, and (ii) detection outputs have no probabilistic interpretation, hence uncertainties in predictions cannot be known.
Calibration is the term that refers to how well a probabilistic output value indicates actual probability of prediction success.For instance, a model that identifies 100 cells all with a probability of 0.6 is said to be perfectly calibrated if 60 of these cells turn out to be correct according to some GT, and the other 40 are incorrect.Although typical segmentation CNNs already output values that can be interpreted as prediction probabilities, most DL approaches are known to be poorly calibrated, invalidating any such probabilistic interpretation [50].Bayesian approaches that have traditionally excelled at confidence calibration have been recently incorporated in common CNN architectures.These deep Bayesian learning methods have been shown to result in better calibration by accounting for two different types of uncertainties [51].Epistemic uncertainty is the lack of confidence of a model on its parameters and it may be reduced by leveraging additional labeled data.For estimating epistemic uncertainty, a technique called Monte Carlo (MC) Dropout has been demonstrated to be a good Bayesian approximator of Gaussian process models [52,53], and has later been incorporated into segmentation CNNs to take pixel-wise uncertainties into account while improving the quality of their predictions [54].This technique was inspired by dropout layers that were originally proposed as a regularization method to mitigate overfitting by randomly removing network parameters during training [55].MC dropout instead employs random removal after training during the inference phase, which enables epistemic uncertainty estimation.Aleatoric uncertainty is implicit to the data and cannot be avoided.Its estimation was proposed in [51] by utilizing a loss function that accounts for noise associated with observations while simultaneously predicting this as well.
Bayesian DL models on medical images have been employed to improve calibration of segmentation confidence [56] and to predict uncertainties as a proxy to estimate segmentation quality on previously unseen data [57].The advantages of simultaneously accounting for both epistemic and aleatoric uncertainties have been studied recently in FM images [58], indicating that uncertainty estimation is not only beneficial for superior segmentation results but also for an accurate estimation of prediction quality on previously unseen data.
In this work, we build on the aforementioned fields by designing a probabilistic DL framework for calibrated DM-based cell detection, demonstrated in the context of large 3D FM images.We validate our methods on the detection of CAR cells included in the BM dataset described in [14].In earlier work, the complex 3D morphology of BM and varying intensities of FM had hindered an automatic and accurate detection of such cells, which we overcome with the methods presented herein, while furthermore demonstrating probabilistic techniques for hypothesis testing.
Below we first give a short overview of conventional CNN-based cell detection using DMs, which we later improve with a comparative analysis for different tiling strategies and design choices.Next, we present a novel strategy with the integration of deep Bayesian methods for probabilistic classification of cell proposals, which leads to a substantial improvement in the calibration of detection results.We finally apply our cell detection method to an extended BM stroma dataset for the spatial characterization of CAR cells, revising our previously reported results.This study highlights the benefits of employing well calibrated models allowing for a probabilistic analysis that accounts for the confidence associated to each predicted cell.This leads to findings that would be overlooked by conventional deterministic analysis.

Cell detection in density maps
We study the problem of cell detection on a 3D FM dataset of BM stroma samples, as illustrated in our overview Fig. 1.These samples are decomposed into different patches of images acquired from CXCL12-GFP, a fluorescent protein employed for the visualization of CAR cells (details in Methods -Dataset and tiling strategy).In order to quantify the quality of detection methods, an assignment is needed between predicted cell coordinates and GT annotations, such that they can be designated as true positives (TP), false positives (FP), or false negatives (FN).Simple assignment strategies such as nearest neighbours may lead to bias in method evaluations, e.g. by assigning multiple detections to a single GT and vice versa -a common but undesirable scenario [30].We herein employ the Hungarian algorithm, which finds optimal one-to-one assignments based on all relative distances between predictions and GT, as illustrated in Supplementary Fig. 1.The resulting metrics are reported following a 4-fold cross-validation approach (more details in Methods -Evaluation details).
Since DL-based regression of DMs has previously been successful in the detection of cell-like objects, we adopt this framework to localize cells as peaks in predicted DMs, whose values are related to the likelihood of occurrences.However, we need to prevent detection of multiple coordinates within a neighbourhood where the typical spatial extent of a cell would physically prevent the presence of another.To this end, we employ NMS to iteratively detect cells (peaks) while avoiding those closer than average cell radius (details in Section Methods -Density map generation).Not to detect small prediction noise as local peaks, iterative NMS process is typically terminated with a stopping criterion when no peak is left above a DM threshold, which is an empirically set hyperparameter often without further analysis in the literature [42], thus we evaluate it for different DM design strategies in the next section.

Tiling strategy and density map design for large images
While DMs are used as a suitable framework for cell detection, their design specifics vary considerably.DMs encode cells as a local kernel with a peak at its centre and monotonically decreasing outwards.We specifically employ a 3D Gaussian kernel with a standard deviation σ (illustrated in Fig. 2a), the nearly flat peak of which is advantageous for learning from GT annotations that may not be precise.In this section, we propose the use of two DM design strategies, which are first evaluated on GT DMs for comparisons to remain agnostic to chosen DM prediction methods.
Traditional tiling methods for image segmentation utilize input image patches larger than expected prediction output in order to take into account the pixels lost in convolutional layers [59,60], which we refer to as convolutional margin (M conv ).In the context of detection, M conv often leads to detection of multiple duplicated cells in neighbouring patches, which need to be somehow combined, often leading to a lower detection precision (Fig. 2c).This effect is accentuated for higher values of σ and lower DM thresholds, since these factors increase the chances of localizing coordinates at patch borders, where the actual peak may lie within a neighbouring patch.The M peak tiling strategy we propose employs a smaller output patch (magenta), which is obtained by cropping the CNN output (yellow) with a small distance to ensure that duplicated detections falling within that margin are considered only in one of the patches (yellow arrow in Fig. 2b,c).
The second strategy (K max ) is designed to address a problem arising in the GT DM generation by using the process of kernel sum (K sum ), an approach common in earlier works [28,42].Adding kernels with K sum from close-by coordinates may erroneously result in single peaks.In addition, K sum artificially increases the dynamic range of DMs complicating CNN predictions and confounding peak density with prediction strength (nearby peaks leading to relatively larger GT DM responses).This problem is illustrated in Supplementary Fig. 2, and marked with a blue arrow in an example image in Fig. 2d.With K max , Gaussians are compounded by their maximum, as recently proposed in a different experimental setting [49], hence facilitating their separation by interpreting densities as an intersection rather than an accumulation, and eliminating the sensitivity of detection metrics to the spatial spread of a kernel, i.e. σ of Gaussian.
Careful selection of σ and detection threshold values can produce satisfactory detection results on GT DMs with either of the traditional M conv or K sum strategies.However, the insensitivity of our proposed method (K max & M peak ) to both parameters in the DM design make it more suitable for detection on predicted DMs, since application of trained CNNs is expected to produce DM values and shapes differently to those observed for GT DMs.We subsequently train a UNet model for prediction of such DMs with the results shown in Fig. 2e, where detection quality is observed to highly depend on the detection threshold.While too small or too high thresholds are consistently worse, a high variation is observed across evaluated test samples, contrary to the clearer patterns observed for detection on GT DMs.

Probabilistic classification of cell proposals
Application of NMS with a DM-based threshold hinders cell detection by (i) introducing a hyperparameter that, as shown in the previous section, highly influences the detection quality and (ii) preventing a probabilistic interpretation of the predictions that can be linked to their confidence.We hereby propose an alternative that addresses these drawbacks without compromising the detection quality.First, we generate numerous cell proposals by using NMS on predicted DMs without applying a threshold (effectively setting the threshold to 0), which guarantees high recall but low precision because all possible peaks are considered as cells, as shown in Fig. 2e.Second, we aggregate summary statistics from the neighbourhood of each proposal as a feature vector.These features act as input to a binary classifier that, in contrast to the deterministic threshold previously employed, assigns each of the proposals a probability of being a cell.Our method hence provides a mapping from the regressed DM to a probability space for each identified cell.
As the binary classifier, we initially propose the use of a Random Forest (RF) [61].RFs have been described to be robust against overfitting, produce well-calibrated results, and provide accurate predictions when input features contain representative descriptions of the output labels [61,58].This latter characteristic is particularly suitable for our task, as we expect our predicted DMs to display information for easy discrimination of cell instances.In Fig. 3a we compare this RF classifier added onto the above baseline DM estimator, denoted as UNet+RF, with the baseline UNet where the DM threshold is selected from a validation set (details in Section Methods -Evaluation details).The results show that our proposed strategy can effectively remove the need for the said detection threshold, without sacrificing detection accuracy, i.e. significantly affecting the F 1-score.
Brier score [62] and negative log-likelihood (NLL) are scoring metrics employed for the evaluation of calibration quality (lower is better) of probabilistic predictions [63,50].These scores indicate that UNet+RF leads to a slight yet not significant improvement in calibration over UNet.

Bayesian regression networks for calibrated detections
Motivated by the advantages in calibration with the use of different Bayesian DL techniques reported for different domains [51,57,56,58], we adapt our framework to take uncertainties in DM regression into account, which we denote as Bayes and describe in Methods -Neural Networks for density map regression.This method produces, in addition to a predicted DM, also spatial epistemic and aleatoric uncertainty maps, which can also be provided to the classifier (RF) as additional input.While the Bayesian detection accuracy (F1-score) does not differ significantly from the UNet results, either with or without a probabilistic classifier (RF), it is seen in Fig. 3a that our proposed classifier combination with a Bayesian network (Bayes+RF) substantially improves the calibration quality seen in Brier score and NLL.
Although we employ RF due to its advantages described in the previous section, we show in Fig. 3b that replacing it with other binary classifiers, namely a multilayer perceptron (MLP ) or a CNN, produces similar results.Omitting uncertainties and employing RF only on features created from the DM (RF w/DM ) does not entail deterioration of detection quality nor calibration, implying that for Bayesian networks the prediction DMs alone hold the information necessary for inferring prediction probability.However, as seen in Fig. 3a, the Bayesian output alone (Bayes) is not well calibrated, and has an even lower Brier score and NLL than UNet+RF.These results indicate that the use of our binary classifier is key to obtain accurate confidence estimations.But even if the processing of uncertainties is not fundamental, this classifier truly excels in calibration when employed on Bayesian predictions (Bayes+RF or RF w/DM ), where the Brier score and NLL are significantly lower than for UNet+RF.Overall, the best results in detection F 1-score and calibration (as measured by Brier score and NLL) are achieved with Bayes+RF on feature vectors including uncertainty statistics, a method setting which we use for further experiments.

Qualitative assessment of detection proposals
To better understand the behaviour of our presented deterministic and probabilistic models we include in Fig. 4 visualizations of example predictions on the test set, with relevant findings marked with

numbered arrows.
Examples marked with arrows 1 show the benefits of probabilistic models on several predictions that are mistaken in their deterministic counterpart, either by assigning probabilities below 1 to FP (1a) or above 0 for FN (1b).While in the best cases these predictions are turned respectively into TN (p < 0.5) or TP (p ≥ 0.5), simply accounting for their probabilities lowers the confidence of wrong predictions, a benefit confirmed by the lower Brier score and NLL of probabilistic models (Fig. 3a).
We also report a number of observed failure cases.Some FP marked with arrows 2 appear as TP given the input image, and these may indeed have been mistakenly overlooked during GT annotation.Interestingly, all such observed examples have a relatively low prediction confidence (0.5 ≤ p < 0.75), potentially explaining or emulating the annotator's oversight.FP marked with arrows 3 are due to the presence of an artery: a structure unobserved in the training set that displays high intensities even in the absence of cells.Other errors are due to the prediction of large density blobs, either producing multiple predictions for a single GT (arrow 4) or producing a local maximum too far from its respective GT (arrow 5).

Probabilistic spatial characterization of bone marrow stromal cells with calibrated cell detection
In this section we deploy our calibrated cell detection framework (illustrated in Fig. 5) to revise the quantification of CAR cell spatial distributions in the context of BM stroma presented in [14], where an analysis of densities of this cell type and its spatial associations were evaluated with deterministic conventional image processing (CIP) methods that limited the statistical power of the observations.The use of spatial point processes was proposed to describe patterns with cumulative distribution functions (CDFs) of empty space distances (ESDs) and distances from CAR cells, which we apply here (details in Methods -Pipeline for quantification of bone marrow stroma) to describe their distribution relative to sinusoids and arteries segmented in a separate study [59] for diaphysis and metaphysis, two distinct BM regions shown in Fig. 6a.We propose two different types of analysis, for which the probabilities from Bayes+RF predictions are treated differently.A deterministic result is extracted for each BM sample by analyzing only cells predicted with p ≥ 0.5, which contribute equally to the analysis regardless of their prediction confidence.Alternative results are extracted with a probabilistic strategy that repeats the experiment multiple times (herein 50), each time sampling cell proposals with a probability given their predicted p.The contribution of each individual cell to this probabilistic analysis hence becomes proportional to its associated confidence, which we hypothesize to statistically be more rigorous than the deterministic interpretation; as also evaluated with the experiments below.
A comparison of the observations reported with CIP in [14] with the deterministic and probabilistic analyses hereby presented is shown in Table 1.The deterministic interpretation of the outputs achieved with Bayes+RF already increases the statistical power of the previous quantification with CIP, as our presented method allows for automation, applicability to more samples, and better detection metrics.Application of our probabilistic analysis reveals that, relative to the deterministic results, there is an increase in the global density of cells and a decrease of them in the vicinity of both sinusoids and arteries.The probabilistic interpretation of ESD implies random sampling of the empirical values, hence the similarities to deterministic results in %Volume and between ESD CDFs.Although a sample-wise aggregation of the values is included for comparison with earlier works, we also aggregate them for the entire dataset as a final quantitative report emphasizing that only a probabilistic analysis accounts for the uncertainty of the predictions.
The application of deterministic analysis in the characterization of distances through their CDFs, shown in Fig. 6b-d, confirms the previously reported trend of the preferential localization of CAR cells in close proximity to sinusoids, both in diaphysis and metaphysis.Furthermore, the access to segmented arteries unavailable in [14] allows to describe a much subtler tendency of CAR cells to also be preferentially residing within the vicinity of these specialized vascular structures.The application of the probabilistic analysis permits the extraction of these curves as confidence intervals (envelopes) reflecting the confidence in results when taking into account prediction probabilities.Strikingly, the deterministic curves lie outside these envelopes for large spatial distance ranges.This suggests that a probabilistic analysis is not only beneficial in accounting for uncertainty, but also in potentially revealing patterns only visible by considering confidence associated to the model's predictions.

Discussion
In this study we propose a DL-based cell detection framework designed for its application on 3D FM datasets that addresses several problems in previous DM regression methods for detection.Additionally, we introduce a method for well-calibrated probabilistic predictions that can be incorporated in powerful statistical analysis frameworks.
We first argue the importance of adapting existing tiling strategies to take a supplementary margin into account (M peak ), without which any detection results are inherently flawed, even upon perfect DM predictions, due to their merging at patch boundaries.Compounding annotations as K max , a technique that, to the best of our knowledge, had not been adopted in cell detection to date, is also found to be key for achieving accurate results without the influence of σ in the DM design.When predicting DMs with CNNs, the density threshold employed for proposals becomes a major limitation, as it may lead to unexpectedly inferior detection results if not diligently analyzed.Removing this threshold hyperparameter without compromising the detection quality is by itself a fundamental advantage of our probabilistic detection approach.Indeed, accounting for model uncertainties with deep Bayesian methods results in the additional benefit of better calibrated predictions, as evaluated by their superior Brier score and NLL.In such setting, we show that the classification of cell proposals is similarly effective with a number of different machine learning classifiers, suggesting that predicted Table 1: Analysis results on bone marrow stroma dataset as reported with CIP and the deterministic and probabilistic models proposed herein.Adjacency refers to distances smaller than 4 µm.Standard deviation (SD) is measured across samples in sample-wise rows, and across prediction replicates (n=50) in aggregated dataset rows.All results are shown as mean±SD.The reported detection metrics have a deterministic interpretation only, and are hence similar for both deterministic and probabilistic analyses.Nevertheless, the calibration metrics indicate that our probabilistic method estimates much better the chances of a result being accurate.More details on these metrics can be found in Methods -Evaluation details.DMs already pose a simple classification problem and that the key contribution resides in our proposed pipeline design rather than the specific classifier employed.A relevant question for future work is whether the number of features in the selected RF method can speed up the detection process without compromising the results.

Method
Although accounting for uncertainties in the regression CNN improved the calibration of the predictions, including them in the features for the probabilistic classifier did not significantly impact any evaluated metric.This observation may imply that DM values can somehow reflect uncertainty information that is subsequently exploited by the classifier, consistent with the findings in [51] that simply accounting for uncertainty during training of CNNs improves calibration of the results, without the need for post-processing them.However, uncertainties did not seem to capture information about specific unseen structures (Fig. 4) as has been the case in previous reports [51,58].Whether this is a limitation caused by our dataset, the chosen Bayesian techniques, or the use of DM as a proxy for detection is worth investigating in future research.In particular, although not applied herein due to their impracticable computational complexity, deep ensembles that have produced more accurate uncertainties in other tasks [64] may shed light on and provide improvements to our Bayesian approximations.
We believe that our probabilistic cell detection framework offers a paradigm-shift in analysis tasks.Although the reported throughput and detection metrics already favor the use of the deterministic interpretation of the outputs achieved with Bayes+RF as compared to the previously employed CIP method [14], it is the superior calibration metrics that justify the proposed probabilistic analysis as a superior option to the more common deterministic alternative.As we showcase with the numbers reported for BM stroma, a deterministic interpretation of the results does not contemplate mistakes in predictions.Instead, accounting for output confidences in a probabilistic analysis entails a weighting of the contribution from each prediction to the final results by its associated probability, which, in correctly calibrated models, corresponds to their likelihood of being accurate.The application of this analysis for the characterization of the spatial distribution of CAR cells within the BM has confirmed our previous results and corroborate the preferential localization of this cell type in close association to sinusoidal vessels reported in [14].However, the revised results presented herein not only provide a higher statistical power and additional confidence estimates, but also reveal substantial differences in specific parameters defining spatial distributions of CAR cells.For instance, the probabilistic interpretation indicates that the amount of CAR cells adjacent to sinusoids is 15.42±13.18%and 14.61±9.86%lower than in the deterministic results for the diaphysis and metaphysis In this way, in addition to the robust hypothesis testing techniques already employed in [14], we hereby take into account variability in our cell detection predictions.In fact, this latter variability is estimated from the uncertainty implicit to the method (epistemic) and the uncertainty arising from the observations (aleatoric), such that we account for all possible sources of error in our analysis.Therefore, we trust that the proposed cell detection and probabilistic analysis strategies will become the basis for future studies of spatial distributions.
The effect of a probabilistic analysis in a segmentation setting (e.g.sinusoids or arteries) will be a relevant question worth investigating in future work.However, the sampling strategy proposed herein for detection coordinates would not be suitable for representing uncertainty for segmentation tasks, i.e. sampling pixels cannot simulate ambiguities, e.g. in extents along different dimensions, morphology, etc.Consequently, novel methods will need to be devised to establish connections between uncertainty maps and a consistent probabilistic analysis.
Although instance segmentation methods can be employed for similar analyses [65,66,67], detection frameworks have the benefit that manual labeling of coordinates is substantially faster than individual pixels, especially in 3D data.Furthermore, while other detection methods employed for the prediction of bounding boxes consist of multi-stage approaches that require several post-processing steps [41], we employ a DM regression approach that, in its standard form, only requires a thresholdbased NMS [42].With our contributions, such a threshold hyperparameter is further eliminated with two separate supervised models that constitute an end-to-end framework without compromising the detection accuracy.The additional benefit of our method predicting well-calibrated probabilistic cell proposals is shown to offer a more comprehensive analysis enabling confidence intervals for any hypothesis testing, which we believe will be the base of future image-based quantification in different FM and other imaging datasets.

Dataset and tiling strategy
We evaluate our detection strategies on BM samples from the dataset presented in [14].The CXCL12-GFP channel that expresses high intensities for CAR cells is employed as input image.Cells are manually annotated by their central coordinates in 7 different samples with the properties in Supplementary Table 1.
Since large FM volumes cannot fit in the memory of GPUs required for fast application of CNNs, we decompose each sample into a number of 3D patches following a previously reported tiling strategy for 2D images [59].This method ensures that output patches, which are smaller than input patches upon application of CNNs, can be seamlessly reconstructed.In summary, Gaussian normalization (subtracting the mean and dividing by SD) is first applied to all samples, before resizing them to the isotropic resolution of 1 µm/voxel and zero-padding with a margin l pad .Input patches x ∈ R l in are then extracted with an overlap of l overlap between them.Output patches y ∈ R lout are generated as GT DMs from the coordinates annotated within x as detailed in next subsection for supervised training of CNNs.l out is determined by the pixels subtracted from l in by convolution operations within the regression CNN.
A final output y tile ∈ R l out tile is obtained from y depending on the tiling strategy employed.The M conv strategy uses y tile = y, and M peak subtracts a supplementary margin of 4 µm to avoid duplicated predictions in neighbour patches.l overlap is calculated such that the resulting y tile are adjacent to each other, except at the sample border, where an additional overlap is permitted to ensure completion of the entire sample volume.The sizes employed for each of these strategies are reported in Supplementary Table 2. l in is selected to create input patches that are as big as possible while fitting in the GPU memory and being divisible at the different downsampling layers by accounting for the pixels lost due to the corresponding convolutional layers.Smaller l in results in a higher ratio of borders in the dataset, which can negatively affect the detection results when employing M conv tiling.However, as shown in Fig. 2b, the proposed M peak strategy employed in the rest of the experiments is invariant to such border artifacts, and hence l in may only impact the computational speed.
For the training and evaluation of models, 20% of the within the complete dataset D are split as a separate test set D test where the different methods are evaluated.Of the remaining samples (D\D test ), 80% are randomly selected for inclusion in a training set D train employed for the training of DM regression models, and 20% in a validation set D val where different hyperparameters of these models are evaluated.This separation into training and validation from D\D test is repeated with the same ratio to create a parallel split Dtrain and Dval , which are employed distinctly in the training of binary classifiers, as explained in the next sections.

Density map generation
Let C ∈ R Nc×3 be the annotated GT locations within an image x, where N c is the number of annotations.Let each location be denoted as c ∈ R 3 , and G σ the Gaussian function be defined as: Compounding G σ using the K sum strategy commonly employed in previous works results in a DM y with the value at each pixel k ∈ R 3 calculated as: considering points only within a radius l g =16 µm for computational efficiency, since points further away will have an insignificant contribution.
In the K max strategy, the different G σ associated to each of the coordinates is compounded by their maximum: Coordinates are localized in the resulting DMs by application of NMS with a minimum distance between peaks of 4 µm.This value is chosen as the average cell radius, which was also observed in the training data to be smaller than the minimum distance between any GT annotation points.

Neural Networks for density map regression
The UNet model f UNet employed herein is based on the one proposed in [28], which we adapt to 3D by replacing 2D with 3D convolutional layers, with less feature channels in order to compensate for the capacity increase related to the use of 3D kernels.In addition, we include residual connections [29] for each convolutional block.We define a convolutional block with q nodes as: where ReLu is the Rectified Linear Unit [68], a is an activation at any level of the CNN, and h q is a 3D convolutional layer with a 3 × 3 × 3 kernel and q channels.Denoting 3D max pooling layers with 2 × 2 × 2 kernel as MP and upsampling layers with the same kernel as UP, we build UNet composed of the following layers: C 16 MP C 32 MP C 64 UP C 32 UP C 16 h 1 .Skip connections are added to connect blocks with the same resolution as described in [28].This UNet is trained with L2 loss on predictions ŷ = f UNet (x) as: The Bayesian UNet f Bayes employs the same underlying architecture as f UNet , but different training and inference strategies, following the design choices justified in [58].The last layer is changed from h 1 to h 2 to take aleatoric uncertainty u a into account, which is calculated simultaneously with the DM as [ŷ, u a ] = f Bayes (x).A ReLu activation applied to u a to ensure positive values.The training loss function then becomes: In addition, MC dropout is used to take into account epistemic uncertainty u e by randomly (with a uniform probability of 0.2) deleting some of the convolutional layers h except the last one [52].With the output being stochastic, 50 samples [ŷ t , u t a ] are drawn at inference to calculate the final ŷ.For this random sampling, u a is used as the respective detection mean, and u e as the pixel-wise SD across the 50 samples ŷt .
All CNNs in this work are implemented with TensorFlow v2.3 [69] and deployed on an NVIDIA GeForce GTX TITAN X GPU with 12 GB of VRAM.A batch size of 4 is used, which is the maximum size that fits the GPU memory available.All models were trained with Adam optimizer [70] with a learning rate of 10 −3 .

Probabilistic classification of cell proposals
For the application of probabilistic classifiers, Nc cell proposals Ĉ ∈ R Nc×3 are generated from DMs by application of a threshold-free (setting it to 0) NMS.A feature vector v c is then generated by extracting summary statistics for volumes with different sizes (4, 8, 16 and 32 µm 3 ) centered on each of the ĉ ∈ Ĉ for the available predictions (DM, u a , and u e ).The summary statistics were heuristically selected as follows: • 5 percentiles for each of the volumes considered, which are taken uniformly in the range from the 1 st to the 99 th .
• The ratio of voxels above 5 different thresholds t, selected uniformly in ranges that differ for each of the volumes considered: t ∈ [1, 1.5] for DM, t ∈ [1, 10] for u a , and t ∈ [1, 0.2] for u e .
Binary classification models g are trained on the features v c to learn whether each proposal corresponds to a cell.Following the dataset notation in Section Methods -Dataset and tiling strategy, RF classifiers (g RF ) are trained with 128 trees and gini criterion on Dtrain ∪ Dval .The validation set is not employed separately because no hyperparameters need to be tuned.MLP classifiers (g MLP ) are designed with 4 hidden layers (with 50, 50, 20 and 20 nodes respectively) with ReLu activations.They are trained for 200 epochs with NLL loss and Adam optimizer on Dtrain .The epoch with best accuracy on the validation set Dval is selected.
To provide and compare with a classifier alternative that is independent of the hand-crafted features v c used above, we also used a CNN classifier g CNN that acts directly on a 40 × 40 × 40 voxels region surrounding each proposal ĉ.These regions are concatenated as channels from the three output volumes: DM, u a and u e .Following the notation in Methods -Neural Networks for density map regression, g CNN is implemented with a number of convolutional blocks C q (a) = ReLu (h q (ReLu (h q (a)))) and fully connected layers FC q with q channels, each as C 32 C 64 C 128 FC 2048 FC 2048 FC 1 .Similarly to MLP above, this CNN classifier is also trained on Dtrain .We used Adam optimizer with a learning rate of 10 −4 , and NLL loss for 20 epochs.The epoch with the best accuracy on Dval is then selected.
These binary models output values p in the range [0, 1] that can be interpreted as probabilities.Therefore, we consider as positive predictions those with p ≥ 0.5.In the case of baseline thresholdbased detections (UNet and Bayes in Fig. 3), the threshold is selected on the validation set D val employed in the training of the regression CNNs.Hence, we also study the effect of selecting a threshold probability p from the validation set Dval for the RF binary classifier employed.Supplementary Fig. 3 shows that in this case, RF (denoted as RF(p ← Dval )) performs similarly in detection F 1-score to the RF employed in the rest of this work, which, as explained above, does not employ a separate validation set.

Evaluation details
Coordinates from patches are reconstructed in the context of samples by reverting the tiling strategy in Methods -Dataset and tiling strategy prior to their evaluation.Subsequently, a pairing function λ is defined that matches each GT coordinate c to a predicted coordinate ĉ = λ(c) by optimizing the linear sum assignment of their respective Euclidean distances: This equation is solved with the Hungarian algorithm [72], with some example cases illustrated in Supplementary Fig. 1.
A TP is considered for pairs c−λ(c) 2 ≤ t match , where t match is a distance threshold.We set t match to 4 µm, the average expected cell radius and the same parameter employed as the minimum distance separating peaks in NMS (see above in Section Methods -Density map generation).Assignments c − λ(c) 2 > t match produce a FP for the prediction and a FN for the GT.A FN is counted when no prediction is assigned to a GT (λ(c) = ∅), and a FP when no GT is assigned to a prediction (λ −1 (ĉ) = ∅).From this strategy, we calculate the Precision, Recall, and F 1-score as: Precision = T P T P + F P , The calibration of the classification models is evaluated by taking into account the probabilities of predicted coordinates p(ĉ) estimated by the classifiers described in the previous section.Coordinates predicted by deterministic models are assigned p(ĉ) = 1 for TP and FP, and p(ĉ) = p(λ(c)) = 0 for FN.Although GT coordinates do not strictly have an associated probability, we denote them as such for convenience.Therefore, annotated GT coordinates have p(c) = 1, and FP are counted as p(c) = p(λ −1 (ĉ)) = 0. Calibration is assessed with the Brier score and NLL as follows:

Pipeline for quantification of bone marrow stroma
The quantification of CAR cells described in Section Results -Probabilistic spatial characterization of bone marrow stromal cells with calibrated cell detection is performed on an extension of the dataset provided in [14], which includes samples that the CIP methods employed in the original work could not successfully analyze.Note that manually labeled samples employed for the evaluation of the detection methods form a subset of this data.This dataset was also employed in [59] using CNNs to segment sinusoids, arteries, and a tissue mask, which we utilize in our current analysis.Sinusoids and arteries are used herein as reference structures to evaluate potential spatial distributions of CAR cells relative to them.The tissue mask defines the voxels inside the specimen where must occur, such that out-of-tissue empty regions are ignored in spatial analysis.
As proposed in [14], cellular distributions are analyzed by employing CDFs of distances to segmented structures.The ESD aggregates distances from all background voxels to their respective closest foreground voxel (herein sinusoids or arteries) and its CDF describes the proportion of volume left for cells to distribute at different distances.We compare distances from predicted cell locations to a segmented structure using the ESD of the latter in CDF form.This comparison allows for statistical testing of patterns in the localization of cells relative to the ESD.Namely, an attraction pattern is suggested when the CDF of distances from cells is above that of a baseline ESD; and conversely an avoidance pattern, when it is below.
Analysis is performed on a subset of cells C obtained from the predictions Ĉ proposed by a trained Bayes+RF.In a deterministic analysis, all predictions with probabilities of at least 0.5 are considered: C = {ĉ | p(ĉ) ≥ 0.5, ∀ĉ ∈ Ĉ}.
To incorporate the effect of cell predictions with different probabilities, the above analysis is repeated multiple times (herein T = 50), which according to a Monte-Carlo interpretation of simulations [15] permits for testing differences with a significance level α = 2/(T + 1) = 0.04.Each probabilistic ESD is calculated by random sampling from the original ESD of a number voxels w calculated from a Poisson distribution as w ∼ Pois( Ñc ), as described in [15].CDF envelopes in this setting are created as in [14] by (i) estimating the probability density function of the distances with Gaussian kernels and Scott method [73], (ii) evaluating the CDF at the same distance values for all replicates, and (iii) employing the maximum and minimum values at each evaluated distance as the upper and lower envelope lines respectively.Su pplementary Figure 3: Comparison of probabilistic models with RF using different criteria for for thresholding the probability p.Under the probabilistic interpretation, p can be set to 0.5, which is named RF(p=0.5).For comparison with threshold-based detection methods (UNet or Bayes) where this threshold t is selected from the validation set (t ← D val ), we also consider a RF where the threshold t is selected from its corresponding validation set as RF (p ← Dval ).These results confirm that detection results are not significantly different with either of these methods, although the probabilistic interpretation of p = 0.5 produces slightly superior results.

Figure 1 :
Figure 1: Illustration of our pipeline for cell detection.Large 3D FM samples are decomposed in patches (green frame) subsequently processed by a CNN that regresses an output DM.A peak detection algorithm is applied on the DM to obtain the locations of cells.The resulting coordinates are then reconstructed within the original large sample volume based on the limits of the output patch (red frame) where they are contained.

Figure 2 :
Figure 2: Analysis of DM design alternatives for regression-based cell detection.(a) Example slice of DMs created with Gaussian kernels with different σ values.White rings in the input image denote GT out-of-plane coordinate with a ring size proportional to proximity to the displayed slice, i.e. smaller rings correspond to annotations in more distant slices.For larger σ, out-of-plane annotations are seen to project onto the slice DM. (b-e) Comparison of different detection algorithmic choices and thresholds for the application of NMS on GT DMs (green) and DMs predicted with UNet (red).Yellow lines on the images represent the patch size of CNN output, equivalently the tiling boundary in the Mconv strategy.Magenta lines depict the patch size for the M peak strategy.GT annotations are represented as rings (•), and predictions as crosses (×).Both • and × are green when they form a TP pair, or red for FP or FN.The yellow arrows mark an example where M peak helps, and the blue arrows another where Kmax helps.The graphs show the detection metrics as the mean (solid line) and 95% confidence interval across samples (n=7) in the test set across detection thresholds, for each tested experimental setting and σ value.For the upper bound GT cases in (b-d), one can see in the results (e.g. in F1-score) that the proposed (Kmax&M peak ) method is invariant to large ranges of σ and threshold parametrization.

Figure 3 :
Figure 3: Prediction metrics for the different proposed cell detection models.(a) Comparison between UNet and its Bayesian implementation Bayes when selecting the DM threshold from the validation set, and the respective models when adding a probabilistic binary classifier RF (+RF ).Brier score and NLL evaluate probabilistic predictions (lower is better).Since threshold-based models are deterministic, they are evaluated by assigning a fixed probability of 1 to positive predictions (TP or FP), and 0 to FN (shown in stripes to indicate this evaluation convention).(b) Comparison of different probabilistic classifier strategies added to Bayesian UNet.All models are evaluated by 4-fold cross-validation on the test set (n=7).Significance is indicated with p-value≤0.05(*).

Figure 4 :
Figure4: Visualization of predictions with different proposed models on example 2D slices of the test set.GT coordinates are represented by rings (•) and predictions by crosses (×), with a size proportional to their proximity to the displayed slice.In deterministic models (UNet and Bayes) coordinates are coloured green when they have a positive match (TP) and red when they do not (FP for ×, FN for •), according to Hungarian matching.In probabilistic models (UNet+RF and Bayes+RF) the predictions are coloured according to their associated probability (colorbar at the right), whereas GT annotations follow the deterministic colouring scheme, considering positive predictions as those with p ≥ 0.5.Arrows mark specific examples discussed in the text according to their accompanying number.

Figure 5 :
Figure5: Representation of our proposed method for probabilistic cell detection.Input patches are processed by a Bayesian CNN (Bayes) to regress an output DM and its corresponding uncertainties.These outputs are employed by a classifier (RF ) to assign a probability or confidence to a large number of cells proposed by application of peak detection on the DM.The resulting proposals can be sampled according to their assigned probability.

Figure 6 :
Figure 6: Quantification of spatial distributions in the BM stroma with our methods producing calibrated cell detections.(a) Example visualizations of BM diaphysis and metaphysis regions, with the intensities employed for detection of CAR cells (top) and the computational representation (segmentation and detection) of the different structures employed in the analysis (bottom).(b) CDF of distances to sinusoids and arteries for both BM regions, zoomed-in in c to highlight the confidence ranges of the probabilistic analysis.Dashed lines are employed for deterministic results.Probabilistic results are represented as envelopes containing the maximum and minimum CDF values for each distance (n=50).(d) Cross section of the CDF at a distance of 4 µm (dashed blue line in c) to emphasize the differences between probabilistic measures capturing a confidence range (depicted as |-| ) with their deterministic counterpart that calculates a single value (depicted as ×).
p(c) • log p λ(c) + 1 − p(c) • log 1 − p λ(c) .All models were trained for 200 epochs on the training set D train and the epoch with the highest F 1-score on the validation set D val is selected for posterior evaluation on the test set D test .Results are calculated for each sample in the test set, following a 4-fold cross-validation strategy.