Tumor Discovery AI uncovers tumor spatial organization
Visium technology, the radius-nearest neighbor mode is used to ensure that each spot is in proximity to approximately five neighbors. Conversely, for the 10× Xenium method, the k-nearest neighbor mode is employed to guarantee that each spot has precisely five neighbors.

After generating the feature and adjacency matrices, we utilize the VGAE module embedded with SGC ConvNets to learn latent embeddings through model training. This step is implemented in Python using PyTorch Geometric (PyG) [29]. The relevant hyperparameters are defined as follows: input channels are 3000 (representing the number of highly variable genes), hidden channels are 128, and output channels are 128. The learning rate is set to 1e-6, the number of epochs is 5000, the weight decay factor is 1e-4, gradient clipping is set at five, and the random seed is fixed at zero for all experiments. The model architecture comprises four hidden layers, and the activation function applied after each layer is the exponential linear unit function.

After generating the latent representations in the embedding space, we employ the K-means method as the downstream clustering approach for identifying spatial domains. The clustering is performed using the scikit-learn package. The number of clusters in K-means is set to the number of groups in the ground truth. The hardware utilized in this study includes an Intel(R) Core(TM) i9-12900F CPU at 2.40 GHz, 64 GB of memory, and a GeForce RTX 3090 Ti GPU. To run the GNN-based spatial clustering algorithms effectively, the number of spots should be below 20,000. Consequently, for the ST data from the 10× Xenium technology, we utilize a cropped subset of the data, ensuring it remains below this threshold. Given that this subset encompasses all genes and groups, it does not compromise the spatial clustering experiments.

2.5. Spatial clustering metrics

After acquiring the prediction labels using K-means, we employ the adjusted Rand index (ARI) to assess the similarity between these predicted labels and the ground truth. The ARI is a commonly utilized metric for evaluating clustering algorithms. It is calculated using the scikit-learn toolkit by comparing the two label vectors. Assuming that $P = \{P_1, P_2, \ldots, P_M\}$ and $G = \{G_1, G_2, \ldots, G_M\}$ represent the predicted and ground-truth label sets, the ARI is defined as follows (Equation VII):

$$
\mathrm{ARI} = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \left[ \sum_{i} \binom{n_{i\cdot}}{2} \sum_{j} \binom{n_{\cdot j}}{2} \right] \Big/ \binom{n}{2}}{\frac{1}{2} \left[ \sum_{i} \binom{n_{i\cdot}}{2} + \sum_{j} \binom{n_{\cdot j}}{2} \right] - \left[ \sum_{i} \binom{n_{i\cdot}}{2} \sum_{j} \binom{n_{\cdot j}}{2} \right] \Big/ \binom{n}{2}} \tag{VII}
$$

where c is the number of cell types, $n_{i\cdot}$ and $n_{\cdot j}$ denote the number of spots belonging to $P_i$ and $G_j$, respectively, $n_{ij}$ is the number of spots located in both $P_i$ and $G_j$, and $n$ is the total number of spots. A higher ARI indicates a greater similarity between the two groupings. The index equals one for identical partitions and is close to zero (possibly slightly negative) for random labelings.

If the ground truth for the ST data is not available, the Silhouette Coefficient (SC) score and Davies-Bouldin (DB) score are employed to assess the clustering performance. The SC score is computed from the mean intra-cluster distance and the mean nearest-cluster distance for the predicted labels, and ranges from minus one to one. It reflects how well separated the clusters are, and a higher SC score indicates better clustering. The DB score represents the average similarity of each cluster with its most similar cluster, and ranges from zero to positive infinity. A lower DB score is preferred. These metrics are calculated using the scikit-learn package.

3. Results and discussion

3.1. Evaluating the clustering accuracy on human dorsolateral pre-frontal cortex data

We named our proposed spatial clustering method VGAE_SGC. To demonstrate its accuracy and effectiveness, we compared this method with six benchmark approaches: BayesSpace [20], SpaGCN [27], STAGATE [15], SEDR [26], Scanpy [18], and DeepST [30]. The evaluation was performed on the ST data using sample 151673 from the human DLPFC dataset, and we calculated and compared the average ARI values across 12 samples. In the ground truth of sample 151673, there were seven marked groups (six cortical layers and one white matter).

The VGAE_SGC approach exhibited the highest average ARI of 0.542 (Figure 2A). In addition, STAGATE and DeepST achieved average ARI values exceeding 0.50. The ground truth of sample 151673 is illustrated in Figure 2B, and the spatial domains identified by VGAE_SGC closely align with this ground truth (ARI = 0.5253, Figure 2C). The spatial clustering results of the six compared methods are presented in Figure 2D, all with ARI values lower than 0.50 for this sample. Through this comparison with the ground truth, we established the clustering accuracy of our proposed method. Subsequently, we validated this method using tumor ST datasets to assess its spatial clustering capabilities.

3.2. Deciphering multiple regions of human breast cancer from low-resolution technology

The analysis involved seven distinct clustering methods applied to the human breast cancer dataset obtained from 10× Visium. This ST dataset, manually annotated using the SEDR package, comprises 20 regions encompassing four
Volume 3 Issue 1 (2024) 5 https://doi.org/10.36922/td.2049
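The two neighbor modes described in the methods (radius mode for 10× Visium, k-nearest neighbor mode for 10× Xenium) can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names `knn_edges` and `radius_edges`, the brute-force distance search, and the directed edge representation are all assumptions made for illustration.

```python
from math import dist


def knn_edges(coords, k=5):
    """Connect each spot to its k closest spots, so every spot has exactly k neighbors."""
    edges = set()
    for i, p in enumerate(coords):
        # Rank all other spots by Euclidean distance to spot i (brute force, O(n^2)).
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: dist(p, coords[j]),
        )
        edges.update((i, j) for j in others[:k])
    return edges


def radius_edges(coords, radius):
    """Connect each spot to every other spot within `radius`; neighbor counts may vary."""
    return {
        (i, j)
        for i, p in enumerate(coords)
        for j, q in enumerate(coords)
        if i != j and dist(p, q) <= radius
    }


# Hypothetical toy layout: a 3 x 3 grid of spots with unit spacing.
grid = [(x, y) for x in range(3) for y in range(3)]
knn = knn_edges(grid, k=3)
rad = radius_edges(grid, radius=1.0)
```

On the toy grid, the radius mode gives the central spot four neighbors and each corner spot only two, whereas the k-nearest neighbor mode gives every spot the same number of neighbors. This is the distinction the text draws between "approximately five" and "precisely five" neighbors.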

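Equation VII in Section 2.5 can be evaluated directly from two label vectors. The following is a minimal sketch using only the Python standard library. The study itself reports using scikit-learn, so the function name `adjusted_rand_index` and the handling of degenerate cases are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(pred, truth):
    """ARI between two label vectors, following the pair-counting form of Equation VII."""
    if len(pred) != len(truth):
        raise ValueError("label vectors must have the same length")
    n = len(pred)
    if n < 2:
        return 1.0  # assumption: treat trivially small inputs as perfect agreement

    n_ij = Counter(zip(pred, truth))  # spots in predicted cluster i and true group j
    n_i = Counter(pred)               # row sums n_i.
    n_j = Counter(truth)              # column sums n_.j

    sum_ij = sum(comb(v, 2) for v in n_ij.values())
    sum_i = sum(comb(v, 2) for v in n_i.values())
    sum_j = sum(comb(v, 2) for v in n_j.values())

    expected = sum_i * sum_j / comb(n, 2)  # chance-level pair agreement
    max_index = 0.5 * (sum_i + sum_j)
    if max_index == expected:
        return 1.0  # assumption: degenerate case, e.g. one cluster in both labelings
    return (sum_ij - expected) / (max_index - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial_match = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on how spots are grouped, relabeled but identical partitions give 1.0 (`same_partition`). The second example evaluates to 4/7, about 0.571, and the third to -0.5, which illustrates that the index can fall below zero when agreement is worse than chance.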

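The two label-free metrics of Section 2.5 can be sketched in the same way. This is an illustrative standard-library version under stated assumptions (Euclidean distance, a silhouette value of zero for singleton clusters, distinct cluster centroids). The helper names are hypothetical, and the study itself reports using scikit-learn.

```python
from math import dist


def _group(points, labels):
    """Collect the points of each cluster in a dict keyed by label."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    return clusters


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (range -1 to 1, higher is better)."""
    clusters = _group(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin_index(points, labels):
    """Mean over clusters of the worst similarity ratio (range 0 to infinity, lower is better)."""
    clusters = _group(points, labels)
    centroids = {
        lab: tuple(sum(axis) / len(members) for axis in zip(*members))
        for lab, members in clusters.items()
    }
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = {
        lab: sum(dist(p, centroids[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


# Hypothetical toy data: two tight, well-separated pairs of points.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
sc_good = silhouette_coefficient(pts, [0, 0, 1, 1])
sc_bad = silhouette_coefficient(pts, [0, 1, 0, 1])
db_good = davies_bouldin_index(pts, [0, 0, 1, 1])
```

On this toy data, the labeling that follows the two separated pairs yields a silhouette value close to one and a small DB value, while the labeling that mixes the pairs yields a negative silhouette value, matching the "higher SC, lower DB" reading given in the text.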