
Tumor Discovery                                                       AI uncovers tumor spatial organization



Visium technology, the radius-nearest neighbor mode is used so that each spot has approximately five neighbors. Conversely, for the 10× Xenium method, the k-nearest neighbor mode is employed to guarantee that each spot has exactly five neighbors.

In Equation VII, $c$ is the number of cell types, $n$ is the total number of spots, $n_{i\cdot}$ and $n_{\cdot j}$ denote the numbers of spots belonging to $P_i$ and $G_j$, respectively, and $n_{ij}$ is the number of spots located in both $P_i$ and $G_j$. A higher ARI indicates greater similarity between the two groupings. The ARI equals one for identical groupings and is close to zero (and can be negative) when the agreement is no better than chance.
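To make the counts $n_{ij}$, $n_{i\cdot}$, and $n_{\cdot j}$ concrete, the following is a minimal, dependency-free sketch of the pair-counting form of the ARI. It is not the authors' code: the study computes the ARI with scikit-learn (which provides `adjusted_rand_score` for this purpose), and the function name `adjusted_rand_index` and the toy label vectors below are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(pred, truth):
    """ARI computed from the contingency counts n_ij, n_i., and n_.j."""
    if len(pred) != len(truth):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(pred), 2)
    if total_pairs == 0:
        return 1.0  # assumed convention for fewer than two spots

    n_ij = Counter(zip(pred, truth))  # spots in both P_i and G_j
    n_i = Counter(pred)               # spots in predicted cluster P_i
    n_j = Counter(truth)              # spots in ground-truth group G_j

    sum_ij = sum(comb(v, 2) for v in n_ij.values())
    sum_i = sum(comb(v, 2) for v in n_i.values())
    sum_j = sum(comb(v, 2) for v in n_j.values())

    expected = sum_i * sum_j / total_pairs  # chance-level agreement term
    max_index = 0.5 * (sum_i + sum_j)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster in both labelings
    return (sum_ij - expected) / (max_index - expected)


# Hypothetical toy labels for six spots (not data from the study).
truth = [0, 0, 0, 1, 1, 1]
pred_same = [1, 1, 1, 0, 0, 0]  # same partition, different label names
pred_off = [0, 0, 1, 1, 1, 1]   # one spot assigned to the wrong group

print(adjusted_rand_index(pred_same, truth))
print(adjusted_rand_index(pred_off, truth))
```

Because only the counts enter the formula, renaming the cluster labels leaves the score unchanged, which is why K-means cluster IDs can be compared directly against annotated layer names.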
After generating the feature and adjacency matrices, we utilize the VGAE module embedded with SGC ConvNets to learn latent embeddings through model training. This step is implemented in Python using PyTorch_pyG. The relevant hyperparameters are defined as follows: input channels are 3000 (representing the number of highly variable genes), hidden channels are 128, and output channels are 128. The learning rate is set to 1e-6, the number of epochs is 5000, the weight decay factor is 1e-4, gradient clipping is set at five, and the random seed is fixed at zero for all experiments. The model architecture comprises four hidden layers, and the activation function applied after each layer is the exponential linear unit function.

If the ground truth for the ST data is not available, the Silhouette Coefficient (SC) score and the Davies-Bouldin (DB) score²⁹ are employed to assess the clustering performance. The SC score is computed from the mean intra-cluster distance and the mean nearest-cluster distance for the predicted labels and ranges from minus one to one. It reflects how well separated the clusters are, and a higher SC score indicates better clustering. The DB score is the average similarity of each cluster with its most similar cluster and ranges from zero to positive infinity. A lower DB score is preferred. These metrics are calculated using the scikit-learn package.
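As a rough illustration of what the two label-free scores measure, the sketch below implements their standard definitions with Euclidean distance and no external dependencies. It is an assumption-laden teaching example rather than the study's code: the study uses scikit-learn (which provides `silhouette_score` and `davies_bouldin_score`), and the function names, the singleton-cluster convention, and the toy coordinates here are hypothetical.

```python
from math import dist


def _group(points, labels):
    """Collect points by cluster label."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    return clusters


def silhouette_coefficient(points, labels):
    """Mean over points of (b - a) / max(a, b)."""
    clusters = _group(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin(points, labels):
    """Mean over clusters of the worst (s_i + s_j) / d(c_i, c_j) ratio.

    Assumes no two cluster centroids coincide.
    """
    clusters = _group(points, labels)
    centroids = {
        lab: tuple(sum(axis) / len(members) for axis in zip(*members))
        for lab, members in clusters.items()
    }
    # s_i: mean distance of a cluster's points to its centroid
    scatter = {
        lab: sum(dist(p, centroids[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


# Hypothetical 2-D embeddings: two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = [0, 0, 0, 1, 1, 1]  # labels that match the two groups
bad = [0, 1, 0, 1, 0, 1]   # labels that mix the two groups

print(silhouette_coefficient(points, good), davies_bouldin(points, good))
print(silhouette_coefficient(points, bad), davies_bouldin(points, bad))
```

On this toy input the matching labels give the higher SC and the lower DB, which is the direction of preference described above.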
After generating the latent representations in the embedding space, we employ the K-means method as the downstream clustering approach for identifying spatial domains. The clustering is performed using the scikit-learn package. The number of clusters in K-means is set to the number of groups in the ground truth. The hardware utilized in this study includes an Intel(R) Core(TM) i9-12900F CPU at 2.40 GHz, 64 GB of memory, and a GeForce RTX 3090Ti GPU. To run the GNN-based spatial clustering algorithms effectively, the number of spots should be below 20,000. Consequently, for the ST data from the 10× Xenium technology, we utilize a cropped subset of the data that remains below this threshold. Given that this subset encompasses all genes and groups, it does not compromise the spatial clustering experiments.

2.5. Spatial clustering metrics

After acquiring the predicted labels using K-means, we employ the adjusted Rand index (ARI) to assess the similarity between these predicted labels and the ground truth. The ARI is a commonly utilized metric for evaluating clustering algorithms. It is calculated using the scikit-learn toolkit by comparing the two label vectors. Assuming that $P = \{P_1, P_2, \ldots, P_M\}$ and $G = \{G_1, G_2, \ldots, G_M\}$ represent the predicted and ground-truth label sets, the ARI is defined as follows (Equation VII):

$$
\mathrm{ARI} = \frac{\sum_{i,j}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{n_{i\cdot}}{2} + \sum_{j}\binom{n_{\cdot j}}{2}\right] - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}} \tag{VII}
$$

3. Results and discussion

3.1. Evaluating the clustering accuracy on human dorsolateral prefrontal cortex data

We named our proposed spatial clustering method VGAE_SGC. To demonstrate its accuracy and effectiveness, we compared this method with six benchmark approaches: BayesSpace,¹⁵ SpaGCN,¹⁸ STAGATE,²⁰ SEDR,²⁶ Scanpy,²⁷ and DeepST.³⁰ The evaluation was performed on the ST data using sample 151673 from the human DLPFC dataset, and we calculated and compared the average ARI values across 12 samples. In the ground truth of sample 151673, there were seven marked groups (six cortical layers and one white matter).

The VGAE_SGC approach exhibited the highest average ARI of 0.542 (Figure 2A). In addition, STAGATE and DeepST achieved ARI values exceeding 0.50. The ground truth of sample 151673 is illustrated in Figure 2B, and our spatial domain identification results closely align with this ground truth (ARI = 0.5253, Figure 2C). The spatial clustering results of the six compared methods are presented in Figure 2D, all with ARI values lower than 0.50 for this sample. Through this comparison with the ground truth, we established the clustering accuracy of our proposed method. Subsequently, we validated this method using tumor ST datasets to assess its spatial clustering capabilities.

3.2. Deciphering multiple regions of human breast cancer from low-resolution technology

The analysis involved seven distinct clustering methods applied to the human breast cancer dataset obtained from 10× Visium. This ST dataset, manually annotated using the SEDR package, comprises 20 regions encompassing four
            Volume 3 Issue 1 (2024)                         5                          https://doi.org/10.36922/td.2049