
Artificial Intelligence in Health                                 ISM: A new multi-view space-learning model



be defined for the UMAP embedding of single-cell data, whereby a higher resolution leads to a higher number of clusters. In addition, the subtle differences between some cell types from one family can be smoothed out if the dataset contains transcriptionally distinct cell types from multiple families, as is the case with immune cells for the Signature 915 dataset.

Latent space methods require that the rank of the factorization be determined in advance. ISM benefits from the advantages of the NMF and NTF workflow components, that is, the choice of the correct rank is less critical than with other methods (we will come back to this point in the results [Section 3] and discussion [Section 4] sections). This allows the rank to be set to the number of known classes, even if we expect some redundancy in the latent factors (for instance, due to the proximity of certain digits in the first dataset).

The dimension of the ISM embedding space must also be determined during the discovery step. A natural choice is the dimension of the latent space since both spaces are merged at the end of the ISM workflow. Nevertheless, by examining the approximation error for an embedding dimension in the neighborhood of the chosen rank, it is possible to further optimize the ISM representation.

The rank for PCA, MVMDS, GFA, and MOFA+ is set by inspecting the scree plot of the variance ratio.

The analysis of the Signature 915 dataset also examines the biological relevance of the distance between clusters in each latent multi-view space. Of the five datasets analyzed in this article, only the Signature 915 dataset is a 3D array; therefore, NTF is also directly applied to this particular dataset.

Detailed analysis steps are provided in Workflow 3.

Workflow 3. Analysis steps

Input: 2D map projection of the data transformation in the latent space.
Output: Cluster purity index.
 1: Perform K-means with k equal to the number of known classes;
 2: For each cluster, identify the main class related to the cluster, that is, the class corresponding to the majority of observations in the cluster;
 3: Merge contiguous clusters that refer to the same class or ignore them if not contiguous;
 4: for each cluster do
 5:   p₁ = proportion of the main class in relation to all elements in the cluster;
      p₂ = proportion of the main class in cluster c in relation to all elements of the same class;
 6:   If p₂ < 0.5 then
 7:     Disregard cluster as the main class does not constitute an absolute majority in relation to all elements of the same class;
 8:   Else
 9:     p = p₁ × p₂ = purity corrected for cluster representativity for the main class;
10: end for
11: Calculate the global purity = sum of corrected purities over all retained clusters, divided by the number of known classes;

2.3. Implementation

Scikit-learn³⁹ was used for K-means, ARI, NMI, MDS, and PCA. The mvlearn (https://pypi.org/project/mvlearn/) package was used for MVMDS. NMF and NTF were performed with the package adnmtf (https://pypi.org/project/adnmtf/). ISM was implemented in Python and was invoked from a Jupyter Python notebook available on the Advestis GitHub (https://github.com/Advestis). GFA was performed with the Python package gfa-python (https://github.com/mladv15/gfa-python). MOFA+ was performed with the Python package mofapy2 (https://github.com/bioFAM/mofapy2). Matplotlib (https://matplotlib.org/stable/tutorials/pyplot.html) was used to create the clustering figures. Treemaps were obtained with the Graph Builder platform from JMP® (Version 17.2.0, SAS Institute Inc., USA). The distinctipy package (https://pypi.org/project/distinctipy/) was used to generate colors that are visually distinct from one another.

3. Results

We first present a synthesis of the calculated metrics across all datasets (Table 1) and provide some general observations. We then present more detailed results for each dataset.

3.1. Synthesis of calculated metrics over all datasets

Based on the average index across all seven indices, ISM ranks first in the UCI Digits, Signature 915, and Reuters datasets, while ILSM ranks first in the prokaryotic dataset and the TEA-seq multi-omic single-cell dataset (although very close to ISM for the latter dataset, 0.80 vs. 0.79, respectively). It is easy to explain why ILSM performed much better than ISM on the prokaryotic dataset (0.52 vs. 0.37, respectively): since ISM first performs a global factorization over concatenated views (Unit 1 of Workflow 1), it tends to ignore the smallest views when they are extremely unbalanced, as is the case in the prokaryotic dataset. However, when using ILSM, separate factorizations are applied to each view, and ISM itself is applied to transformed views of equal size. As a result, the original views with the smallest size are given equal weight. Among the criteria used, the proportion of classes retrieved, purity, and sparsity indices are the most discriminative. It is noteworthy that NMF performs as
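As an illustration of the cluster purity index in Workflow 3, the following is a minimal Python sketch of steps 2 to 11 and not the authors' implementation. It assumes that the cluster labels have already been obtained (for example, from K-means on the 2D map, step 1). The function name `global_purity` and the parameter `min_p2` are hypothetical. The contiguity test of step 3 is not reproduced: as a simplifying assumption, all clusters sharing the same main class are merged.

```python
from collections import Counter, defaultdict


def global_purity(cluster_labels, class_labels, min_p2=0.5):
    """Sketch of the cluster purity index (Workflow 3, steps 2 to 11)."""
    class_sizes = Counter(class_labels)

    # Step 2: count the classes inside each cluster to find its main class.
    members = defaultdict(Counter)
    for cluster, cls in zip(cluster_labels, class_labels):
        members[cluster][cls] += 1

    # Step 3 (simplified, an assumption): merge every cluster sharing the
    # same main class, without testing contiguity on the 2D map.
    merged = defaultdict(Counter)
    for counts in members.values():
        main_class = counts.most_common(1)[0][0]
        merged[main_class].update(counts)

    total = 0.0
    for main_class, counts in merged.items():
        n_main = counts[main_class]
        p1 = n_main / sum(counts.values())     # purity within the cluster
        p2 = n_main / class_sizes[main_class]  # share of the class captured
        if p2 < min_p2:
            continue  # steps 6-7: cluster disregarded
        total += p1 * p2  # step 9: corrected purity
    # Step 11: divide by the number of known classes.
    return total / len(class_sizes)


# Toy example: three clusters, three known classes.
clusters = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
classes = ["a", "a", "a", "b", "b", "b", "b", "a", "c", "c"]
print(round(global_purity(clusters, classes), 3))
```

In this sketch, a class whose only candidate cluster is disregarded contributes zero to the sum while still counting in the denominator, so missing a class lowers the global purity.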

            Volume 1 Issue 3 (2024)                         97                               doi: 10.36922/aih.3427