Page 103 - AIH-1-3


Artificial Intelligence in Health ISM: A new multi-view space-learning model

be defined for the UMAP embedding of single-cell data, whereby a higher resolution leads to a higher number of clusters. In addition, the subtle differences between some cell types from one family can be smoothed out if the dataset contains transcriptionally distinct cell types from multiple families, as is the case with immune cells for the Signature 915 dataset.

Latent space methods require that the rank of the factorization be determined in advance. ISM benefits from the advantages of the NMF and NTF workflow components, that is, the choice of the correct rank is less critical than with other methods (we will come back to this point in the Results [Section 3] and Discussion [Section 4] sections). This allows us to set the rank to the number of known classes, even if we expect some redundancy in the latent factors, for instance, due to the proximity of certain digits in the first dataset.

The dimension of the ISM embedding space must also be determined during the discovery step. A natural choice is the dimension of the latent space since both spaces are merged at the end of the ISM workflow. Nevertheless, by examining the approximation error for an embedding dimension in the neighborhood of the chosen rank, it is possible to further optimize the ISM representation.

The rank for PCA, MVMDS, GFA, and MOFA+ is set by inspecting the scree plot of the variance ratio.

The analysis of the Signature 915 dataset also examines the biological relevance of the distance between clusters in each latent multi-view space. Of the five datasets analyzed in this article, only the Signature 915 dataset is a 3D array; therefore, NTF is also directly applied to this particular dataset.

Detailed analysis steps are provided in Workflow 3.

Workflow 3. Analysis steps

Input: 2D map projection of the data transformation in the latent space.
Output: Cluster purity index.
1: Perform K-means with k equal to the number of known classes;
2: For each cluster, identify the main class related to the cluster, that is, the class corresponding to the majority of observations in the cluster;
3: Merge contiguous clusters that refer to the same class or ignore them if not contiguous;
4: for each cluster do
5: p1 = proportion of the main class in relation to all elements in the cluster;
   p2 = proportion of the main class in cluster c in relation to all elements of the same class;
6: If p2 < 0.5 then
7: Disregard cluster as the main class does not constitute an absolute majority in relation to all elements of the same class;
8: Else
9: p = p1 × p2 = purity corrected for cluster representativity for the main class;
10: end for
11: Calculate the global purity = sum of corrected purities over all clusters, divided by the number of known classes;

2.3. Implementation

Scikit-learn [39] was used for K-means, ARI, NMI, MDS, and PCA. The mvlearn (https://pypi.org/project/mvlearn/) package was used for MVMDS. NMF and NTF were performed with the package adnmtf (https://pypi.org/project/adnmtf/). ISM was implemented in Python and was invoked from a Jupyter Python notebook available on the Advestis GitHub (https://github.com/Advestis). GFA was performed with the Python package gfa-python (https://github.com/mladv15/gfa-python). MOFA+ was performed with the Python package mofapy2 (https://github.com/bioFAM/mofapy2). Matplotlib (https://matplotlib.org/stable/tutorials/pyplot.html) was used to create the clustering figures. Treemaps were obtained with the Graph Builder platform from JMP® (Version 17.2.0. SAS Institute Inc., USA). The distinctipy package (https://pypi.org/project/distinctipy/) was used to generate colors that are visually distinct from one another.

3. Results

We first present a synthesis of the calculated metrics across all datasets (Table 1) and provide some general observations. We then present more detailed results for each dataset.

3.1. Synthesis of calculated metrics over all datasets

Based on the average index across all seven indices, ISM ranks first in the UCI Digits, Signature 915, and Reuters datasets, while ILSM ranks first in the prokaryotic dataset and the TEA-seq multi-omic single-cell dataset (although very close to ISM for the latter dataset, 0.80 vs. 0.79, respectively). It is easy to explain why ILSM performed much better than ISM on the prokaryotic dataset (0.52 vs. 0.37, respectively): Since ISM first performs a global factorization over concatenated views (Unit 1 of Workflow 1), it tends to ignore the smallest views when the views are extremely unbalanced, as is the case in the prokaryotic dataset. However, when using ILSM, separate factorizations are applied to each view, and ISM itself is applied to transformed views of equal size. As a result, the original views with the smallest size are given equal weight. Among the criteria used, the proportion of classes retrieved, purity, and sparsity indices are the most discriminative. It is noteworthy that NMF performs as
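
The cluster purity index of Workflow 3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `global_purity` and its arguments are hypothetical, cluster assignments are taken as input rather than computed (step 1), and step 3 (merging contiguous clusters) is omitted because it depends on the geometry of the 2D map.

```python
from collections import Counter


def global_purity(class_labels, cluster_labels):
    """Simplified sketch of Workflow 3, steps 2 and 4-11 (hypothetical helper).

    class_labels   : known class of each observation
    cluster_labels : cluster assigned to each observation (e.g., by K-means)
    Step 3 (merging contiguous clusters) is NOT implemented here.
    """
    if len(class_labels) != len(cluster_labels):
        raise ValueError("class_labels and cluster_labels must have equal length")

    class_sizes = Counter(class_labels)  # number of elements per known class

    # Group the known classes of the observations by cluster.
    members = {}
    for cls, clu in zip(class_labels, cluster_labels):
        members.setdefault(clu, []).append(cls)

    total = 0.0
    for classes in members.values():
        # Step 2: main class = majority class within the cluster.
        main_class, main_count = Counter(classes).most_common(1)[0]
        # Step 5: p1 = share of the main class within the cluster,
        #         p2 = share of that class captured by the cluster.
        p1 = main_count / len(classes)
        p2 = main_count / class_sizes[main_class]
        # Steps 6-7: disregard clusters holding less than half of the class.
        if p2 < 0.5:
            continue
        # Step 9: purity corrected for cluster representativity.
        total += p1 * p2

    # Step 11: divide by the number of known classes.
    return total / len(class_sizes)
```

In practice, `cluster_labels` could be obtained from scikit-learn, for instance with `KMeans(n_clusters=k).fit_predict(X_2d)`, where `X_2d` stands for the 2D map projection and `k` for the number of known classes; both names are placeholders.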

Volume 1 Issue 3 (2024) 97 doi: 10.36922/aih.3427
