Artificial Intelligence in Health ISM: A new multi-view space-learning model
whose availability in powerful MATLAB, Python, or R packages ensures scalability, as will be shown in the results section (Section 3), and accessibility for the vast majority of the machine learning community. In addition to the NTF components, a view-mapping matrix is estimated to obtain an interpretable link between the dimensions of the latent space and the original attributes from each view. It is worth noting that there are some commonalities between ISM and the anchor-based approaches mentioned above, which are discussed further in the discussion section (Section 4).

The ISM belongs to the class of multi-view latent space representation methods, 13-23 which aim to capture underlying factors or concepts that characterize the data in the latent space while filtering out noise and redundancy. For MVC applications, performing cluster analysis in the latent space generally results in more accurate and consistent cluster partitioning. Notably, these approaches 24 allow newly collected data (i.e., data that are not part of the data used to train the model) to be embedded in the latent space, thus extending beyond the purpose of MVC. Some of the latent space representation methods generate NMF-based latent factors 21,23 using regularization parameters that ensure sparsity and consistency between model parameters across different views. The originality of ISM lies in its simple workflow involving NMF and NTF steps. As a result, ISM produces latent factors whose interpretation is greatly facilitated by the non-negativity of the attribute loadings that define them, since they cannot cancel each other out. The interpretability of latent factors is of critical importance if they are to be used by an investigator as a follow-up tool, for example, in a clinical trial comprising several surveys with heterogeneous content. 22

Finally, we show that embedding the views in a 3D array has broader implications in a number of areas, such as parallelization, federated computing, and distributed computing, further illustrating the scalability and versatility of ISM, which extends well beyond the scope of multi-view data analysis.

2. Data and methods

2.1. Data

Five datasets, all with labeled observations, are considered in this article. The labeling will be used for the evaluation of the clustering performance of ISM and other methods. Details of the five datasets are as follows:

(i) UCI Digits dataset: This dataset, available in the UCI machine learning repository (https://archive.ics.uci.edu/dataset/72/multiple+features), 25 contains six heterogeneous views of handwritten digits: 76 Fourier coefficients of the character shapes, 216 profile correlations, 64 Karhunen-Loève coefficients, 240 pixel averages of the images from 2 × 3 windows, 47 Zernike moments, and six morphological features. Each class of digits (0–9) contains 200 labeled examples.

(ii) Signature 915 data: This dataset is available in the GitHub repository (https://github.com/Advestis/adilsm/tree/main/examples/data) in the file “abis_915.csv.” It comprises expression data for 915 marker genes in four patients and 16 cell types. There are four views of the 915 gene markers (one view per patient), measured across 16 different cell types. 26

(iii) Reuters dataset: Available in the GitHub repository (https://github.com/mbrbic/Multi-view-LRSSC/tree/master/datasets) in the file “Reuters.mat,” the Reuters dataset contains features of documents in five different languages over a common set of six categories. 27 All documents are represented in the bag-of-words format. Each of the six classes contains 100 documents, resulting in a dataset of 600 documents. The word counts in each view are 21,526, 24,892, 34,121, 15,487, and 11,539 words, respectively.

(iv) Prokaryotic phyla dataset: Found in the GitHub repository (https://github.com/mbrbic/Multi-view-LRSSC/tree/master/datasets) in the file “prokaryotic.mat,” the prokaryotic phyla dataset contains 551 prokaryotic species described with heterogeneous multi-view data, 28 including textual data (438 features), proteome composition encoded as relative frequencies of amino acids (three features), and gene repertoire (393 features) encoded as presence/absence indicators of gene families in a genome. Each provided view contains the principal components explaining 90% of the variance. Each species in the dataset is labeled with its phylum, resulting in four unbalanced categories ranging from 35 to 313 species.

(v) TEA-seq multi-omic single-cell dataset: This dataset, available in the figshare repository (https://figshare.com/s/1b13e12f33e83fff7e0e) in the file “tea_preprocessed.h5mu,” consists of human peripheral blood mononuclear cells. It includes paired profiling of scRNA-seq (2,500 features), scATAC-seq (15,000 features), and surface proteins (46 features). As the dataset did not come with cell annotations, 29 an annotation was derived from the clustering of cells using MOFA+ 21 with 15 components, resulting in seven major cell types: CD4 effector and memory T cells, B cells, CD4+ naïve T cells, monocytes, CD8+ T cells, mucosal-associated invariant T (MAIT) cells, and natural killer (NK) cells.

Of note, the UCI Digits and Signature 915 datasets cover both aspects of sparsity (because the Signature 915
Volume 1 Issue 3 (2024) 91 doi: 10.36922/aih.3427

