
Artificial Intelligence in Health                                 ISM: A new multi-view space-learning model



whose availability in powerful MATLAB, Python, or R packages ensures scalability, as will be shown in the results section (Section 3), and accessibility for the vast majority of the machine learning community. In addition to the NTF components, a view-mapping matrix is estimated to obtain an interpretable link between the dimensions of the latent space and the original attributes from each view. ISM shares some commonalities with the anchor-based approaches mentioned above; these are discussed further in Section 4.

ISM belongs to the class of multi-view latent space representation methods,13-23 which aim to capture underlying factors or concepts that characterize the data in the latent space while filtering out noise and redundancy. For MVC applications, performing cluster analysis in the latent space generally results in more accurate and consistent cluster partitioning.24 Notably, these approaches allow newly collected data (i.e., data that were not used to train the model) to be embedded in the latent space, thus extending beyond the purpose of MVC. Some of the latent space representation methods generate NMF-based latent factors21,23 using regularization parameters that ensure sparsity and consistency between model parameters across different views. The originality of ISM lies in its simple workflow involving NMF and NTF steps. As a result, ISM produces latent factors whose interpretation is greatly facilitated by the non-negativity of the attribute loadings that define them, since they cannot cancel each other out. The interpretability of latent factors is of critical importance if they are to be used by an investigator as a follow-up tool, for example, in a clinical trial comprising several surveys with heterogeneous content.

Finally, we show that embedding the views in a 3D array has broader implications in a number of areas, such as parallelization, federated computing, and distributed computing, further illustrating the scalability and versatility of ISM, which extends well beyond the scope of multi-view data analysis.

2. Data and methods

2.1. Data

Five datasets, all with labeled observations, are considered in this article. The labeling will be used for the evaluation of the clustering performance of ISM and other methods. Details of the five datasets are as follows:
(i) UCI Digits dataset: This dataset, available in the UCI machine learning repository25 (https://archive.ics.uci.edu/dataset/72/multiple+features), contains six heterogeneous views of handwritten digits: 76 Fourier coefficients of the character shapes, 216 profile correlations, 64 Karhunen-Loève coefficients, 240 pixel averages of the images from 2 × 3 windows, 47 Zernike moments, and six morphological features. Each class of digits (0-9) contains 200 labeled examples.
(ii) Signature 915 dataset: This dataset is available in the GitHub repository (https://github.com/Advestis/adilsm/tree/main/examples/data) in the file “abis_915.csv”. It comprises expression data for 915 marker genes in four patients and 16 cell types, giving four views of the 915 marker genes (one view per patient), each measured across the 16 cell types.26
(iii) Reuters dataset: Available in the GitHub repository (https://github.com/mbrbic/Multi-view-LRSSC/tree/master/datasets) in the file “Reuters.mat”, the Reuters dataset contains features of documents in five different languages over a common set of six categories.27 All documents are represented in the bag-of-words format. Each of the six classes contains 100 documents, resulting in a dataset of 600 documents. The numbers of words (features) in the five views are 21,526, 24,892, 34,121, 15,487, and 11,539, respectively.
(iv) Prokaryotic phyla dataset: Found in the GitHub repository (https://github.com/mbrbic/Multi-view-LRSSC/tree/master/datasets) in the file “prokaryotic.mat”, the prokaryotic phyla dataset contains 551 prokaryotic species described with heterogeneous multi-view data,28 including textual data (438 features), proteome composition encoded as relative frequencies of amino acids (three features), and gene repertoire (393 features) encoded as presence/absence indicators of gene families in a genome. Each provided view contains the principal components explaining 90% of the variance.22 Each species in the dataset is labeled with its phylum, resulting in four unbalanced categories ranging from 35 to 313 species.
(v) TEA-seq multi-omic single-cell dataset: This dataset, available in the figshare repository (https://figshare.com/s/1b13e12f33e83fff7e0e) in the file “tea_preprocessed.h5mu”, consists of human peripheral blood mononuclear cells. It includes paired profiling of scRNA-seq (2,500 features), scATAC-seq (15,000 features), and surface proteins (46 features).29 As the dataset did not come with cell annotations, an annotation was derived from the clustering of cells using MOFA+ with 15 components,21 resulting in seven major cell types: CD4 effector and memory T cells, B cells, CD4+ naïve T cells, monocytes, CD8+ T cells, mucosal-associated invariant T (MAIT) cells, and natural killer (NK) cells.

Of note, the UCI Digits and Signature 915 datasets cover both aspects of sparsity (because the Signature 915


            Volume 1 Issue 3 (2024)                         91                               doi: 10.36922/aih.3427