
Artificial Intelligence in Health                                     EBNA1 inhibitors against EBV in NPC



              Due to the involvement of EBNA1 in EBV's persistence and oncogenesis, we decided to deploy QSAR modeling to identify inhibitors targeting EBNA1. At present, QSAR applications in the search for EBNA1 inhibitors remain unexplored in the scientific literature. To bridge this gap, our research aims to identify potential compounds with inhibitory activity against EBNA1 using our QSAR models.

              2. Data and methods

              2.1. Dataset preparation

              We developed the QSAR models using the AID2381 dataset obtained from a study by Gianti et al.33 on molecular descriptors and fingerprints. All the compounds in the dataset demonstrated inhibitory activity toward EBNA1 in in vitro studies. The compounds in the database were experimentally evaluated using a fluorescence polarization assay and were shown to inhibit EBNA1 selectively. First, we split the dataset into a training set and an external test set at a ratio of approximately 4:1, yielding a training set of 34 compounds and a test set of nine compounds. The compounds in these two datasets were then featurized with chemical fingerprints using the PaDEL-Descriptor package; in total, 1024 chemical fingerprints were generated for each compound in both datasets. After conversion into chemical fingerprints, we cleaned the dataset by removing empty rows and columns. In addition, we extracted the bioactivity of the ligands in pIC50 format.

              2.2. Attribute selection

              We constructed the QSAR models using the Waikato Environment for Knowledge Analysis (WEKA) package.34 WEKA is a software package consisting of an extensive collection of machine learning algorithms for data mining and exploration.35 Before model construction, we performed attribute selection to identify the most relevant features for model construction.36 Selecting attributes involves two parts: attribute evaluation and a search method. The attribute evaluation assesses each attribute in relation to the output variable within the dataset. We applied two methods of attribute evaluation: CfsSubsetEval (CFS) and ClassifierSubsetEval (CSE).

              2.2.1. CFS

              This method evaluates the worth of a subset of attributes by considering each feature's predictive ability and the degree of redundancy between them. Subsets of features that are highly correlated with the class while having low intercorrelation are preferred.37 To select attributes, the attribute evaluator employs a search method, which systematically explores combinations of attributes within the dataset, aiming to identify a selection of preferred features. We used two search methods for CFS: best first (BF) and greedy stepwise (GS). The BF method searches the attribute space by greedy hill climbing augmented with a backtracking facility, while the GS method performs a greedy forward or backward search through the space of attribute subsets.38,39

              2.2.2. CSE

              This method uses an algorithm to estimate the "merit" of attributes.37 We used several algorithms for CSE to select the top attributes. For classification modeling, we employed the Naïve Bayes (NB), instance-based learner (IBK), J48 decision tree (J48), random forest (RF), and logistic regression (LR) algorithms. For regression modeling, we used the linear regression (LRE), simple linear regression (SLR), sequential minimal optimization (SMO) regression, IBK, and RF algorithms. We also employed the BF and GS search methods for CSE. For better visualization, the attribute selection process in this study is shown in Figure 1.

              2.3. Classification QSAR model

              After the attribute selection process, we built the classification models using the NB, IBK, J48, RF, and LR algorithms.

              2.3.1. Evaluation metrics for classification

              The performance of the classification models was evaluated using standard metrics, including precision, recall, F1 score, and accuracy. Precision evaluates the accuracy of positive predictions. It is calculated by dividing the number of correct positive predictions by the total number of positive predictions.40

              Precision = TP / (TP + FP)                                (I)

              where TP is true positive, and FP is false positive.
              The recall metric measures the proportion of actual positive observations predicted correctly. It is determined by dividing the number of correct positive predictions by the total number of actual positive instances.40

              Recall = TP / (TP + FN)                                   (II)

              where TP is true positive, and FN is false negative.
              The F1 score is the harmonic mean of precision and recall, providing a balanced measure of a model's performance:

              F1 score = (2 × precision × recall) / (precision + recall)  (III)
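The approximately 4:1 split of Section 2.1 can be sketched as follows. The function name and the seeded random shuffle are illustrative assumptions; the paper does not state how the split was randomized.

```python
import random

def split_dataset(compounds, test_ratio=0.2, seed=1):
    """Shuffle and split compounds into a training set and an
    external test set at roughly (1 - test_ratio):test_ratio."""
    pool = list(compounds)
    random.Random(seed).shuffle(pool)  # seeded for reproducibility (assumption)
    n_test = round(len(pool) * test_ratio)
    return pool[n_test:], pool[:n_test]  # (training set, external test set)

# 43 compounds in total, as in the AID2381 set used here
train, test = split_dataset(range(43))
# len(train) == 34, len(test) == 9
```

With 43 compounds, a 0.2 test fraction reproduces the 34/9 split reported in the text.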
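The CFS criterion of Section 2.2.1 scores a subset of k features by Hall's merit heuristic, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature intercorrelation. A minimal sketch, assuming absolute Pearson correlation as the correlation measure (WEKA's CfsSubsetEval uses symmetrical uncertainty for nominal attributes) and a greedy stepwise forward search; the toy data are illustrative:

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def cfs_merit(subset, X, y):
    """Hall's CFS merit: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    k = len(subset)
    col = lambda j: [row[j] for row in X]
    r_cf = sum(abs(pearson(col(j), y)) for j in subset) / k
    if k == 1:
        return r_cf
    pairs = list(combinations(subset, 2))
    r_ff = sum(abs(pearson(col(i), col(j))) for i, j in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

# Greedy stepwise forward search over hypothetical binary fingerprints
X = [[0, 0, 1], [1, 1, 0], [1, 1, 1], [0, 0, 0], [1, 0, 1], [0, 1, 0]]
y = [0, 1, 1, 0, 1, 0]
selected, remaining = [], list(range(len(X[0])))
while remaining:
    best = max(remaining, key=lambda j: cfs_merit(selected + [j], X, y))
    # stop when adding the best remaining feature no longer improves merit
    if selected and cfs_merit(selected + [best], X, y) <= cfs_merit(selected, X, y):
        break
    selected.append(best)
    remaining.remove(best)
```

On this toy data the search keeps only the feature that perfectly tracks the class, since adding a redundant second feature lowers the merit.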
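Equations (I)–(III), together with accuracy, follow directly from the confusion-matrix counts; a minimal sketch with illustrative counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 score, and accuracy from
    confusion-matrix counts (Equations I-III)."""
    precision = tp / (tp + fp)                       # Eq. (I)
    recall = tp / (tp + fn)                          # Eq. (II)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (III)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# e.g., 8 true positives, 2 false positives, 2 false negatives, 8 true negatives
p, r, f1, acc = classification_metrics(tp=8, fp=2, fn=2, tn=8)
# all four equal 0.8 (to floating-point precision) for these counts
```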
             Volume 2 Issue 1 (2025)                                                doi: 10.36922/aih.4375