Artificial Intelligence in Health EBNA1 inhibitors against EBV in NPC
Due to the involvement of EBNA1 in EBV’s persistence and oncogenesis, we decided to deploy QSAR modeling to identify inhibitors targeting EBNA1. At present, QSAR applications in the search for EBNA1 inhibitors remain unexplored in the scientific literature. To bridge this gap, our research aims to identify potential compounds with inhibitory activities against EBNA1 using our QSAR models.

2. Data and methods

2.1. Dataset preparation

We developed the QSAR models using the AID2381 dataset obtained from a study by Gianti et al.,33 which we converted into molecular descriptors and fingerprints. All the compounds in the dataset demonstrated inhibitory activity toward EBNA1 in in vitro studies. The compounds in the database were experimentally evaluated using a fluorescence polarization assay and were shown to inhibit EBNA1 selectively. First, we split the dataset into a training set and an external test set at a ratio of approximately 4:1, yielding a training set of 34 compounds and a test set of nine compounds. The compounds from these two datasets were then featurized with chemical fingerprints using the PaDEL-Descriptor package. In total, 1024 chemical fingerprints were generated for each compound in both datasets. After conversion into chemical fingerprints, we cleaned the dataset by removing empty rows and columns. In addition, we extracted the bioactivity of the ligands in pIC50 format.

2.2. Attribute selection

We constructed the QSAR models using the Waikato Environment for Knowledge Analysis (WEKA) package.34 WEKA is a software suite consisting of an extensive collection of machine learning algorithms for data mining and exploration.35 Before model construction, we performed attribute selection to identify the most relevant features for model construction.36 There are two parts to selecting the attributes: attribute evaluation and the search method. The attribute evaluation assesses each attribute in relation to the output variable within the dataset. We applied two methods of attribute evaluation: CfsSubsetEval (CFS) and ClassifierSubsetEval (CSE).

2.2.1. CFS

This method evaluates the worth of a subset of attributes by considering each feature’s predictive ability and the degree of redundancy between the features. Subsets of features that are highly correlated with the class while having low intercorrelation are preferred.37 To select attributes, the attribute evaluator employs a search method, which systematically explores combinations of attributes within the dataset to identify a selection of preferred features. We used two search methods for CFS: best first (BF) and greedy stepwise (GS). The BF method searches the attribute space by greedy hill climbing augmented with a backtracking facility, while the GS method performs a greedy forward or backward search through the space of attribute subsets.38,39

2.2.2. CSE

This method uses an algorithm to estimate the “merit”37 of attributes. We used several algorithms for CSE to select the top attributes. For classification modeling, we employed the Naïve Bayes (NB), instance-based learner (IBK), J48 decision tree (J48), random forest (RF), and logistic regression (LR) algorithms. For regression modeling, we used the linear regression (LRE), simple linear regression (SLR), sequential minimal optimization (SMO) regression, IBK, and RF algorithms. We also employed the BF and GS search methods for CSE. For clarity, the attribute selection process used in this study is shown in Figure 1.

2.3. Classification QSAR model

After the attribute selection process, we built the classification models using the NB, IBK, J48, RF, and LR algorithms.

2.3.1. Evaluation metrics for classification

The performance of the classification models was evaluated using standard metrics, including precision, recall, F1 score, and accuracy. Precision evaluates the accuracy of the model’s positive predictions. It is calculated by dividing the number of correct positive predictions by the total number of positive predictions:40

Precision = TP / (TP + FP)  (I)

where TP is true positive and FP is false positive.

Recall measures the proportion of actual positive observations that are predicted correctly. It is determined by dividing the number of correct positive predictions by the total number of actual positive instances:40

Recall = TP / (TP + FN)  (II)

where TP is true positive and FN is false negative.

The F1 score is the harmonic mean of precision and recall. The formula for the F1 score, which provides a balanced measure of a model’s performance, is given as follows:

F1 score = 2 × (Precision × Recall) / (Precision + Recall)  (III)
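The three classification metrics can be expressed as a short script. This is an illustrative sketch of Equations (I)–(III) only, not part of the study's WEKA workflow; the function names and example counts are our own.

```python
# Illustrative sketch: Equations (I)-(III) computed from raw counts.
# TP = true positives, FP = false positives, FN = false negatives.

def precision(tp: int, fp: int) -> float:
    """Equation (I): correct positive predictions / all positive predictions."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Equation (II): correct positive predictions / all actual positives."""
    return tp / (tp + fn)

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Equation (III): harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical confusion-matrix counts: 8 TP, 2 FP, 2 FN
# gives precision = recall = 0.8, and therefore F1 = 0.8.
print(precision(8, 2), recall(8, 2), f1_score(8, 2, 2))
```

Because the F1 score is a harmonic mean, it is pulled toward the lower of precision and recall, which is why it is preferred over a plain average when the two disagree.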
Volume 2 Issue 1 (2025) 95 doi: 10.36922/aih.4375
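The CFS criterion with a greedy stepwise forward search, as described in Section 2.2.1, can be sketched as follows. This is a minimal illustration assuming precomputed feature–class and feature–feature correlations, using Hall's merit heuristic merit(S) = k·r_cf / sqrt(k + k(k−1)·r_ff); WEKA's CfsSubsetEval and GreedyStepwise are more elaborate, and the feature names and correlation values below are invented.

```python
# Sketch of CFS merit + greedy stepwise (forward) search, assuming
# correlations are already computed. Not WEKA's implementation.
from math import sqrt

def merit(subset, class_corr, feat_corr):
    """CFS merit of a feature subset (higher is better):
    rewards class correlation, penalizes feature intercorrelation."""
    k = len(subset)
    r_cf = sum(class_corr[f] for f in subset) / k
    pairs = [frozenset((a, b))
             for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(feat_corr[p] for p in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

def greedy_stepwise(features, class_corr, feat_corr):
    """Forward search: repeatedly add the feature that most improves
    the merit; stop when no single addition helps."""
    selected, best = [], 0.0
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            return selected
        scored = [(merit(selected + [f], class_corr, feat_corr), f)
                  for f in candidates]
        top_score, top_feat = max(scored)
        if top_score <= best:
            return selected
        selected, best = selected + [top_feat], top_score

# Toy data: f2 is moderately predictive but highly redundant with f1,
# so the search keeps f1 and f3 and rejects f2.
class_corr = {"f1": 0.8, "f2": 0.5, "f3": 0.4}
feat_corr = {frozenset(("f1", "f2")): 0.95,
             frozenset(("f1", "f3")): 0.1,
             frozenset(("f2", "f3")): 0.1}
print(greedy_stepwise(["f1", "f2", "f3"], class_corr, feat_corr))
```

A best-first search differs from this sketch mainly in keeping a queue of partially explored subsets so it can backtrack when the current path stops improving.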

