O. Ayana, D. F. Kanbak, M. Kaya Keles / IJOCTA, Vol.15, No.1, pp.50-70 (2025)
most frequently occurring words within the comments, which are subsequently used to transform the comments into TF-IDF vectors, as detailed in Section 3.3. To assess the impact of varying k, we apply all ML algorithms to both the raw dataset and the dataset that has undergone the complete preprocessing procedure for each value of k.

The experimental results, as illustrated in Figure 3, consider both the F-score and the temporal variation with respect to the k value. Notably, the findings indicate that the algorithms generally yield better performance on the raw dataset, with the exception of the KNN algorithm at a k value of 1000. This observation suggests that applying the preprocessing steps may not be advantageous in all scenarios.

Regarding temporal performance, the SVM algorithm demonstrates improved efficiency on the preprocessed dataset, while the remaining three algorithms exhibit negligible changes in execution time. Importantly, MNB achieves the highest performance, attaining an F-score of 0.899 with a runtime of 0.016 seconds.

Furthermore, the results indicate that increasing the k value generally enhances the performance of both SVM and MNB, whereas KNN does not follow this pattern and RF exhibits a decline specifically at a k value of 10000. Based on these findings, we employ a k value of 20000 in subsequent experiments.

4.2.2. Finding the appropriate preprocessing combination

Following the determination of the optimal k value in Section 4.2.1, this section examines the 16 preprocessing combination scenarios presented in Table 1, utilizing the 20000 most frequently occurring words for each ML algorithm. The objective of this analysis is to identify the preprocessing combination that yields the highest F-score, as well as to ascertain which preprocessing steps positively or negatively influence the results.

Figure 4 presents the results of the preprocessing combinations. The x-axis indicates the scenario numbers as listed in Table 1, where each scenario corresponds to the specific preprocessing steps applied, denoted by the code C_j^i = 1. The y-axis represents the F-score achieved for the respective scenario. For instance, scenario number 2, with the code 0001, applies only the 4th preprocessing step; that is, only the stopwords removal operation is applied.

An analysis of Figure 4 reveals that scenarios 3, 7, 11, and 15 result in a decrease in the F-score compared to their preceding scenarios. Upon examining these scenarios, we conclude that introducing the stemming process, which is absent in the preceding scenarios, contributes to the decline in classification performance. Notably, KNN does not exhibit a trend contrasting with that of the other algorithms.

When we analyze scenarios 2, 6, 9, and 13, where the algorithms generally perform well, we observe that the removal of punctuation and stopwords yields significant improvements in the results. Ultimately, we find that MNB excels in scenario 6, achieving an F-score of 0.902 with the combination of punctuation and stopwords removal.

Based on these findings, we have decided to implement the BSO using the MNB classifier for SA. For preprocessing, we determined that applying the punctuation and stopwords removal procedures is appropriate for the dataset.

4.2.3. ML vs DL

Before conducting the BSO for SA, we aim to verify the findings presented in Sections 4.2.1 and 4.2.2, which suggest that preprocessing may not yield optimal results and, in certain cases, may even have detrimental effects. To this end, we evaluate the proposed DL model described in Section 4.1, comparing its performance against the MNB classifier, which has demonstrated superior results among the traditional ML algorithms.

In this analysis, we implement the BiLSTM model on both the raw dataset and the dataset subjected to comprehensive preprocessing. To ascertain the optimal value of each parameter, we employ the GS method within the parameter search space detailed in Table 4. The parameter values that yield the best performance are also summarized in Table 4.

Figure 5 illustrates a notable decline in the performance of the BiLSTM model when preprocessing is applied, consistent with the observations made for the ML algorithms. Ultimately, MNB achieves the highest F-score of 0.902 for SA, outperforming both the ML and DL algorithms.
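The top-k TF-IDF vectorization and MNB classification pipeline described in Sections 4.2.1 and 4.2.2 can be sketched as follows. This is a minimal illustration using scikit-learn; the toy corpus, labels, and tiny k are placeholders standing in for the paper's comment dataset and k = 20000, not the authors' actual setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

# Toy stand-in for the comment dataset; the paper uses real user
# comments with sentiment labels.
comments = [
    "this film was wonderful and moving",
    "terrible acting and a boring plot",
    "an excellent and wonderful story",
    "boring, terrible, and a waste of time",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Keep only the k most frequently occurring words when building the
# TF-IDF vectors (k = 20000 in the experiments; a tiny k here).
vectorizer = TfidfVectorizer(max_features=10)
X = vectorizer.fit_transform(comments)

# MNB, the best-performing traditional ML classifier in the paper.
clf = MultinomialNB()
clf.fit(X, labels)
preds = clf.predict(X)
print("F-score:", f1_score(labels, preds))
```

Here `max_features=k` plays the role of selecting the k most frequent words before vectorization; varying it reproduces the kind of sweep shown in Figure 3.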

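The 4-bit scenario codes of Table 1 can be enumerated mechanically, as in the sketch below. The assignments of steps 2-4 (punctuation removal, stemming, stopword removal) are inferred from the scenarios discussed in the text; the name of step 1 is an assumption made purely for illustration:

```python
from itertools import product

# Steps 2-4 inferred from the text; step 1's name is assumed.
steps = ["lowercasing", "punctuation removal", "stemming", "stopword removal"]

# Scenario 1 is 0000 (no preprocessing), scenario 16 is 1111 (all steps).
scenarios = []
for bits in product("01", repeat=4):
    code = "".join(bits)
    applied = [step for step, bit in zip(steps, bits) if bit == "1"]
    scenarios.append((code, applied))

print(len(scenarios))  # 16 combinations
print(scenarios[1])    # scenario 2: code 0001, stopword removal only
print(scenarios[5])    # scenario 6: code 0101, punctuation + stopword removal
```

This numbering is consistent with the text: scenario 2 (0001) applies only stopword removal, and scenario 6 (0101) combines punctuation and stopword removal.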

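The GS (grid search) tuning of the BiLSTM amounts to exhaustively scoring every combination in the parameter grid and keeping the best one. A framework-agnostic sketch follows; the grid and the scoring function are illustrative stand-ins, not the search space of Table 4 or actual BiLSTM training:

```python
from itertools import product

# Illustrative hyperparameter grid; the real search space and the
# best values found are those reported in Table 4 of the paper.
param_grid = {
    "units": [32, 64],
    "dropout": [0.2, 0.5],
    "learning_rate": [1e-3, 1e-4],
}

def evaluate(params):
    # Stand-in for training the BiLSTM and measuring its F-score;
    # a synthetic score keeps the sketch runnable end to end.
    return params["units"] / 64 - params["dropout"] / 10 + params["learning_rate"]

# Exhaustively score every combination and keep the best (grid search).
best_score, best_params = float("-inf"), None
for values in product(*param_grid.values()):
    params = dict(zip(param_grid, values))
    score = evaluate(params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params)
```

With real training in place of `evaluate`, the cost grows multiplicatively with each added parameter, which is why the search space in Table 4 is kept small.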