
O. Ayana, D. F. Kanbak, M. Kaya Keles / IJOCTA, Vol.15, No.1, pp.50-70 (2025)

            most frequently occurring words within the comments, which are subsequently used to transform the comments into TF-IDF vectors, as detailed in Section 3.3. To assess the impact of varying k, we apply all ML algorithms to both the raw dataset and the dataset that has undergone the complete preprocessing procedure for each value of k.

            The experimental results, as illustrated in Figure 3, consider both the F-score and the temporal variation with respect to the k value. Notably, the findings indicate that the algorithms generally yield better performance on the raw dataset, with the exception of the KNN algorithm at a k value of 1000. This observation suggests that the application of the preprocessing steps may not be advantageous in all scenarios.

            Regarding temporal performance, it is observed that the SVM algorithm demonstrates improved efficiency on the preprocessed dataset, while the remaining three algorithms exhibit negligible changes in execution time. Importantly, MNB achieves the highest performance, attaining an F-score of 0.899 with a runtime of 0.016 seconds.

            Furthermore, the results indicate that an increase in the k value generally leads to enhanced performance for both SVM and MNB, although not for KNN. However, for RF, a decline is noted specifically at the k value of 10000. Based on these findings, we employ a k value of 20000 in subsequent experiments.

            4.2.2. Finding the appropriate preprocessing combination

            Following the determination of the optimal k value in Section 4.2.1, this section examines 16 different preprocessing combination scenarios, as presented in Table 1, utilizing the 20000 most frequently occurring words for each ML algorithm. The objective of this analysis is to identify the preprocessing combination that yields the highest F-score, as well as to ascertain which preprocessing steps positively or negatively influence the results.

            Figure 4 presents the results of the preprocessing combinations. The x-axis indicates the scenario numbers as listed in Table 1, where each scenario corresponds to the specific preprocessing steps applied, denoted by the code C_j^i = 1. The y-axis represents the F-score achieved for the respective scenario indicated on the x-axis. For instance, scenario number 2, coded 0001, applies only the fourth preprocessing step; that is, only the stopwords removal operation is applied.

            An analysis of Figure 4 reveals that scenarios 3, 7, 11, and 15 result in a decrease in the F-score compared to their preceding values. Upon examining these scenarios, we conclude that introducing the stemming process, which is absent in the preceding scenarios, contributes to a decline in classification performance in the specified scenarios. Notably, KNN does not exhibit a contrasting trend relative to the other algorithms.

            When we analyze scenarios 2, 6, 9, and 13, where the algorithms generally perform well, we observe that the removal of punctuation and stopwords yields significant improvements in the results. Ultimately, we find that MNB excels in scenario 6, achieving an F-score of 0.902 with the combination of punctuation and stopwords removal.

            Based on these findings, we have decided to implement the BSO using the MNB classifier for SA. For preprocessing, we determined that the application of the punctuation and stopwords removal procedures is appropriate for the dataset.

            4.2.3. ML vs DL

            Before conducting the BSO for SA, we aim to verify the findings presented in Sections 4.2.1 and 4.2.2, which suggest that preprocessing may not yield optimal results and, in certain cases, may even have detrimental effects. To this end, we evaluate the proposed DL model described in Section 4.1, comparing its performance against the MNB classifier, which has demonstrated superior results among the traditional ML algorithms.

            In this analysis, we implement the BiLSTM model on both the raw dataset and the dataset subjected to comprehensive preprocessing. To ascertain the optimal values for each parameter, we employ the GS method within the parameter search space detailed in Table 4. The parameter values that yield the best performance are also summarized in Table 4.

            Figure 5 illustrates a notable decline in the performance of the BiLSTM model when preprocessing is applied, consistent with the observations made for the ML algorithms. Ultimately, MNB achieves the highest F-score of 0.902 for SA, outperforming both the ML and DL algorithms.
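The top-k TF-IDF vectorization described above can be sketched as follows. This is an illustrative toy example, not the authors' pipeline: the comments and the tiny k are placeholders (the paper sweeps k up to 20000), and the IDF weighting is the plain logarithmic form, which may differ from the exact variant used.

```python
import math
from collections import Counter

# Toy comments standing in for the real dataset (assumption, for illustration).
comments = [
    "great product works great",
    "bad product broke quickly",
    "great value works well",
]

k = 5  # keep only the k most frequent words; the paper settles on k = 20000

# 1) Vocabulary: the k most frequently occurring words across all comments.
freq = Counter(word for c in comments for word in c.split())
vocab = [w for w, _ in freq.most_common(k)]

# 2) Each comment becomes a TF-IDF vector over that fixed vocabulary.
n_docs = len(comments)
df = {w: sum(w in c.split() for c in comments) for w in vocab}

def tfidf(comment):
    counts = Counter(comment.split())
    # term frequency times inverse document frequency, one entry per vocab word
    return [counts[w] * math.log(n_docs / df[w]) for w in vocab]

vectors = [tfidf(c) for c in comments]
```

In practice a library vectorizer (e.g. scikit-learn's `TfidfVectorizer` with `max_features=k`) plays this role, keeping only the most frequent terms before fitting the ML classifiers.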
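The 16 scenario codes can be enumerated as 4-bit masks over the preprocessing steps. The step list below is partly an assumption: the text names punctuation removal, stopwords removal, and stemming, while "lowercasing" and the bit order are placeholders, chosen so that code 0001 applies only stopwords removal, matching the example above.

```python
from itertools import product

# One bit per preprocessing step; step names other than those the text
# mentions (punctuation removal, stemming, stopwords removal) are assumed.
steps = ["lowercasing", "punctuation removal", "stemming", "stopwords removal"]

scenarios = []
for bits in product([0, 1], repeat=4):  # all 2**4 = 16 combinations
    code = "".join(map(str, bits))
    applied = [step for step, bit in zip(steps, bits) if bit]
    scenarios.append((code, applied))

# The scenario coded "0001" applies only the fourth step, stopwords removal.
```

Running each classifier once per scenario and recording the F-score reproduces the kind of sweep plotted in Figure 4.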
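The GS tuning of the BiLSTM can be sketched as an exhaustive sweep over a parameter grid. The parameter names and values below are hypothetical placeholders, not the actual Table 4 search space, and `evaluate` stands in for training the model and returning its F-score.

```python
from itertools import product

# Hypothetical search space; Table 4's real parameters and ranges differ.
param_grid = {
    "lstm_units": [64, 128],
    "dropout": [0.2, 0.5],
    "learning_rate": [1e-3, 1e-4],
}

def evaluate(params):
    # Placeholder in lieu of training the BiLSTM and computing its F-score;
    # this deterministic formula only makes the sketch runnable.
    return params["lstm_units"] / 128 - params["dropout"] + params["learning_rate"]

best_params, best_score, n_evals = None, float("-inf"), 0
keys = list(param_grid)
for values in product(*param_grid.values()):  # every combination = grid search
    params = dict(zip(keys, values))
    score = evaluate(params)
    n_evals += 1
    if score > best_score:
        best_params, best_score = params, score
```

With real training in place of the placeholder, this loop visits all 2 x 2 x 2 = 8 combinations and retains the best-scoring configuration, which is what GS tooling automates.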