Page 62 - IJOCTA-15-1
P. 62
O. Ayana, D. F. Kanbak, M. Kaya Keles / IJOCTA, Vol.15, No.1, pp.50-70 (2025)
Table 1. Coding of preprocessing methods
Scenario Code Lowercase Conversion Off (0)/ Punctuation Removal Off (0)/ Stemming Off (0) / Stopwords Removal Off (0) /
Number Lowercase Conversion On (1) Punctuation Removal On (1) Stemming On (1) Stopwords Removal On (1)
1 0000 0 0 0 0
2 0001 0 0 0 1
3 0010 0 0 1 0
4 0011 0 0 1 1
5 0100 0 1 0 0
6 0101 0 1 0 1
7 0110 0 1 1 0
8 0111 0 1 1 1
9 1000 1 0 0 0
10 1001 1 0 0 1
11 1010 1 0 1 0
12 1011 1 0 1 1
13 1100 1 1 0 0
14 1101 1 1 0 1
15 1110 1 1 1 0
16 1111 1 1 1 1
Figure 1. The structure of the proposed model
Document Frequency (IDF) of the term t. An ex- subsequently processed by the ML algorithms for
planation of the TF and IDF values can be found SA.
in Section 3.3.1. The rationale for selecting only
k words is that not all words in V are suitable for
classification, and some may have very low fre- In the scenario involving the DL model, each com-
quencies. The results of the experimental scenar- ment c is represented as a fixed-length k vector
+ k
ios conducted to determine the optimal number W c = {Z } , where k corresponds to the top-k
of k words are presented in Section 4.2.1. Finally, words in the vocabulary V . In this context, the
t
the comments converted into TF-IDF vectors are value of each word W is defined by its position
c
divided into training and testing datasets and are in the vocabulary V . The optimal k value for this
56

