Page 61 - IJOCTA-15-1
BSO: Binary Sailfish Optimization for feature selection in sentiment analysis
binary; it can be either enabled or disabled, resulting in the removal or retention of alphanumeric characters, numbers, punctuation marks, and emoticons. Stemming also follows this binary framework, where words are either converted to their root forms or preserved in their initial forms. Lastly, stopword removal is applied in a binary manner, allowing for the elimination or retention of stopwords within the text. Consequently, a total of sixteen distinct preprocessing combinations, represented by binary codes, are outlined in Table 1. This study aims to assess the impact of these preprocessing techniques on both the performance of SA classification and the execution times of the algorithms.
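The sixteen binary-coded combinations can be enumerated mechanically from the four on/off switches. A minimal sketch (the flag names follow the subsections of Section 3.2 and are our own labels, not the paper's notation):

```python
from itertools import product

# Four binary preprocessing switches, named after Section 3.2's subsections.
# Mapping switches to subsections is our reading of the text, not the paper's.
FLAGS = ["tokenization", "stopword_removal", "punctuation_removal", "stemming"]

# Every on/off assignment of the four flags -> 2**4 = 16 combinations,
# each identified by a 4-bit binary code, as in Table 1.
combinations = [
    {"code": "".join(map(str, bits)),
     "enabled": [f for f, b in zip(FLAGS, bits) if b]}
    for bits in product((0, 1), repeat=len(FLAGS))
]

print(len(combinations))        # 16
print(combinations[0]["code"])  # '0000' (all preprocessing disabled)
```

Each 4-bit code then selects which preprocessing steps run before classification, which is what allows the study to compare all sixteen settings uniformly.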
3.2.1. Tokenization

This method splits the terms in a document into individual tokens so that they can be processed separately. [69] Using the individually separated terms, their frequencies and weights in the dataset can be calculated.
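A minimal tokenizer with frequency counting might look as follows (this regex-based splitter is illustrative only; the paper does not specify which tokenizer it uses):

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Lowercase the text and extract runs of (Turkish) letters; a simplified
    # stand-in for whatever tokenizer the study actually applied.
    return re.findall(r"[a-zçğıöşü]+", text.lower())

comment = "Ürün çok güzel, kargo çok hızlı!"
tokens = tokenize(comment)
freqs = Counter(tokens)  # per-term frequencies, usable for weighting
print(tokens)        # ['ürün', 'çok', 'güzel', 'kargo', 'çok', 'hızlı']
print(freqs["çok"])  # 2
```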
3.2.2. Stopword removal

Stopword removal is the process of eliminating terms that do not contribute meaningful content to the document and are characterized by their frequent occurrence across nearly all documents. This technique is employed to reduce the search space and to enhance the efficiency of subsequent analyses. [70] Many techniques have been developed for the automatic identification and deletion of these redundant words in documents, such as the classical method, the Zipf's law method, and Term-Based Random Sampling (TBRS). [71] In this study, we use the stopword list prepared by the Natural Language Toolkit (NLTK), [72] an open-source library written for text mining in the Python programming language.
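Filtering tokens against a stopword set is a simple membership test. A sketch (the tiny stopword set here is an illustrative subset chosen by us; the study uses NLTK's full Turkish list):

```python
# Illustrative subset of Turkish stopwords; the study uses the full NLTK list.
STOPWORDS = {"ve", "bir", "bu", "çok", "için", "ama", "de", "da"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    # Keep only tokens that are not in the stopword set.
    return [t for t in tokens if t not in STOPWORDS]

tokens = ["ürün", "çok", "güzel", "ve", "kargo", "hızlı"]
print(remove_stopwords(tokens))  # ['ürün', 'güzel', 'kargo', 'hızlı']
```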
3.2.3. Punctuation removal

To make the texts cleaner and to reduce the search space, punctuation marks, which carry no meaning of their own, are deleted from the text.
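In Python this deletion can be done with a translation table over the standard punctuation set (a sketch; the study's exact character set, e.g. its handling of emoticons, is not specified here):

```python
import string

def remove_punctuation(text: str) -> str:
    # Delete every ASCII punctuation character; emoticons built from
    # punctuation (e.g. ":)") disappear as a side effect.
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("Harika, kesinlikle tavsiye ederim!"))
# Harika kesinlikle tavsiye ederim
```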
3.2.4. Stemming

Terms are used in sentences in different ways depending on the meaning they attach to the sentence, usually in one of two ways: by suffixing or by derivation. More than one word with the same root can therefore be found in the same sentence or document. A large dataset makes it more difficult for ML algorithms to capture the dataset's meanings and patterns (such variants act like noise data). For this reason, we prefer to apply stemming as a preprocessing step. The purpose of stemming is to condense a word's inflectional and sometimes derivationally related forms to a single base form. [73] Previously, many different stemming algorithms, such as the Lovins stemmer, [74] the Porter stemmer, [75] and the Krovetz stemmer, [76] were developed and applied in the literature as preprocessing for text data. In this study, we use the open-source Zemberek library [40] developed for Turkish. As a result, the search space is reduced by collapsing words with the same root, thus saving time and memory.
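As a rough illustration of the idea only (not Zemberek's actual behavior, which performs full morphological analysis with vowel harmony), a naive longest-suffix stripper can show how inflected forms collapse to one base form:

```python
# Toy stemmer for illustration only: strips a few common Turkish suffixes
# by longest match. Zemberek performs real morphological analysis; this
# sketch ignores vowel harmony and most of Turkish morphology.
SUFFIXES = sorted(["lar", "ler", "lari", "leri", "da", "de", "ta", "te"],
                  key=len, reverse=True)

def naive_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Require a reasonably long remaining stem to avoid over-stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("ürünler"))  # ürün
print(naive_stem("kargoda"))  # kargo
```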
3.3. The Structure of the proposed model

In this section, we present a comprehensive definition of the proposed methodology.

Initially, we developed a web crawler using the Python programming language to gather user comments from online sales platforms, specifically Trendyol. This crawler facilitated the collection of comments associated with various products listed on these platforms. Subsequently, we integrated the data obtained from Trendyol with previously collected and publicly available data from n11. Detailed information regarding the dataset is presented in Section 3.1. Following this, the collected and labeled comments are processed to prepare them for the preprocessing stage, during which the methods outlined in Section 3.2 are applied to each comment for data cleansing.

Subsequently, we conducted two distinct scenarios on the cleaned data: SA utilizing ML algorithms and SA employing the DL model outlined in Section 4.1. A vocabulary V = {f_t, f_(t+1), ..., f_(t+n)} is constructed from the unique words t that appear in the cleaned comments, where f_t denotes the frequency of occurrences of word t and n represents the total number of unique words. The vocabulary V is then sorted in descending order based on the frequencies f_t.

In the scenario involving ML algorithms, each comment c is represented as a fixed-length vector W_c ∈ R^k, where k corresponds to the top-k words in the vocabulary V. The value of each entry W_c^t in the vector W_c is computed by multiplying the Term Frequency (TF) by the Inverse Document Frequency (IDF).
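The vocabulary construction and TF-IDF weighting just described can be sketched as follows (details such as the IDF variant and smoothing are our assumptions; the text does not specify them):

```python
import math
from collections import Counter

def build_vocabulary(comments: list[list[str]], k: int) -> list[str]:
    # Count every unique word across the cleaned comments and keep the
    # top-k words by descending frequency, as described for V.
    freqs = Counter(word for comment in comments for word in comment)
    return [word for word, _ in freqs.most_common(k)]

def tfidf_vector(comment: list[str], vocab: list[str],
                 comments: list[list[str]]) -> list[float]:
    # W_c: one TF-IDF weight per top-k vocabulary word.
    n_docs = len(comments)
    counts = Counter(comment)
    vector = []
    for word in vocab:
        tf = counts[word] / len(comment) if comment else 0.0
        df = sum(1 for c in comments if word in c)
        idf = math.log(n_docs / df) if df else 0.0  # unsmoothed IDF (assumption)
        vector.append(tf * idf)
    return vector

comments = [["ürün", "güzel"], ["kargo", "hızlı", "ürün"], ["ürün", "kötü"]]
vocab = build_vocabulary(comments, k=3)
print(vocab[0])  # 'ürün' (the most frequent word)
vec = tfidf_vector(comments[0], vocab, comments)
print(len(vec))  # 3 (one weight per top-k word)
```

Note that a word occurring in every comment (here "ürün") gets an IDF of zero under this unsmoothed variant, which is why production implementations usually add smoothing.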