
binary; it can be either enabled or disabled, resulting in the removal or retention of alphanumeric characters, numbers, punctuation marks, and emoticons. Stemming also follows this binary framework, where words are either converted to their root forms or preserved in their initial forms. Lastly, stopword removal is applied in a binary manner, allowing for the elimination or retention of stopwords within the text. Consequently, a total of sixteen distinct preprocessing combinations, represented by binary codes, are outlined in Table 1. This study aims to assess the impact of these preprocessing techniques on both the performance of SA classification and the execution times of the algorithms.
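As an illustrative sketch only, the sixteen on/off combinations can be enumerated as 4-bit codes in Python. The assumption that the four switches correspond to the four techniques of Sections 3.2.1-3.2.4, and the step names below, are ours; the paper's actual coding is defined in Table 1:

```python
from itertools import product

# Assumed switch names; the paper's Table 1 defines the actual coding.
steps = ["tokenization", "stopword_removal", "punctuation_removal", "stemming"]

# Each step is either enabled (1) or disabled (0): 2**4 = 16 combinations.
for bits in product([0, 1], repeat=len(steps)):
    code = "".join(map(str, bits))
    enabled = [step for step, bit in zip(steps, bits) if bit]
    print(code, enabled)
```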
            3.2.1. Tokenization                               time and memory space.


This method separates the terms in a document into tokens so that they can be processed individually.[69] Once the terms are separated, their frequencies and weights in the data set can be calculated.
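As a minimal sketch of this step (not the authors' exact code; the sample comment is hypothetical), tokens and their frequencies can be obtained as follows:

```python
from collections import Counter

comment = "ürün çok güzel kargo çok hızlı"  # hypothetical Turkish comment

# Separate the comment into individual tokens.
tokens = comment.split()

# With the terms separated, their frequencies in the data set can be counted.
frequencies = Counter(tokens)
print(frequencies)  # Counter({'çok': 2, 'ürün': 1, ...})
```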
3.2.2. Stopword removal

Stopword removal is the process of eliminating terms that do not contribute meaningful content to the document and are characterized by their frequent occurrence across nearly all documents. This technique is employed to reduce the search space and enhance the efficiency of subsequent analyses.[70] Many techniques have been developed for the automatic identification and deletion of these redundant words in documents, such as the Classical method, the Zipf's Law method, and Term-Based Random Sampling (TBRS).[71] In this study, we use the stopword list provided by the Natural Language Toolkit,[72] an open-source library for text mining in the Python programming language.
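A minimal sketch of this step, assuming the NLTK stopword corpus (which includes a Turkish list) has been downloaded; the sample tokens are hypothetical:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # fetch the stopword lists once

turkish_stopwords = set(stopwords.words("turkish"))

tokens = ["ürün", "çok", "güzel", "ve", "kargo", "hızlı"]  # hypothetical tokens
# Drop high-frequency terms, such as "ve", that carry no meaningful content.
filtered = [t for t in tokens if t not in turkish_stopwords]
print(filtered)
```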
3.2.3. Punctuation removal

To make the texts clearer and to reduce the search space, punctuation marks that contribute no meaning to the text are deleted.
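A minimal sketch, assuming plain ASCII punctuation and a hypothetical comment (the study also strips emoticons and other symbols, which may require a broader character set):

```python
import string

comment = "Ürün çok güzel!!! Kargo, hızlı geldi :)"  # hypothetical comment

# Remove every ASCII punctuation mark to clean the text and shrink the search space.
cleaned = comment.translate(str.maketrans("", "", string.punctuation))
print(cleaned)  # Ürün çok güzel Kargo hızlı geldi
```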
3.2.4. Stemming

Terms are used in sentences in different ways depending on the meaning they attach to the sentence, usually in one of two ways: by suffixation or by derivation. As a result, more than one word with the same root can appear in the same sentence or document. Such redundancy makes it more difficult for ML algorithms to capture the meanings and patterns of a large dataset, since the extra forms behave like noise. For this reason, we apply stemming as a preprocessing step. The purpose of stemming is to condense a word's inflectional and sometimes derivationally related forms to a single base form.[73] Many different stemming algorithms, such as the Lovins stemmer,[74] the Porter stemmer,[75] and the Krovetz stemmer,[76] have previously been developed and applied in the literature as preprocessing for text data. In this study, we use the open-source Zemberek library[40] developed for Turkish. In this way, we aim to decrease the search space by collapsing words that share a root, thus saving time and memory.
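The study itself uses Zemberek for Turkish; as a rough stand-in that sketches the same idea in Python, the Snowball Turkish stemmer can be used (this is our substitution, not the authors' implementation):

```python
import snowballstemmer  # pip install snowballstemmer

# Snowball ships a Turkish stemmer; the paper relies on Zemberek instead.
stemmer = snowballstemmer.stemmer("turkish")

# Hypothetical inflected forms that share the root "kitap" (book).
words = ["kitap", "kitaplar", "kitaplardan"]
print(stemmer.stemWords(words))  # suffixes stripped toward a common base form
```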
3.3. The structure of the proposed model

In this section, we present a comprehensive definition of the proposed methodology.

Initially, we developed a web crawler in the Python programming language to gather user comments from online sales platforms, specifically Trendyol. This crawler facilitated the collection of comments associated with various products listed on these platforms. Subsequently, we integrated the data obtained from Trendyol with previously collected and publicly available data from n11. Detailed information regarding the dataset is presented in Section 3.1. Following this, the collected and labeled comments are prepared for the preprocessing stage, during which the methods outlined in Section 3.2 are applied to each comment for data cleansing.

We then conducted two distinct scenarios on the cleaned data: SA utilizing ML algorithms and SA employing the DL model outlined in Section 4.1. A vocabulary V ∈ {f_t, f_(t+1), ..., f_(t+n)} is constructed from each unique word t that appears in the cleaned comments, where f_t denotes the frequency of occurrences of word t and n represents the total number of unique words. The vocabulary V is then sorted in descending order based on the frequencies f_t.
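A minimal sketch of this vocabulary construction (the cleaned comments and variable names are illustrative, not from the paper):

```python
from collections import Counter

# Hypothetical comments after the Section 3.2 preprocessing steps.
cleaned_comments = ["ürün güzel kargo hızlı", "ürün kötü kargo yavaş"]

# f_t: frequency of every unique word t across all cleaned comments.
frequencies = Counter(word for c in cleaned_comments for word in c.split())

# Sort the vocabulary V in descending order of frequency.
vocabulary = sorted(frequencies.items(), key=lambda kv: kv[1], reverse=True)
print(vocabulary)  # [('ürün', 2), ('kargo', 2), ('güzel', 1), ...]
```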
In the scenario involving ML algorithms, each comment c is represented as a fixed-length k-dimensional vector W_c = {R}^k, where k corresponds to the top-k words in the vocabulary V. The value of each word W_c^t in the vector W_c is computed by multiplying the Term Frequency (TF) by the Inverse Document Frequency (IDF).
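This TF x IDF weighting over the top-k vocabulary words can be sketched with scikit-learn's TfidfVectorizer; the paper does not name this library, and max_features plays the role of k here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

cleaned_comments = ["ürün güzel kargo hızlı", "ürün kötü kargo yavaş"]  # hypothetical

# Restrict the vocabulary to the k most frequent words and weight each
# entry of W_c by term frequency times inverse document frequency.
vectorizer = TfidfVectorizer(max_features=3)  # k = 3 for illustration
W = vectorizer.fit_transform(cleaned_comments)
print(vectorizer.get_feature_names_out(), W.shape)  # k columns per comment
```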