
H.H. Yildirim, A. Akusta / IJOCTA, Vol.15, No.1, pp.183-201 (2025)
to a different type of investor or strategic approach.

While the mathematical foundation of K-means clustering is robust, determining the optimal number of clusters is a critical step in the analysis. To address this, we employ several widely recognized methods.

3.5.2. Rationale for selecting two clusters

The decision to separate firms into two clusters was supported by several quantitative measures, each indicating that two clusters provided the most meaningful and interpretable grouping.

Elbow Method: The elbow method plots the within-cluster sum of squared errors against the number of clusters and takes the point at which the decrease slows sharply (the "elbow") as the optimal number of clusters.⁵⁰

The elbow method, which evaluates inertia (the sum of squared distances from each point to its nearest cluster center), showed a significant decrease in inertia from one to two clusters. However, the rate of decrease slowed markedly after two clusters. This pattern suggested diminishing returns in cluster differentiation beyond two clusters.⁵¹

This section outlines the key mathematical formulations employed in optimizing K-Means clustering using the Elbow Method. These formulations are essential for implementing and understanding the clustering process and for determining the optimal number of clusters.

Normalization is a preprocessing step used to scale data values to a specified range, typically [0, 1]. The normalization formula is given by:

\[ X^{*} = \frac{x - \text{minValue}}{\text{maxValue} - \text{minValue}} \tag{18} \]

Where x is the value to be normalized, “minValue” is the minimum value in the dataset, and “maxValue” is the maximum value. This formula ensures that the data values are scaled appropriately, facilitating the comparison of attributes with different units or scales.⁵¹

The mean formula is utilized to initialize centroids in the K-Means algorithm. The centroid $\mu_i$ of cluster i is computed as the average of all data points in the cluster:

\[ \mu_i = \frac{x_1 + x_2 + x_3 + \cdots + x_j}{j} \tag{19} \]

Where $x_j$ represents an object value in the cluster, and j is the number of objects in the cluster. This method provides a robust initial estimate of the cluster center by averaging the values of all data points within the cluster.⁵¹

The median value initializes centroids for datasets with an odd number of data points. The median is determined by:

\[ ME = X_{\left(\frac{n+1}{2}\right)} \tag{20} \]

Where n is the total number of data points and X represents the data points sorted in ascending order. The median provides a measure of central tendency that is less sensitive to outliers than the mean.⁵¹

In cases where the dataset contains an even number of data points, the median is computed as the average of the two central values:

\[ ME = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2} \tag{21} \]

This formulation ensures that the median accurately reflects the central location of the data points.⁵¹

An alternative approach for estimating the median uses only the minimum and maximum values of the dataset (a quantity usually called the midrange). This approach is defined as:

\[ ME = \frac{\text{minimum value} + \text{maximum value}}{2} \tag{22} \]

This formula offers a simplified yet effective method for estimating the median by considering the extremities of the data range.⁵¹

The Euclidean distance is a widely used measure for computing the proximity between two points in an n-dimensional space. The distance d between points x and c is calculated as:

\[ d(x, c) = \sqrt{(x_1 - c_1)^2 + (x_2 - c_2)^2 + \cdots + (x_n - c_n)^2} \tag{23} \]

Where x and c are n-dimensional vectors representing data points. The Euclidean distance is the primary metric for assigning data points to the nearest cluster centroid in the K-Means algorithm.⁵¹

The Sum of Squared Errors (SSE), also known as the within-cluster sum of squares, is a measure of the compactness of the clustering. It is defined as: