to a different type of investor or strategic approach.

While the mathematical foundation of K-means clustering is robust, determining the optimal number of clusters is a critical step in the analysis. To address this, we employ several widely recognized methods.
3.5.2. Rationale for selecting two clusters

As noted above, determining the optimal number of clusters is a critical step in the analysis. The decision to separate firms into two clusters was supported by several quantitative measures, each indicating that two clusters provided the most meaningful and interpretable grouping.

Elbow Method: The elbow method plots the sum of squares of the intra-cluster error for each candidate number of clusters and takes the point beyond which this value no longer decreases sharply as the optimal number of clusters [50]. The elbow method, which evaluates inertia (the sum of squared distances to the nearest cluster center), showed a significant decrease in inertia from one to two clusters; however, the rate of decrease slowed markedly after two clusters. This pattern suggested diminishing returns in cluster differentiation beyond two clusters [51].
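As an illustration of this analysis, the sketch below computes the inertia curve for a range of candidate cluster counts. It assumes scikit-learn's KMeans and a hypothetical standardized feature matrix X; it is a minimal sketch of the procedure described above, not the exact code used in this study.

```python
# Illustrative elbow analysis on hypothetical data (not the study's implementation).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))          # stand-in for the normalized firm-level features

inertias = []
for k in range(1, 8):                  # candidate numbers of clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)       # within-cluster sum of squared distances

# The "elbow" is the point after which the marginal drop in inertia flattens out.
drops = np.diff(inertias)
for k, (inertia, drop) in enumerate(zip(inertias[1:], drops), start=2):
    print(f"k={k}: inertia={inertia:.1f}, decrease from k-1={-drop:.1f}")
```

Inspecting the printed decreases (or plotting inertia against k) identifies the cluster count at which further splitting yields only marginal gains.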
This section outlines the critical mathematical formulations employed in optimizing K-Means clustering using the Elbow Method. These formulations are essential for implementing and understanding the clustering process and for determining the optimal number of clusters.

Normalization is a preprocessing step used to scale data values within a specified range, typically (0, 1). The normalization formula is given by:

X^{*} = \frac{x - \text{minValue}}{\text{maxValue} - \text{minValue}}    (18)

Where x is the value to be normalized, "minValue" is the minimum value in the dataset, and "maxValue" is the maximum value. This formula ensures that the data values are scaled appropriately, facilitating the comparison of attributes with different units or scales [51].
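A minimal sketch of Eq. (18), applied column by column with NumPy, is given below; the array X and its values are hypothetical and serve only to illustrate the scaling step.

```python
# Min-max normalization of each attribute, per Eq. (18).
import numpy as np

X = np.array([[10.0, 200.0],
              [15.0,  50.0],
              [20.0, 125.0]])          # hypothetical raw attribute values

min_vals = X.min(axis=0)               # minValue per attribute
max_vals = X.max(axis=0)               # maxValue per attribute
X_norm = (X - min_vals) / (max_vals - min_vals)

print(X_norm)                          # each column is now scaled to the [0, 1] interval
```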
The mean formula is utilized to initialize centroids in the K-Means algorithm. The centroid \mu_i of cluster i is computed as the average of all data points in the cluster:

\mu_i = \frac{x_1 + x_2 + x_3 + \cdots + x_j}{j}    (19)

Where x_j represents an object value in the cluster, and j is the number of objects in the cluster. This method provides a robust initial estimate of the cluster center by averaging the values of all data points within the cluster [51].
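The averaging step of Eq. (19) can be sketched as follows; the feature matrix and cluster assignments are hypothetical and are included only to show the computation.

```python
# Mean-based centroid estimate per Eq. (19): average all points assigned to each cluster.
import numpy as np

X = np.array([[0.1, 0.9],
              [0.2, 0.8],
              [0.9, 0.1],
              [0.8, 0.2]])             # hypothetical normalized data points
labels = np.array([0, 0, 1, 1])        # hypothetical cluster assignments

centroids = np.array([X[labels == i].mean(axis=0) for i in range(2)])
print(centroids)                       # one row per cluster center
```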
The median value initializes centroids for datasets with an odd number of data points. The median is determined by:

ME = X_{\left(\frac{n+1}{2}\right)}    (20)

Where n is the total number of data points and X represents the data points sorted in ascending order. The median provides a central tendency measure that is less sensitive to outliers than the mean [51].

In cases where the dataset contains an even number of data points, the median is computed as the average of the two central values:

ME = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2}    (21)

This formulation ensures that the median accurately reflects the central location of the data points [51].
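The odd and even cases of Eqs. (20) and (21) are handled together by a standard median routine, as in the sketch below; the sample vectors are hypothetical.

```python
# Median per Eqs. (20)-(21): middle value for odd n, mean of the two middle values for even n.
import numpy as np

x_odd = np.array([3.0, 1.0, 7.0, 5.0, 9.0])    # n = 5 (odd)
x_even = np.array([3.0, 1.0, 7.0, 5.0])        # n = 4 (even)

# np.median sorts the data internally and applies the appropriate rule.
print(np.median(x_odd))    # 5.0 -> X_((n+1)/2) of the sorted data
print(np.median(x_even))   # 4.0 -> average of X_(n/2) and X_(n/2+1)
```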
An alternative approach for calculating the median involves the minimum and maximum values of the dataset. This approach is defined as:

ME = \frac{\text{minimum value} + \text{maximum value}}{2}    (22)

This formula offers a simplified yet effective method for estimating the median by considering the extremities of the data range [51].
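For comparison, the mid-range estimate of Eq. (22) can be computed directly from the data extremes, as sketched below with a hypothetical vector; the exact median is printed alongside it.

```python
# Mid-range estimate of the median per Eq. (22): average of the minimum and maximum values.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])   # hypothetical data points
mid_range = (x.min() + x.max()) / 2         # (1 + 10) / 2 = 5.5
print(mid_range, np.median(x))              # 5.5 vs. the exact median 3.0
```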
The Euclidean distance is a widely used measure for computing the proximity between two points in an n-dimensional space. The distance d between points x and c is calculated as:

d(x, c) = \sqrt{(x_1 - c_1)^2 + (x_2 - c_2)^2 + \cdots + (x_n - c_n)^2}    (23)

Here x and c are n-dimensional vectors representing data points. The Euclidean distance is the primary metric for assigning data points to the nearest cluster centroid in the K-Means algorithm [51].
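A minimal sketch of Eq. (23) and of its role in the assignment step is given below; the data points and centroids are hypothetical.

```python
# Euclidean distance per Eq. (23) and nearest-centroid assignment.
import numpy as np

def euclidean(x, c):
    """Distance between two n-dimensional points, Eq. (23)."""
    return np.sqrt(np.sum((x - c) ** 2))

X = np.array([[0.1, 0.9], [0.85, 0.15]])       # hypothetical data points
centroids = np.array([[0.15, 0.85], [0.9, 0.1]])

# Assign each point to the centroid with the smallest Euclidean distance.
labels = np.array([np.argmin([euclidean(x, c) for c in centroids]) for x in X])
print(labels)                                   # -> [0 1]
```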
The Sum of Squared Errors (SSE), also known as the within-cluster sum of squares, is a measure of the compactness of the clustering. It is defined as: