Artificial Intelligence in Health Optimized clustering in medical app detection
number of clusters and identifying clusters of various shapes and sizes. However, the data used in this work form spherical clusters, and hence, no improvement is seen in using alternative clustering algorithms. To address the challenges posed by outliers, the non-deterministic nature of K-means, and the potential issue of points near cluster boundaries being assigned to different clusters, several strategies are employed. Before clustering, DBSCAN, a density-based technique that flags low-density points as noise, is applied to identify and remove outliers from the dataset. This helps prevent outliers from unduly influencing the clustering results and improves the robustness of the algorithm. Furthermore, as already discussed, instead of relying on a single random initialization for the centroids, the algorithm can be run multiple times with different initializations. By averaging the results or selecting the best clustering solution based on a predefined criterion, the impact of the initial centroid choice can be mitigated. In addition, post-clustering refinement techniques such as cluster merging, splitting, or reassignment based on proximity or density can be applied to address any misassignments near cluster boundaries and improve the overall quality of the clustering results. However, these techniques were not used in our experiments, as we did not encounter such misassignments.

Effective management of hyperparameter tuning for ANNs used in conjunction with K-means clustering was crucial in this study. We utilized cross-validation techniques, such as k-fold cross-validation, to assess the performance of different hyperparameter configurations on multiple subsets of the data. This approach ensures that the selected hyperparameter configuration, evaluated on the app attributes (app name, description, category [e.g., medical, fitness, and wellness], developer information, etc.), generalizes well to unseen data and mitigates the risk of overfitting.

This section also introduces the performance metrics employed to assess the effectiveness of the proposed method. The chosen performance metrics provide a quantitative measure of the system’s efficiency, allowing for a comprehensive evaluation of its detection capabilities.

4.1. Theoretical formulation

The proposed methodology involves utilizing the ANN to classify medical apps and the K-means clustering algorithm to partition the dataset into clusters, with the neural network aiding in determining the centroids of the clusters. The performance of the method is evaluated using a set of performance metrics to assess its detection capabilities.

Let $X = \{x_1, x_2, \ldots, x_n\}$ denote the dataset consisting of $n$ data points, where each $x_i$ represents a feature vector describing a medical application.

The proposed methodology integrates two main components: the ANN and the K-means clustering algorithm.

The ANN is represented as a function $f_{\mathrm{ANN}} : \mathbb{R}^m \to \mathbb{R}^k$, where $m$ is the dimensionality of the feature space and $k$ is the number of classes. The ANN learns complex patterns and relationships within dataset $X$ to classify medical apps into different categories.

The K-means clustering algorithm partitions the dataset $X$ into $K$ clusters, $C = \{C_1, \ldots, C_K\}$, where each cluster represents data points that share similarities.

Let $\mu = \{\mu_1, \mu_2, \ldots, \mu_K\}$ denote the centroids of the clusters. The objective of K-means clustering is to minimize the within-cluster sum of squared distances, which can be formulated as Equation I:

$$\min_{C,\,\mu} \; \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \qquad \text{(I)}$$

The neural network aids in fixing the centroid $\mu_i$ of each cluster within the K-means clustering process, contributing to superior detection performance.

To assess the effectiveness of the proposed method, various performance metrics were employed, including but not limited to accuracy, precision, recall, F1 score, specificity, and area under the receiver operating characteristic (ROC) curve (AUC). These metrics provide a quantitative measure of the system’s efficiency and enable a comprehensive evaluation of its detection capabilities.

The theoretical formulation presented here encompasses several key aspects related to the challenges and approaches to app detection using machine-learning techniques:

(i) Parametric versus non-parametric classifiers: Parametric classifiers, reliant on model building, are noted for their sluggishness, posing challenges for real-time app detection. Non-parametric classifiers, while requiring a set of training data for estimating app distribution, suffer from the drawback of necessitating substantial training data for effective model-building.

(ii) Accuracy paradox: Machine learning techniques typically prioritize the detection of known classes, leading to satisfactory prediction accuracy for established apps. The tendency to concentrate on larger classes with known apps can result in poor prediction accuracies for novel or zero-day apps.

(iii) Zero-day application clustering: Even in cases such as clustering, the focus would be placed on labeling
Volume 1 Issue 4 (2024) 23 doi: 10.36922/aih.2585
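As an illustration of the K-means objective in Equation I and of the multiple-initialization strategy described in this section, the following is a minimal pure-Python sketch, not the authors' implementation. The function names (`wcss`, `kmeans_once`, `kmeans_best_of`), the `n_init` and `seed` parameters, and the use of the lowest within-cluster sum of squares as the "predefined criterion" are assumptions made for the example. The ANN-guided centroid selection and the DBSCAN outlier removal are deliberately left out.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def wcss(points, labels, centroids):
    """Within-cluster sum of squared distances (the objective in Equation I)."""
    return sum(sq_dist(x, centroids[c]) for x, c in zip(points, labels))


def kmeans_once(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm, started from k randomly chosen data points."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(x, centroids[c])) for x in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so this run has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [x for x, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def kmeans_best_of(points, k, n_init=10, seed=0):
    """Run K-means n_init times and keep the run with the lowest objective.

    Assumption: "best" means lowest within-cluster sum of squares.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(n_init):
        labels, centroids = kmeans_once(points, k, rng)
        score = wcss(points, labels, centroids)
        if best is None or score < best[0]:
            best = (score, labels, centroids)
    return best


if __name__ == "__main__":
    # Toy 2-D feature vectors forming two well-separated groups (illustrative only).
    demo = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    score, labels, centroids = kmeans_best_of(demo, k=2)
    print(score, labels, centroids)
```

Keeping the run with the lowest objective value is only one possible selection criterion; the text also mentions averaging results across runs, which this sketch does not attempt.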

