
Artificial Intelligence in Health                                Optimized clustering in medical app detection



review, offering insights into existing knowledge on the subject. Section 3 provides a theoretical background on machine learning techniques commonly applied in app detection, offering a foundational understanding of the methods employed in the field. Section 4 outlines and elucidates the proposed methodology, shedding light on the innovative approach introduced in this research. Section 5 examines and discusses the results obtained, providing a thorough analysis of the outcomes. Finally, Section 6 serves as the conclusive segment, summarizing the key findings and implications derived from the study.

2. Related works

In the literature, the detection of medical apps primarily relies on three prominent methods: the port-based approach, the payload-based approach, and the machine-learning approach. In the port-based approach, medical apps are identified by the well-known ports they use, as registered with the Internet Assigned Numbers Authority (IANA) [1]. Because these registered ports are publicly advertised, identification is straightforward. However, this approach has declined in popularity due to its susceptibility to inaccurate results caused by port obfuscation, particularly in cases where peer-to-peer (P2P) apps disguise their identity by using well-known ports [2].

When the limitations of port-based identification become apparent, the payload-based approach becomes crucial [3]. This method involves inspecting the entire packet content to identify unique and distinctive characteristics. While the payload-based approach exhibits high classification accuracy, it faces several challenges:

(i) Deep inspection is time-consuming, which limits real-time detection in today’s high-speed networks [4].
(ii) The approach is ineffective with encrypted traffic, allowing P2P apps to escape detection [5].
(iii) Privacy concerns exacerbate the challenges associated with this approach [6].

Advanced prediction techniques and data analytics are increasingly employed to enhance productivity and efficiency in detecting medical apps, moving beyond conventional port-based and payload-based approaches. This is because high-speed network connectivity and big data transfers between sensors and monitoring systems demand the use of machine learning techniques and data analytics. These technologies contribute to cost reduction and minimize downtime. In the current literature, app detection leverages machine learning as a core technology to improve the detection performance of novel apps. Signature-based detection algorithms struggle to identify novel or zero-day apps and often exhibit low detection rates with real-world data containing numerous zero-day apps. Anomaly-based machine learning algorithms achieve a high detection rate, but it is often associated with a large false alarm rate, which greatly affects their usability and overall performance. Unsupervised machine learning algorithms are the best at detecting unseen and novel samples in the data. Hence, clustering methods are usually used to detect zero-day apps. The disadvantage of traditional clustering techniques such as K-means is the possibility of an incorrect initial choice of the number of clusters, which can prevent the convergence of the output clusters. In the K-means algorithm, deciding the number of clusters and determining the centroid for each cluster are vital and often challenging tasks, as they directly affect the quality of the resultant clusters. Ahmad and Dey [7] presented a modified description of cluster centers to overcome the limitation of handling only numeric data in the K-means algorithm, thereby enhancing cluster characterization. Another approach using fuzzy c-means has been proposed by Bezdek et al. [8] The clustering results obtained were integrated into a judgment matrix, which was then iteratively partitioned to identify the desired cluster number and the result. Zhou et al. [9] proposed a modified neural network backpropagation algorithm to improve detection rates, particularly in cases where there is an imbalance in the data, with the class of interest being a minority class. Anand et al. [10] modified the placement of the clustering class to overcome the class imbalance. Their modified backpropagation algorithm accelerated the convergence of the neural network. Kumar et al. [11] proposed the under-sampled K-means technique, effectively removing noisy and weak instances from large volumes of the majority class. In the work of Wu [12], clusters were seen to be uniform in size despite variations in input data sizes.

3. Theoretical background

3.1. Need for medical apps

Before delving into the key factors contributing to the essential nature of health-care apps, it is crucial to explore noteworthy statistics and facts that underscore the industry’s growth trajectory. According to Statista, the health-care sector is projected to be one of the top revenue contributors, with estimates suggesting it will increase from $25.39 billion in 2017 to $58.8 billion by 2020 [13]. The report by Research2Guidance indicates that there are 325,000 health-care apps available worldwide, with Android leading the way forward on the mHealth platform. A recent


            Volume 1 Issue 4 (2024)                         18                               doi: 10.36922/aih.2585