
Artificial Intelligence in Health                                   Synthetic data for obesity level prediction



Forte et al.³² developed a deep learning-based NN model aimed at classifying obesity risks among Portuguese adolescents. The model used the FITescola® dataset, which includes information on physical fitness levels and BMI percentiles. Leveraging the power of deep learning, specifically convolutional NNs, the study aimed to improve the detection of obesity risk patterns in youth. The proposed model achieved a classification accuracy of 96.3%, showcasing the potential of deep NNs to support early intervention strategies in public health contexts.

Yağın et al.³³ proposed a Bayesian-optimized NN for the estimation of obesity levels using a dataset focused on lifestyle factors and eating habits obtained from the UCI ML Repository. The study utilized a feedforward deep NN whose hyperparameters were tuned via Bayesian optimization to maximize predictive accuracy. This optimization improved the network's ability to identify significant patterns in the data by fine-tuning parameters such as learning rate and hidden layers. The final model achieved an accuracy of 96.5%, outperforming earlier approaches and demonstrating the effectiveness of combining NNs with optimization strategies.

Gözükara Bağ et al.³⁴ introduced a predictive modeling approach that integrates physical activity and nutritional habit data for classifying obesity levels. They utilized a dataset comprising 2,111 records from the UCI ML Repository, which included variables such as gender, BMI, dietary patterns, and physical activity. The study employed ML algorithms, including RF, k-NN, and XGBoost. Feature scaling and selection techniques were applied to enhance model performance. The highest classification accuracy of 98.87% was achieved using the XGBoost algorithm, underscoring its superiority in handling complex lifestyle-related data for obesity classification.

Several works underscore the impact of diet and lifestyle features on obesity classification. For example, studies using the EOL dataset have identified that eating habits (e.g., frequency of high-calorie food intake, number of meals) and lifestyle choices (e.g., mode of transport, frequency of physical activity) significantly influence obesity level predictions. These findings are consistent with nutrition research showing that "prudent" diet patterns (rich in fruits and vegetables) are linked to lower obesity, whereas fast-food-heavy patterns correlate with higher adiposity.¹⁰ Obesity is closely tied to metabolic syndrome markers. The TyG index study and investigations of oxytocin levels illustrate that blood biomarkers and hormonal factors are often elevated in obesity and associated with eating behaviors.¹²,¹³

In addition to SMOTE, various oversampling techniques have been adapted for multiclass problems in medicine. Yang et al.³⁵ reviewed multiclass oversampling for imbalanced health datasets, noting an emerging trend toward hybrid methods combining SMOTE with other strategies.³⁵ While SMOTE-NC (used in our study) is a straightforward approach that interpolates minority-class samples in mixed-type data, more complex generators like GANs can capture non-linear feature dependencies. Synthetic tabular data in health often requires careful evaluation; we leverage standard classification metrics to assess model performance on generated data.⁷

Recent work on GANs and VAEs shows they can simulate realistic clinical datasets. For instance, standalone reports on conditional tabular GANs (CTGANs) or VAE variants demonstrate their success in reproducing distributions of complex clinical features.⁶,⁷ However, empirical comparisons of these methods (VAE versus GAN versus traditional oversampling) in specific applications like obesity remain limited, which motivates our empirical study. In summary, while many studies have achieved high accuracy in obesity prediction using ensemble or deep learning models, they typically rely on the original data (often including BMI-related attributes).

3. Materials and methods

3.1. Dataset definition

This study utilized the dataset titled Estimation of Obesity Levels Based on Eating Habits and Physical Condition.⁵ The data were collected from individuals in Mexico, Peru, and Colombia, encompassing information on dietary habits, physical conditions, and obesity levels. The dataset contains a total of 2,111 instances and 17 attributes. The first 498 instances were collected directly from users, while the remaining samples were synthetically generated by Palechor et al.¹⁴ using SMOTE. All analyses and synthetic data generation in this study were conducted using the 498 user-collected samples. The features included are gender, age, height, weight, family history of obesity, frequent consumption of high-calorie foods, frequency of vegetable consumption, number of main meals, consumption of food between meals, smoking, daily water consumption, calorie tracking, frequency of physical activity, frequency of using technological devices, alcohol consumption, type of transportation used, and obesity level. It is important to note that the dataset contains no missing values. The gender distribution is shown in Figure 1, with 271 males (54.4%) and 227 females (45.6%), indicating a relatively balanced sample.

As illustrated in Figure 2, the data indicate a predominance of affirmative responses, with 300 individuals (60.2%) supporting the proposition and 198 individuals (39.8%) opposing it. The distribution reflects a clear majority in favor of the proposition.
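
As a minimal illustration of the subsetting step described in this section, the following Python sketch keeps only the first 498 records and recomputes the category shares reported for Figures 1 and 2. It is not the code used in the study: the helper names `user_collected` and `shares`, the `"Gender"` key, and the toy rows standing in for the real file are all assumptions made for illustration.

```python
from collections import Counter

N_USER_COLLECTED = 498  # from the text: the first 498 instances are user-collected


def user_collected(rows, n=N_USER_COLLECTED):
    """Keep only the first n records (the part not generated with SMOTE)."""
    return rows[:n]


def shares(values):
    """Return {category: (count, percentage rounded to one decimal place)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: (c, round(100 * c / total, 1)) for k, c in counts.items()}


# Toy stand-in for the real file (assumption): 2,111 rows, of which the
# first 498 reproduce the gender counts reported in the text.
rows = (
    [{"Gender": "Male"}] * 271
    + [{"Gender": "Female"}] * 227
    + [{"Gender": "n/a"}] * (2111 - 498)  # placeholder for the synthetic tail
)

subset = user_collected(rows)
gender_shares = shares(r["Gender"] for r in subset)
print(gender_shares)
```

With the counts quoted above, the same `shares` helper gives 54.4% and 45.6% for the 271/227 gender split and 60.2% and 39.8% for the 300/198 split, matching the reported percentages.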


            Volume 2 Issue 4 (2025)                         53                          doi: 10.36922/AIH025140027