
Artificial Intelligence in Health





                                        ORIGINAL RESEARCH ARTICLE
                                        Comparison of synthetic data generation
                                        techniques for obesity level prediction based on
                                        dietary habits and physical status



                                        Hakan Alp Eren 1, Halil İbrahim Emek 2, and Sinem Bozkurt Keser 2*
                                        1 Department of Software Engineering, Faculty of Engineering and Architecture, Eskişehir Osmangazi
                                        University, Eskişehir, Türkiye
                                        2 Department of Computer Engineering, Faculty of Engineering and  Architecture, Eskişehir
                                        Osmangazi University, Eskişehir, Türkiye



                                        Abstract

                                        In the contemporary context of the obesity epidemic and its associated comorbidities,
                                        early detection of individuals at risk is critical. Artificial intelligence and machine
                                        learning techniques offer substantial potential for automating obesity risk
                                        assessment, enabling early diagnosis and intervention. However, the development
                                        of robust predictive models is often hampered by limited or imbalanced datasets.
                                        Synthetic data generation has emerged as a key solution, allowing the expansion
                                        and balancing of data while preserving privacy. Recent surveys highlight that the
                                        synthetic minority oversampling technique (SMOTE) is a leading method for data
                                        generation in obesity detection. In line with this, our study analyzed the Estimation
                                        of Obesity Levels dataset from the University of California, Irvine repository, which
                                        focuses on dietary habits and physical condition and suffers from class imbalance.
                                        We compared three synthetic data generation approaches: SMOTE for nominal and
                                        continuous features (SMOTE-NC), variational autoencoders, and conditional tabular
                                        generative adversarial network. We trained multiple classifiers on the generated
                                        datasets and evaluated their performance. Classifiers trained on data including
                                        height and weight (i.e., body mass index [BMI]-related features) achieved F1-scores
                                        of up to 98.16%, as expected due to the direct role of BMI in obesity classification.
                                        Crucially, models trained without height and weight still achieved an F1-score of
                                        74.48% when synthetic augmentation was used, demonstrating that useful obesity
                                        prediction models can be developed even in the absence of explicit anthropometric
                                        measures. These results indicate that synthetic data can enable accurate
                                        classification when key features are missing or when data are scarce.

                                        Keywords: Obesity; Synthetic data; Tabular data; Data augmentation; Machine learning;
                                        Class imbalance

                                        *Corresponding author: Sinem Bozkurt Keser (sbozkurt@ogu.edu.tr)

                                        Citation: Eren HA, Emek Hİ, Keser SB. Comparison of synthetic data generation
                                        techniques for obesity level prediction based on dietary habits and physical status.
                                        Artif Intell Health. 2025;2(4):47-74. doi: 10.36922/AIH025140027

                                        Received: April 1, 2025
                                        Revised: June 2, 2025
                                        Accepted: June 10, 2025
                                        Published online: June 25, 2025

                                        Copyright: © 2025 Author(s). This is an Open-Access article distributed under the
                                        terms of the Creative Commons Attribution License, permitting distribution and
                                        reproduction in any medium, provided the original work is properly cited.

                                        Publisher’s Note: AccScience Publishing remains neutral with regard to jurisdictional
                                        claims in published maps and institutional affiliations.

                                        1. Introduction

                                        According to the World Health Organization, obesity is defined as the accumulation
                                        of fat in the body to an extent that impairs health. The rise in obesity rates has become
                                        a growing concern not only in high-income countries but also in middle- and
                                        low-income nations.¹ The increasing prevalence of obesity across all age groups is linked to

            Volume 2 Issue 4 (2025)                         47                          doi: 10.36922/AIH025140027