Artificial Intelligence in Health

ORIGINAL RESEARCH ARTICLE
Comparison of synthetic data generation techniques for obesity level prediction based on
dietary habits and physical status

Hakan Alp Eren¹, Halil İbrahim Emek², and Sinem Bozkurt Keser²*
¹ Department of Software Engineering, Faculty of Engineering and Architecture, Eskişehir Osmangazi University, Eskişehir, Türkiye
² Department of Computer Engineering, Faculty of Engineering and Architecture, Eskişehir Osmangazi University, Eskişehir, Türkiye

Abstract

In the contemporary context of the obesity epidemic and its associated comorbidities,
early detection of individuals at risk is critical. Artificial intelligence and machine
learning techniques offer substantial potential for automating obesity risk
assessment, enabling early diagnosis and intervention. However, the development
of robust predictive models is often hampered by limited or imbalanced datasets.
Synthetic data generation has emerged as a key solution, allowing the expansion
and balancing of data while preserving privacy. Recent surveys highlight that the
synthetic minority oversampling technique (SMOTE) is a leading method for data
generation in obesity detection. In line with this, our study analyzed the Estimation
of Obesity Levels dataset from the University of California, Irvine repository, which
focuses on dietary habits and physical condition and suffers from class imbalance.
We compared three synthetic data generation approaches: SMOTE for nominal and
continuous features (SMOTE-NC), variational autoencoders, and conditional tabular
generative adversarial networks. We trained multiple classifiers on the generated
datasets and evaluated their performance. Classifiers trained on data including height
and weight (i.e., body mass index [BMI]-related features) achieved F1-scores of up to
98.16%, as expected due to the direct role of BMI in obesity classification. Crucially,
models trained without height and weight still achieved an F1-score of 74.48% when
synthetic augmentation was used, demonstrating that useful obesity prediction models
can be developed even in the absence of explicit anthropometric measures. These results
indicate that synthetic data can enable accurate classification when key features are
missing or when data are scarce.

Keywords: Obesity; Synthetic data; Tabular data; Data augmentation; Machine learning; Class imbalance

*Corresponding author: Sinem Bozkurt Keser (sbozkurt@ogu.edu.tr)
Citation: Eren HA, Emek Hİ, Keser SB. Comparison of synthetic data generation techniques for obesity level prediction based on dietary habits and physical status. Artif Intell Health. 2025;2(4):47-74. doi: 10.36922/AIH025140027
Received: April 1, 2025
Revised: June 2, 2025
Accepted: June 10, 2025
Published online: June 25, 2025
Copyright: © 2025 Author(s). This is an Open-Access article distributed under the terms of the Creative Commons Attribution License, permitting distribution, and reproduction in any medium, provided the original work is properly cited.
Publisher’s Note: AccScience Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1. Introduction

According to the World Health Organization, obesity is defined as the accumulation
of fat in the body to an extent that impairs health. The rise in obesity rates has become
a growing concern not only in high-income countries but also in middle- and
low-income nations.¹ The increasing prevalence of obesity across all age groups is linked to

Volume 2 Issue 4 (2025) 47 doi: 10.36922/AIH025140027
