suicidal from non-suicidal self-harm, the other critiquing a speech-based suicide risk detection system – were rejected not for inaccuracies in my evaluation, but for being “overly technical” or “lacking clinical relevance.”11-31 In one case, editorial processes allowed the original authors to pre-clear critiques, undermining the independence of peer review and suppressing substantive methodological discussion.12

These cases are not outliers. They reflect a deeper, systemic issue in how interdisciplinary research is handled in clinical publishing. Through these case studies, this perspective contributes to the ongoing discourse on peer review integrity by identifying structural editorial failures, analyzing their ethical and scientific implications, and proposing reforms to align publishing practices with the technical demands of AI-integrated mental health research.
2. The challenge of evaluating AI in clinical publishing

While transformative, AI and ML methods are not immune to significant flaws.10,13,14 Unlike conventional clinical research methods (e.g., randomized controlled trials, cohort studies, case-control studies, cross-sectional studies, case reports, and systematic reviews), AI-driven studies and studies using AI methods demand a nuanced understanding of data science principles, algorithmic transparency, model generalizability, and ethical implications. Peer reviewers and editors in clinical journals, who may not be versed in the complexities of computational models, can unintentionally overlook or misinterpret issues that would be immediately evident to AI specialists.13
3. Case study 1: Methodological limitations of Haghish (2025)

This challenge was starkly evident when I submitted correspondence to a high-impact psychiatry journal regarding a 2025 study by Haghish, titled “Differentiating Adolescent Suicidal and Nonsuicidal Self-Harm with Artificial Intelligence.”11 My critique focused on several key methodological concerns, including class imbalance, model interpretability, and generalizability, all essential for validating that AI models are both scientifically sound and clinically applicable.15-29
Class imbalance is a pervasive problem in supervised ML, especially in sensitive domains such as adolescent self-harm, where suicide attempts constitute a small minority of the dataset.18,22,23,27,28 While Haghish employed synthetic oversampling techniques (specifically the synthetic minority oversampling technique, SMOTE),18 these methods – although well-intentioned – carry inherent risks. Oversampling can inflate minority class representation artificially, leading to overfitting on synthetic samples that do not adequately represent real-world variation.18,23 This undermines model robustness and compromises generalizability across unseen populations and clinical settings. In my correspondence, I wrote:

Class imbalance remains one of the most significant challenges in supervised machine learning, particularly in domains such as adolescent self-harm, where suicide attempts represent a small portion of the dataset. The synthetic oversampling techniques employed, while well-intentioned, may risk overfitting and undermine generalizability.
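To make the concern concrete, the sketch below shows the kind of safeguard such a critique points to: SMOTE applied only to the training fold, with evaluation reserved for untouched real observations so that performance is not inflated by memorized synthetic samples. This is a minimal illustration, not the original study’s pipeline; the dataset, classifier, and parameters are placeholder assumptions.

```python
# Illustrative sketch only: SMOTE applied to the training fold alone,
# so evaluation uses real (non-synthetic) held-out samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for an imbalanced clinical dataset (~5% positive class).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)

# Split first; the held-out set must contain only real observations.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Oversample the minority class in the training data only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = RandomForestClassifier(random_state=42).fit(X_res, y_res)
print(classification_report(y_test, model.predict(X_test)))
```

Reporting metrics on a fully real held-out set (or, better, an external cohort) is what separates a robust performance estimate from one partly driven by synthetic points.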
The clinical adoption of AI models hinges on transparent decision-making processes that clinicians can understand and trust. The original study lacked sufficient interpretability measures to explain how the model attributed importance to various features. I proposed integrating SHAP (SHapley Additive exPlanations) values to provide fine-grained, interpretable insights into feature contributions. SHAP values allow clinicians to see which factors most influenced the model’s predictions in individual cases, facilitating informed clinical judgment and improving acceptance in high-stakes settings.16,17 Specifically, I noted:

Integrating SHAP values could enhance the transparency of the model’s feature attribution, making the system more interpretable to clinicians and better suited for high-stakes environments.19
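The following sketch illustrates the per-case explanations this proposal refers to, assuming a tree-based classifier; the data, feature count, and model are illustrative stand-ins rather than details of the original study.

```python
# Illustrative sketch only: per-case SHAP explanations for a tree-based model.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder data standing in for adolescent clinical/behavioral features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeExplainer computes exact Shapley values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Contribution of each feature to one individual's prediction (case 0):
# positive values push toward the positive class, negative values away.
print(shap_values[0])

# Global summary of feature importance (optional, requires matplotlib):
# shap.summary_plot(shap_values, X)
```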
Adolescents’ behavioral and clinical profiles vary widely across different populations and healthcare contexts. The study’s model was trained on a relatively homogeneous sample, limiting its applicability elsewhere. I suggested employing transfer learning techniques, which allow models to leverage knowledge from related datasets or tasks to improve performance on new, diverse cohorts.24,26 Transfer learning offers a path to improve model adaptability and external validity, a key requirement for any AI tool intended for broad clinical use:

Transfer learning could offer a viable path to improve generalizability, particularly across diverse clinical settings or populations not represented in the original training data.
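A minimal sketch of the transfer-learning idea follows, assuming a neural-network risk classifier pretrained on a large source cohort and adapted to a smaller target cohort by freezing the shared encoder and fine-tuning only the output head; all names, dimensions, and the weight file are hypothetical.

```python
# Illustrative sketch only: adapt a model pretrained on a source cohort
# to a new target cohort, freezing the shared feature extractor.
import torch
import torch.nn as nn

class RiskClassifier(nn.Module):
    def __init__(self, n_features: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                     nn.Linear(64, 32), nn.ReLU())
        self.head = nn.Linear(32, 2)  # suicidal vs. non-suicidal self-harm

    def forward(self, x):
        return self.head(self.encoder(x))

model = RiskClassifier()
# model.load_state_dict(torch.load("source_cohort_weights.pt"))  # hypothetical pretrained weights

# Freeze the encoder learned on the source cohort; adapt only the head.
for p in model.encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder target-cohort batch (small, locally collected sample).
x_target = torch.randn(64, 32)
y_target = torch.randint(0, 2, (64,))
for _ in range(10):  # brief fine-tuning loop on the new cohort
    optimizer.zero_grad()
    loss = loss_fn(model(x_target), y_target)
    loss.backward()
    optimizer.step()
```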
The letter was ultimately rejected, with editorial feedback stating that these methodological concerns were “outside the journal’s thematic scope.”13,14 While editorial discretion is understandable, this dismissal raises deeper issues about how clinical journals vet AI-driven research. By sidelining fundamental questions about model rigor and applicability, the editorial board risks perpetuating the publication of AI studies that lack sufficient scientific and

