Research Article: SHAP-explained machine-learning model for high-risk gastric cancer identification
Abstract:
Gastric cancer (GC) remains a major public health concern in Asia. Risk prediction tailored to regional biological features such as Helicobacter pylori (H. pylori) status and high-risk mucosal findings such as atrophic gastritis (AG) and intestinal metaplasia (IM) may help improve the screening workflow.
Using a large, real-world, nationwide screening cohort with available endoscopic AG/IM codes, we developed 2-year GC risk prediction models that integrate AG/IM with regional demographic and lifestyle factors. We compared a conventional Cox proportional hazards model (CPHM) with the following machine learning (ML) approaches: extreme gradient boosting (XGBoost), decision tree (DT), and logistic regression (LR). Discrimination and calibration were evaluated in internal and external validation. Model interpretability was assessed using SHapley Additive exPlanations (SHAP).
The XGBoost model demonstrated the best overall performance, achieving an AUROC of 0.764 (95% CI, 0.755–0.767), a sensitivity of 0.607 (95% CI, 0.560–0.650), and a specificity of 0.746 (95% CI, 0.744–0.750) in the internal validation. In the external validation cohort, XGBoost also showed the highest discrimination, with an AUROC of 0.708 (95% CI, 0.682–0.884), a sensitivity of 0.666 (95% CI, 0.470–0.830), and a specificity of 0.597 (95% CI, 0.590–0.600). SHAP analysis consistently identified H. pylori infection, age, sex, smoking, and AG/IM as the major contributors to increased predicted GC risk.
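For readers less familiar with the metrics reported above, the following is a minimal sketch of how AUROC, sensitivity, and specificity with percentile bootstrap confidence intervals might be computed from predicted risks. It is not the authors' code: the toy labels and scores, the 0.45 threshold, and the helper names (`auroc`, `sens_spec`, `bootstrap_ci`) are all illustrative assumptions, and the abstract does not state which CI method was actually used.

```python
import random


def auroc(y_true, y_score):
    """AUROC as the probability that a random case scores higher than a
    random non-case (ties count as 0.5). Pairwise form, fine for small data."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(y_true, y_score, threshold):
    """Sensitivity and specificity when score >= threshold is called high risk."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def bootstrap_ci(y_true, y_score, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for metric(y_true, y_score)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if len(set(yt)) < 2:
            continue  # skip resamples that contain only one class
        ys = [y_score[i] for i in idx]
        stats.append(metric(yt, ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data (made up for illustration; 1 = GC within follow-up, 0 = no GC).
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.10, 0.20, 0.30, 0.35, 0.40, 0.50, 0.60, 0.70, 0.15, 0.90]

print("AUROC:", round(auroc(y_true, y_score), 3))
print("sensitivity, specificity:", sens_spec(y_true, y_score, threshold=0.45))
print("AUROC 95% CI:", bootstrap_ci(y_true, y_score, auroc, n_boot=500))
```

In a cohort of realistic size, a rank-based AUROC implementation would replace the pairwise loop, and the decision threshold would be fixed on the development data before being applied to the external cohort.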
This externally validated and interpretable short-term GC risk model incorporating endoscopically ascertained AG/IM could provide a practical approach for informing risk-adapted screening workflows. The model could help identify individuals at a higher predicted risk for prospective evaluation and closer clinical review. In addition, SHAP clarifies the main contributors to each prediction by highlighting factors most strongly associated with a higher predicted risk.
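To illustrate what a per-prediction attribution means, the sketch below computes exact Shapley values for a single individual by enumerating every feature coalition. It is a toy under stated assumptions, not the study's pipeline: the logistic model, its weights, the feature encodings, and the single reference individual used as the baseline are all hypothetical. The SHAP library typically averages over a background dataset and uses faster tree-specific algorithms for models such as XGBoost.

```python
from itertools import combinations
from math import exp, factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are set to their baseline value, so the
    attributions sum to model(x) - model(baseline)."""
    n = len(x)

    def value(coalition):
        z = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical toy model: names, weights, and intercept are invented for
# illustration and are NOT estimates from the paper.
FEATURES = ["h_pylori", "age_decades", "male", "smoker", "ag_im"]
WEIGHTS = [0.9, 0.4, 0.5, 0.3, 1.1]
INTERCEPT = -6.0


def log_odds(z):
    return INTERCEPT + sum(w * v for w, v in zip(WEIGHTS, z))


def risk(z):
    return 1.0 / (1.0 + exp(-log_odds(z)))


x = [1, 6.5, 1, 1, 1]         # individual being explained
baseline = [0, 5.0, 0, 0, 0]  # assumed reference individual

phi = shapley_values(risk, x, baseline)
for name, contribution in sorted(zip(FEATURES, phi), key=lambda t: -t[1]):
    print(f"{name:12s} {contribution:+.4f}")
print("sum of contributions:", round(sum(phi), 6))
print("risk(x) - risk(baseline):", round(risk(x) - risk(baseline), 6))
```

The last two printed values agree, which is the additivity property that lets the contributions be read as a decomposition of one individual's predicted risk. Enumeration costs grow exponentially with the number of features, so this form is practical only for small models.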