Research Article: Performance evaluation of mainstream large language models in autoimmune hepatitis patient education: a comparative study of readability, quality, and reliability

Date Published: 2026-03-20

Abstract:
Autoimmune hepatitis (AIH) is a chronic immune-mediated liver disease that requires long-term management, in which effective patient education plays a critical role. With the rapid development of large language models (LLMs), AI-generated health information is increasingly accessed by patients; however, the readability, quality, and educational suitability of LLM-generated AIH-related content remain insufficiently evaluated. Five widely used LLMs (ChatGPT, Doubao, DeepSeek, Wenxin Yiyan, and Tongyi Qianwen) were assessed based on their responses to 20 frequently asked AIH patient education questions covering five thematic categories. Text readability was evaluated using multiple indices, including the Automated Readability Index, Flesch Reading Ease Score, Gunning Fog Index, Flesch–Kincaid Grade Level, Coleman–Liau Index, SMOG, and Linsear Write formula. Information quality and educational suitability were assessed using the Global Quality Score (GQS) and the Chinese version of the Patient Education Materials Assessment Tool (C-PEMAT). Clinical Intent Alignment (CIA) was used to evaluate the coverage of guideline-defined medical key points based on the 2025 EASL Clinical Practice Guidelines. Inter-rater reliability was analyzed using Cohen’s kappa, and comparative and correlation analyses were performed. Significant differences were observed among the LLMs in readability, information quality, and educational suitability (all p < 0.05). ChatGPT achieved the highest GQS and C-PEMAT scores, followed by Doubao and DeepSeek, whereas Wenxin Yiyan and Tongyi Qianwen showed lower performance and greater variability. CIA analysis indicated comparable coverage of guideline-defined clinical intent across models. Readability varied significantly across content themes, with texts related to disease mechanisms and diagnostic processes exhibiting higher linguistic complexity.
Correlation analysis demonstrated moderate associations between GQS and grade-level readability indices, whereas C-PEMAT and CIA showed weak correlations with traditional readability metrics. Substantial variability exists among LLMs in generating AIH patient education materials. Model selection critically influences information quality and educational suitability, whereas content theme primarily affects linguistic complexity. Although most models produced moderate-to-good quality information, relatively high readability levels suggest that further simplification may be needed for general patient populations. A multidimensional evaluation framework integrating readability, quality, educational suitability, and clinical intent alignment is essential for the responsible use of LLMs in AIH patient education.
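
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of three of the readability indices (Flesch Reading Ease, Flesch–Kincaid Grade Level, Automated Readability Index) and of Cohen's kappa, using their standard published formulas. It is not the authors' code: the abstract does not say which software was used, the function names are hypothetical, and the word, sentence, syllable, and character counts are assumed to be supplied by the caller. The numbers in the usage lines are made up for illustration only.

```python
from collections import Counter


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def automated_readability_index(characters, words, sentences):
    """Automated Readability Index, based on character and word counts."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal score frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Illustrative counts only (not data from the study).
print(flesch_reading_ease(words=100, sentences=5, syllables=150))
print(flesch_kincaid_grade(words=100, sentences=5, syllables=150))
print(automated_readability_index(characters=500, words=100, sentences=5))
print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # prints 0.5
```

The grade-level indices rise with longer sentences and longer words, while Flesch Reading Ease falls, which is why the abstract treats "relatively high readability levels" as a sign that text needs simplification.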

Introduction:
Autoimmune hepatitis (AIH) is a chronic immune-mediated liver disease that requires long-term management, in which effective patient education plays a critical role. With the rapid development of large language models (LLMs), AI-generated health information is increasingly accessed by patients; however, the readability, quality, and educational suitability of LLM-generated AIH-related content remain insufficiently evaluated.
