Pubmed:
Clinical Failure of General-Purpose AI in Photographic Scoliosis Assessment: A Diagnostic Accuracy Study

dc.contributor.author: Aydin, C.
dc.contributor.author: Duygu, O.B.
dc.contributor.author: Karakas, A.B.
dc.contributor.author: Er, E.
dc.contributor.author: Gokmen, G.
dc.contributor.author: Ozturk, A.M.
dc.contributor.author: Govsa, F.
dc.date.accessioned: 2025-08-29T06:42:28Z
dc.date.issued: 2025
dc.description.abstract: General-purpose multimodal large language models (LLMs) are increasingly used for medical image interpretation despite lacking clinical validation. This study evaluates the diagnostic reliability of ChatGPT-4o and Claude 2 in photographic assessment of adolescent idiopathic scoliosis (AIS) against radiological standards. It examines two critical questions: whether families can derive reliable preliminary assessments from LLMs through analysis of clinical photographs, and whether LLMs exhibit cognitive fidelity in their visuospatial reasoning for AIS assessment. A prospective diagnostic accuracy study (STARD-compliant) analyzed 97 adolescents (74 with AIS and 23 with postural asymmetry). Standardized clinical photographs (nine views per patient) were assessed by two LLMs and two orthopedic residents against reference radiological measurements. Primary outcomes included diagnostic accuracy (sensitivity/specificity), Cobb angle concordance (Lin's CCC), inter-rater reliability (Cohen's κ), and measurement agreement (Bland-Altman LoA). The LLMs exhibited hazardous diagnostic inaccuracy: ChatGPT misclassified all non-AIS cases (specificity 0% [95% CI: 0.0-14.8]), while Claude 2 generated 78.3% false positives. Systematic measurement errors exceeded clinical tolerance: ChatGPT overestimated thoracic curves by +10.74° (LoA: -21.45° to +42.92°), surpassing tolerance by >800%. Both LLMs showed inverse biomechanical concordance in thoracolumbar curves (CCC ≤ -0.106). Inter-rater reliability fell below random chance (ChatGPT κ = -0.039). Universal proportional bias (slopes ≈ -1.0) caused severe curve underestimation (e.g., 10-15° error for 50° deformities). Human evaluators demonstrated superior bias control (0.3-2.8° vs. 2.6-10.7°) but suboptimal specificity (21.7-26.1%) and hazardous lumbar concordance (CCC: -0.123). General-purpose LLMs demonstrate clinically unacceptable inaccuracy in photographic AIS assessment, contraindicating clinical deployment. Catastrophic false positives, systematic measurement errors exceeding tolerance by 480-1074%, and inverse diagnostic concordance necessitate urgent regulatory safeguards under frameworks such as the EU AI Act. Neither LLMs nor photographic human assessment achieve reliability thresholds for standalone screening, mandating domain-specific algorithm development and integration of 3D modalities.
dc.identifier.doi: 10.3390/medicina61081342
dc.identifier.issue: 8
dc.identifier.pubmed: 40870387
dc.identifier.uri: https://hdl.handle.net/20.500.12597/34893
dc.identifier.volume: 61
dc.language.iso: en
dc.rights: info:eu-repo/semantics/openAccess
dc.subject: adolescent
dc.subject: artificial intelligence
dc.subject: clinical competence
dc.subject: diagnostic errors
dc.subject: neural networks
dc.subject: photography
dc.subject: scoliosis
dc.title: Clinical Failure of General-Purpose AI in Photographic Scoliosis Assessment: A Diagnostic Accuracy Study
dc.type: Article
dspace.entity.type: Pubmed
person.identifier.orcid: 0000-0003-4169-7919
person.identifier.orcid: 0000-0001-7140-7340
person.identifier.orcid: 0000-0001-6504-6489
person.identifier.orcid: 0000-0001-6366-3301
person.identifier.orcid: 0009-0006-3510-8099
person.identifier.orcid: 0000-0001-8674-8877
person.identifier.orcid: 0000-0001-9635-6308
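
Note: The agreement statistics named in the abstract (Lin's CCC, Bland-Altman bias and limits of agreement, Cohen's κ) can be illustrated with the short Python sketch below. This is not the authors' analysis code; it uses synthetic Cobb-angle data and standard textbook formulas, and all variable names are illustrative assumptions.

# Illustrative sketch only (assumption: not the study's analysis pipeline).
# Synthetic data stand in for radiographic Cobb angles and photographic model estimates.
import numpy as np

rng = np.random.default_rng(42)
cobb_xray = rng.uniform(10, 60, size=30)               # reference (radiological) Cobb angles, degrees
cobb_model = cobb_xray + rng.normal(10, 15, size=30)   # hypothetical photograph-based estimates

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient: agreement with the line of identity."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

def bland_altman(x, y):
    """Mean bias and 95% limits of agreement (bias +/- 1.96 SD of the differences)."""
    d = y - x
    bias, sd = d.mean(), d.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def cohens_kappa(a, b):
    """Cohen's kappa for two binary raters (e.g., AIS vs. postural asymmetry calls)."""
    a, b = np.asarray(a), np.asarray(b)
    po = (a == b).mean()                                            # observed agreement
    pe = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())      # chance agreement
    return (po - pe) / (1 - pe)

bias, lo, hi = bland_altman(cobb_xray, cobb_model)
print(f"CCC   = {lins_ccc(cobb_xray, cobb_model):.3f}")
print(f"bias  = {bias:.1f} deg, LoA = [{lo:.1f}, {hi:.1f}] deg")
print(f"kappa = {cohens_kappa(cobb_xray > 20, cobb_model > 20):.3f}")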
