Pubmed:
Clinical Failure of General-Purpose AI in Photographic Scoliosis Assessment: A Diagnostic Accuracy Study

dc.contributor.author: Aydin, C.
dc.contributor.author: Duygu, O.B.
dc.contributor.author: Karakas, A.B.
dc.contributor.author: Er, E.
dc.contributor.author: Gokmen, G.
dc.contributor.author: Ozturk, A.M.
dc.contributor.author: Govsa, F.
dc.date.accessioned: 2025-08-29T06:42:28Z
dc.date.issued: 2025
dc.description.abstract: General-purpose multimodal large language models (LLMs) are increasingly used for medical image interpretation despite lacking clinical validation. This study evaluates the diagnostic reliability of ChatGPT-4o and Claude 2 in photographic assessment of adolescent idiopathic scoliosis (AIS) against radiological standards. It examines two critical questions: whether families can derive reliable preliminary assessments from LLMs through analysis of clinical photographs, and whether LLMs exhibit cognitive fidelity in their visuospatial reasoning for AIS assessment. A prospective diagnostic accuracy study (STARD-compliant) analyzed 97 adolescents (74 with AIS and 23 with postural asymmetry). Standardized clinical photographs (nine views per patient) were assessed by two LLMs and two orthopedic residents against reference radiological measurements. Primary outcomes included diagnostic accuracy (sensitivity/specificity), Cobb angle concordance (Lin's CCC), inter-rater reliability (Cohen's κ), and measurement agreement (Bland-Altman LoA). The LLMs exhibited hazardous diagnostic inaccuracy: ChatGPT misclassified all non-AIS cases (specificity 0% [95% CI: 0.0-14.8]), while Claude 2 generated 78.3% false positives. Systematic measurement errors exceeded clinical tolerance: ChatGPT overestimated thoracic curves by +10.74° (LoA: -21.45° to +42.92°), surpassing tolerance by >800%. Both LLMs showed inverse biomechanical concordance in thoracolumbar curves (CCC ≤ -0.106). Inter-rater reliability fell below random chance (ChatGPT κ = -0.039). Universal proportional bias (slopes ≈ -1.0) caused severe curve underestimation (e.g., 10-15° error for 50° deformities). Human evaluators demonstrated superior bias control (0.3-2.8° vs. 2.6-10.7°) but suboptimal specificity (21.7-26.1%) and hazardous lumbar concordance (CCC: -0.123). General-purpose LLMs demonstrate clinically unacceptable inaccuracy in photographic AIS assessment, contraindicating clinical deployment. Catastrophic false positives, systematic measurement errors exceeding tolerance by 480-1074%, and inverse diagnostic concordance necessitate urgent regulatory safeguards under frameworks such as the EU AI Act. Neither LLMs nor photographic human assessment achieve reliability thresholds for standalone screening, mandating domain-specific algorithm development and integration of 3D modalities.
dc.identifier.doi: 10.3390/medicina61081342
dc.identifier.issue: 8
dc.identifier.pubmed: 40870387
dc.identifier.uri: https://hdl.handle.net/20.500.12597/34893
dc.identifier.volume: 61
dc.language.iso: en
dc.rights: info:eu-repo/semantics/openAccess
dc.subject: adolescent
dc.subject: artificial intelligence
dc.subject: clinical competence
dc.subject: diagnostic errors
dc.subject: neural networks
dc.subject: photography
dc.subject: scoliosis
dc.title: Clinical Failure of General-Purpose AI in Photographic Scoliosis Assessment: A Diagnostic Accuracy Study
dc.type: Article
dspace.entity.type: Pubmed
person.identifier.orcid: 0000-0003-4169-7919
person.identifier.orcid: 0000-0001-7140-7340
person.identifier.orcid: 0000-0001-6504-6489
person.identifier.orcid: 0000-0001-6366-3301
person.identifier.orcid: 0009-0006-3510-8099
person.identifier.orcid: 0000-0001-8674-8877
person.identifier.orcid: 0000-0001-9635-6308
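
Note: The agreement statistics named in the abstract (Lin's CCC, Bland-Altman bias and limits of agreement, Cohen's κ) can be illustrated with the short Python sketch below. This is not the authors' analysis code; it uses synthetic Cobb-angle data and standard textbook formulas, and all variable names are illustrative assumptions.

# Illustrative sketch only (assumption: not the study's analysis pipeline).
# Synthetic data stand in for radiographic Cobb angles and photographic model estimates.
import numpy as np

rng = np.random.default_rng(42)
cobb_xray = rng.uniform(10, 60, size=30)               # reference (radiological) Cobb angles, degrees
cobb_model = cobb_xray + rng.normal(10, 15, size=30)   # hypothetical photograph-based estimates

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient: agreement with the line of identity."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

def bland_altman(x, y):
    """Mean bias and 95% limits of agreement (bias +/- 1.96 SD of the differences)."""
    d = y - x
    bias, sd = d.mean(), d.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def cohens_kappa(a, b):
    """Cohen's kappa for two binary raters (e.g., AIS vs. postural asymmetry calls)."""
    a, b = np.asarray(a), np.asarray(b)
    po = (a == b).mean()                                            # observed agreement
    pe = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())      # chance agreement
    return (po - pe) / (1 - pe)

bias, lo, hi = bland_altman(cobb_xray, cobb_model)
print(f"CCC   = {lins_ccc(cobb_xray, cobb_model):.3f}")
print(f"bias  = {bias:.1f} deg, LoA = [{lo:.1f}, {hi:.1f}] deg")
print(f"kappa = {cohens_kappa(cobb_xray > 20, cobb_model > 20):.3f}")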
