PubMed: Clinical Failure of General-Purpose AI in Photographic Scoliosis Assessment: A Diagnostic Accuracy Study
| dc.contributor.author | Aydin, C. | |
| dc.contributor.author | Duygu, O.B. | |
| dc.contributor.author | Karakas, A.B. | |
| dc.contributor.author | Er, E. | |
| dc.contributor.author | Gokmen, G. | |
| dc.contributor.author | Ozturk, A.M. | |
| dc.contributor.author | Govsa, F. | |
| dc.date.accessioned | 2025-08-29T06:42:28Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | General-purpose multimodal large language models (LLMs) are increasingly used for medical image interpretation despite lacking clinical validation. This study evaluates the diagnostic reliability of ChatGPT-4o and Claude 2 in photographic assessment of adolescent idiopathic scoliosis (AIS) against radiological standards. It examines two critical questions: whether families can derive reliable preliminary assessments from LLMs through analysis of clinical photographs, and whether LLMs exhibit cognitive fidelity in their visuospatial reasoning for AIS assessment. A prospective diagnostic accuracy study (STARD-compliant) analyzed 97 adolescents (74 with AIS and 23 with postural asymmetry). Standardized clinical photographs (nine views/patient) were assessed by two LLMs and two orthopedic residents against reference radiological measurements. Primary outcomes included diagnostic accuracy (sensitivity/specificity), Cobb angle concordance (Lin's CCC), inter-rater reliability (Cohen's κ), and measurement agreement (Bland-Altman LoA). The LLMs exhibited hazardous diagnostic inaccuracy: ChatGPT misclassified all non-AIS cases (specificity 0% [95% CI: 0.0-14.8]), while Claude 2 produced a 78.3% false-positive rate. Systematic measurement errors exceeded clinical tolerance: ChatGPT overestimated thoracic curves by +10.74° (LoA: -21.45° to +42.92°), exceeding tolerance by >800%. Both LLMs showed inverse biomechanical concordance in thoracolumbar curves (CCC ≤ -0.106). Inter-rater reliability fell below random chance (ChatGPT κ = -0.039). Universal proportional bias (slopes ≈ -1.0) caused severe curve underestimation (e.g., 10-15° error for 50° deformities). Human evaluators demonstrated superior bias control (0.3-2.8° vs. 2.6-10.7°) but suboptimal specificity (21.7-26.1%) and hazardous lumbar concordance (CCC: -0.123). General-purpose LLMs demonstrate clinically unacceptable inaccuracy in photographic AIS assessment, contraindicating clinical deployment. Catastrophic false positives, systematic measurement errors exceeding tolerance by 480-1074%, and inverse diagnostic concordance necessitate urgent regulatory safeguards under frameworks like the EU AI Act. Neither LLMs nor photographic human assessment achieves the reliability thresholds for standalone screening, mandating domain-specific algorithm development and integration of 3D modalities. | |
| dc.identifier.doi | 10.3390/medicina61081342 | |
| dc.identifier.issue | 8 | |
| dc.identifier.pubmed | 40870387 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.12597/34893 | |
| dc.identifier.volume | 61 | |
| dc.language.iso | en | |
| dc.rights | info:eu-repo/semantics/openAccess | |
| dc.subject | adolescent | |
| dc.subject | artificial intelligence | |
| dc.subject | clinical competence | |
| dc.subject | diagnostic errors | |
| dc.subject | neural networks | |
| dc.subject | photography | |
| dc.subject | scoliosis | |
| dc.title | Clinical Failure of General-Purpose AI in Photographic Scoliosis Assessment: A Diagnostic Accuracy Study | |
| dc.type | Article | |
| dspace.entity.type | Pubmed | |
| person.identifier.orcid | 0000-0003-4169-7919 | |
| person.identifier.orcid | 0000-0001-7140-7340 | |
| person.identifier.orcid | 0000-0001-6504-6489 | |
| person.identifier.orcid | 0000-0001-6366-3301 | |
| person.identifier.orcid | 0009-0006-3510-8099 | |
| person.identifier.orcid | 0000-0001-8674-8877 | |
| person.identifier.orcid | 0000-0001-9635-6308 | |
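The abstract's agreement statistics can be made concrete with a short sketch. This is not the study's code: it is a minimal illustration of how Lin's concordance correlation coefficient (CCC) and Bland-Altman limits of agreement (LoA) are computed for paired Cobb-angle measurements, using synthetic numbers rather than the study's data.

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's CCC: agreement of paired measurements with the identity line.
    1.0 = perfect agreement; values <= 0 indicate no (or inverse) concordance,
    as reported for the thoracolumbar curves in the abstract."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()       # population covariance
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

def bland_altman_loa(x, y):
    """Mean difference (bias) and 95% limits of agreement (bias +/- 1.96 SD)."""
    d = np.asarray(x, float) - np.asarray(y, float)
    bias, sd = d.mean(), d.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Synthetic example: a rater with a constant +10 degree overestimation.
reference = [20.0, 35.0, 50.0, 42.0]
rated = [r + 10.0 for r in reference]
print(lins_ccc(reference, rated))            # well below 1 despite r = 1
print(bland_altman_loa(rated, reference))    # bias of +10 degrees
```

A rater can correlate perfectly with the reference yet agree poorly, which is why the study reports CCC and LoA rather than plain correlation; a negative CCC, as the abstract reports for thoracolumbar curves, means the ratings move against the reference.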
