Background: Papilledema, an ophthalmic finding associated with increased intracranial pressure, can be associated with dermatological medications such as corticosteroids, isotretinoin, and tetracyclines. Early detection is critical to prevent irreversible optic nerve damage, but access to ophthalmologic expertise is often limited in rural settings. Artificial intelligence (AI) may enable automated, accurate papilledema detection from fundus images, supporting timely diagnosis and management.
Objective: To evaluate and compare the diagnostic performance of two AI models, a ResNet-based convolutional neural network (CNN) and GPT-4, against human ophthalmologists in identifying papilledema from fundus photographs, with a focus on applications relevant to dermatological care in resource-limited environments.
Methods: A dataset of 1,389 fundus images (295 papilledema, 295 pseudo-papilledema, 799 normal) was preprocessed and split into training and test partitions, with a held-out test set of 198 images (99 papilledema, 99 normal). The ResNet model was fine-tuned using discriminative learning rates and a one-cycle learning rate policy. Both AI models and two human evaluators (a senior ophthalmologist and an ophthalmology resident) independently assessed the test images. Diagnostic metrics, including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and Cohen’s Kappa, were calculated for each evaluator.
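The abstract does not specify an implementation; a minimal PyTorch sketch of the described fine-tuning setup (the library, the resnet50 variant, the learning-rate values, and the schedule length are illustrative assumptions, not taken from the study) might look like:

```python
from torch import nn, optim
from torchvision import models

# Pretrained ResNet backbone with a binary head (papilledema vs. normal).
# resnet50 is an illustrative choice; the abstract only says "ResNet-based".
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)

# Discriminative learning rates: smaller steps for the early pretrained
# blocks, larger steps for the newly initialized classification head.
optimizer = optim.AdamW([
    {"params": model.layer1.parameters(), "lr": 1e-5},
    {"params": model.layer2.parameters(), "lr": 1e-5},
    {"params": model.layer3.parameters(), "lr": 1e-4},
    {"params": model.layer4.parameters(), "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-3},
])

# One-cycle policy: each group's learning rate ramps up to its peak and then
# anneals over training; scheduler.step() is called once per batch.
epochs, steps_per_epoch = 10, 100  # placeholders; set from the real data loader
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=[1e-5, 1e-5, 1e-4, 1e-4, 1e-3],  # one peak value per parameter group
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
)
```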
Results: Both human evaluators demonstrated high sensitivity for papilledema detection (96.91%–99.00%). The ResNet AI model achieved the highest specificity (100%), PPV (100%), NPV (98.99%), and overall accuracy (99.49%), outperforming both human evaluators (accuracy: 95.96%) and GPT-4 (accuracy: 85.86%, specificity: 78.86%, PPV: 73.74%). Cohen’s Kappa indicated excellent agreement for the ResNet model (0.9899) and the human experts (0.9192), with substantial agreement for GPT-4 (0.71).
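For reference, all of the reported metrics follow directly from a 2×2 confusion matrix; the short sketch below shows the calculations (the counts are illustrative, chosen only to approximate the ResNet figures, and are not the study's actual confusion matrix):

```python
# Illustrative counts on a 198-image test set (99 papilledema, 99 normal);
# not the study's actual confusion matrix.
tp, fn, fp, tn = 98, 1, 0, 99

sensitivity = tp / (tp + fn)                  # true positive rate
specificity = tn / (tn + fp)                  # true negative rate
ppv = tp / (tp + fp)                          # positive predictive value
npv = tn / (tn + fn)                          # negative predictive value
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Cohen's kappa: observed agreement corrected for chance agreement.
n = tp + tn + fp + fn
p_observed = (tp + tn) / n
p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
kappa = (p_observed - p_chance) / (1 - p_chance)
```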
Conclusions: These results highlight the potential of AI-assisted detection of papilledema in underserved settings. However, they also underscore the need for validation on external datasets and in real-world clinical environments before such tools can be broadly implemented.