Original Article

Performance of generative large language models in answering questions from the Brazilian Retina and Vitreous Society certification exam

Adriano Cypriano Faneli1; Ricardo Danilo Chagas Oliveira2; Luis Filipe Nakayama1; Rodrigo Amaral Torres2; Cristina Muccioli1; Caio Vinicius Saito Regatieri1

DOI: 10.5935/0004-2749.2025-0113

ABSTRACT

PURPOSE: Natural language models and chatbots, particularly those built on OpenAI's Generative Pre-trained Transformer (GPT) architecture, have transformed human interaction with digital interfaces. The latest versions, including ChatGPT-4o, offer enhanced functionality compared with their predecessors. This study evaluates the accuracy of ChatGPT-4, ChatGPT-4o, and Claude 3.5 Sonnet in answering questions from the Brazilian Retina and Vitreous Society certification exam.
METHODS: We compiled 200 multiple-choice questions from the Brazilian Retina and Vitreous Society 2018 and 2019 exams. Questions were categorized into three domains: Anatomy and Physiology of the Retina, Retinal Pathology, and Diagnosis and Treatment. Using a standardized prompt developed according to prompt-design guidelines, we tested ChatGPT-4, ChatGPT-4o, and Claude 3.5 Sonnet, recording each model's first response as its final answer. Three retina specialists performed a qualitative analysis of the answers. Accuracy was determined by comparing responses with the official answer key. Statistical analysis was conducted using chi-square tests and Cohen's kappa.
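For readers who want to reproduce this style of evaluation, the sketch below (not part of the original study) shows how accuracy, a chi-square comparison between two models, and Cohen's kappa can be computed in Python with scipy and scikit-learn. The answer lists are hypothetical placeholders standing in for the 200 exam questions, and the model labels are illustrative only.

    from scipy.stats import chi2_contingency
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical placeholder data: the official answer key and two models'
    # first responses, recorded as option letters.
    answer_key = ["A", "C", "B", "D", "A", "B"]
    model_a    = ["A", "C", "B", "A", "A", "B"]   # e.g., Claude 3.5 Sonnet
    model_b    = ["A", "B", "B", "D", "C", "B"]   # e.g., ChatGPT-4o

    # Accuracy: fraction of responses matching the official key.
    n = len(answer_key)
    correct_a = sum(r == k for r, k in zip(model_a, answer_key))
    correct_b = sum(r == k for r, k in zip(model_b, answer_key))
    print(f"accuracy: {correct_a / n:.3f} vs {correct_b / n:.3f}")

    # Chi-square test on the 2x2 correct/incorrect contingency table
    # compares the two models' accuracies.
    table = [[correct_a, n - correct_a],
             [correct_b, n - correct_b]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi-square p-value: {p_value:.3f}")

    # Cohen's kappa on the raw option letters measures agreement between
    # the two models beyond what chance alone would produce.
    kappa = cohen_kappa_score(model_a, model_b)
    print(f"Cohen's kappa: {kappa:.3f}")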
RESULTS: Claude 3.5 Sonnet achieved the highest overall accuracy (72.5%), followed by ChatGPT-4o (66.0%) and ChatGPT-4 (55.5%). Claude 3.5 Sonnet and ChatGPT-4o significantly outperformed ChatGPT-4 (p<0.01 and p=0.03, respectively), while no significant difference was observed between Claude 3.5 Sonnet and ChatGPT-4o (p=0.16). Model responses agreed 74.5% of the time, with a Cohen’s κ of 0.47. Retinal Pathology was the best-performing domain for all models, whereas Anatomy and Physiology of the Retina and Diagnosis and Treatment were the weakest domains for Claude 3.5 Sonnet and ChatGPT-4, respectively.
CONCLUSIONS: This study is the first to assess Claude 3.5 Sonnet, ChatGPT-4, and ChatGPT-4o on a retina specialist certification exam. Claude 3.5 Sonnet and ChatGPT-4o significantly outperformed ChatGPT-4, highlighting their potential as study aids for retina specialist board exams. These findings suggest that the enhanced capabilities of Claude 3.5 Sonnet and ChatGPT-4o offer substantial improvements in medical education contexts.

Keywords: Artificial intelligence; ChatGPT; Retina; Medical education; Ophthalmology; Large language model; Natural language processing

