Introduction: In recent years, artificial intelligence tools such as large language models (LLMs) have expanded the potential of diagnostic medicine, including histopathology. This study evaluates the diagnostic ability and utility of publicly available LLMs in diagnosing unknown cases from mobile phone images of hematoxylin-eosin (H-E) stained slides, and compares their performance with that of residents. Method: Twenty cases, covering a variety of entities, were collected from teaching sets of non-HCA patients and publicly available domains that are used for residents' unknown slide sessions. Three publicly available LLMs (Chat-GPT 4.0, Claude 3.5 Sonnet, and Gemini 1.5 Flash) were used to generate diagnoses from the H-E slide histology images, using a standard prompt. The same cases were evaluated blindly by four residents. The accuracy of the three LLMs was compared with each other and with the accuracy of the residents. Results: The most accurate LLM was Claude, with an accuracy rate of 50%, followed by Gemini (40%) and Chat-GPT (35%). The highest accuracy rate among the LLMs (50%) was lower than the lowest accuracy rate among the residents (55%). The average accuracy rate was 41.67% for the LLMs versus 67.5% for the residents. Conclusions: Current LLMs are not sufficiently accurate for diagnostic use and need to be improved before they can support histopathologic diagnosis.
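The arithmetic behind the reported results can be checked with a short sketch. It uses only the figures stated in the abstract (twenty cases, the three LLM accuracy rates, and the residents' lowest and average rates). The variable and function names are illustrative assumptions, not part of the study, and per-resident scores are left out because the abstract does not report them.

```python
# Minimal sketch: re-derive the summary figures reported in the abstract.
# All names below are hypothetical; only the numbers come from the abstract.

N_CASES = 20  # number of unknown cases evaluated

# Reported accuracy rates (percent) for each LLM
llm_accuracy_pct = {
    "Claude 3.5 Sonnet": 50,
    "Gemini 1.5 Flash": 40,
    "Chat-GPT 4.0": 35,
}

# Reported resident figures (percent); individual scores are not given
RESIDENT_MIN_PCT = 55
RESIDENT_MEAN_PCT = 67.5


def correct_cases(pct, n_cases=N_CASES):
    """Convert an accuracy percentage into a count of correctly diagnosed cases."""
    return round(pct * n_cases / 100)


def mean_pct(values):
    """Arithmetic mean of a collection of percentages."""
    values = list(values)
    return sum(values) / len(values)


llm_correct = {name: correct_cases(p) for name, p in llm_accuracy_pct.items()}
llm_mean = mean_pct(llm_accuracy_pct.values())
gap = RESIDENT_MEAN_PCT - llm_mean

for name, n in llm_correct.items():
    print(f"{name}: {n}/{N_CASES} correct ({llm_accuracy_pct[name]}%)")
print(f"Mean LLM accuracy: {llm_mean:.2f}%")
print(f"Gap to mean resident accuracy: {gap:.2f} percentage points")
print("Best LLM below weakest resident:",
      max(llm_accuracy_pct.values()) < RESIDENT_MIN_PCT)
```

With twenty cases, each correct diagnosis is worth five percentage points, so the 50%, 40%, and 35% rates correspond to 10, 8, and 7 correct cases. A sample this small means a single case shifts any one accuracy rate noticeably, which is worth keeping in mind when comparing the models with each other.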