Now you can hire an LLM to write your physiology test for you.

Evaluating the performance of ChatGPT, Bard, and Bing at generating medical physiology multiple-choice questions

Analysing the Applicability of ChatGPT, Bard, and Bing to Generate Reasoning-Based Multiple-Choice Questions in Medical Physiology

Mayank Agarwal, Priyanka Sharma, Ayan Goswami

Introduction

The paper introduces artificial intelligence (AI) as the development of computer systems that mimic human cognitive functions to solve complex problems, a field advancing rapidly across the sciences and particularly in healthcare and biomedical research. The study focuses on three AI models, ChatGPT, Bard, and Bing, each of which was asked on June 2, 2023 to generate 110 multiple-choice questions (MCQs) in medical physiology. Two physiologists rated the MCQs on three parameters, validity, difficulty, and reasoning ability, using a scale from 0 to 3, and the ratings were recorded in an Excel spreadsheet for subsequent analysis. The work sheds light on the cognitive abilities exhibited by these AI models and their potential implications for domains such as healthcare and scientific research.

Parameter Rating

The Parameter Rating section explains how the generated MCQs were judged: a question was considered valid if it suited the medical physiology subject and was clear, without ambiguity. The two physiologists, blinded to which AI model produced each MCQ, rated every question for validity, difficulty, and reasoning ability. After a week of evaluation, their ratings were averaged and compiled for statistical analysis, and the originality of the content was checked with Turnitin software. A figure in the paper outlines the method, reflecting the systematic approach taken to assess the quality and suitability of the AI-generated MCQs for medical physiology competence.
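As a rough illustration of how two raters' scores could be compiled and averaged per MCQ, here is a minimal sketch in Python; the file name, column names, and CSV layout are assumptions for illustration, not the authors' actual workbook.

```python
# Minimal sketch (not the authors' actual workflow): compile the two
# physiologists' ratings and average them per MCQ.
import pandas as pd

# Hypothetical export of the Excel sheet: one row per MCQ, identified by
# AI model and competency, with both raters' 0-3 scores for each parameter.
ratings = pd.read_csv("mcq_ratings.csv")

params = ["validity", "difficulty", "reasoning"]
for p in params:
    # Average the two blinded raters' scores for each parameter.
    ratings[f"{p}_mean"] = ratings[[f"{p}_rater1", f"{p}_rater2"]].mean(axis=1)

# Summarize per-model medians as input for the later statistical analysis.
summary = ratings.groupby("ai_model")[[f"{p}_mean" for p in params]].median()
print(summary)
```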

Statistical Analysis

The researchers entered the data into Microsoft Excel 365 and performed the statistical analysis in IBM SPSS Statistics Version 27.0 for Windows, opting for non-parametric tests because the data were ordinal. Data are presented as median and interquartile range (Q1-Q3). Distributions across total and module-wise responses were compared with the independent-samples Kruskal-Wallis test, followed by post-hoc tests for pairwise comparisons. Agreement in scores between the two raters was evaluated with Cohen's kappa (κ), and a p-value <0.05 was considered statistically significant.
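The same workflow can be approximated outside SPSS. Below is a minimal sketch in Python, assuming the averaged ratings live in a hypothetical mcq_ratings.csv with columns ai_model, validity_mean, validity_rater1, and validity_rater2 (names invented for illustration); pairwise Mann-Whitney U tests with Bonferroni correction stand in for the SPSS post-hoc procedure, which the summary does not specify.

```python
# Minimal sketch of the described analysis for a single parameter (validity).
from itertools import combinations

import pandas as pd
from scipy.stats import kruskal, mannwhitneyu
from sklearn.metrics import cohen_kappa_score

ratings = pd.read_csv("mcq_ratings.csv")  # hypothetical data file

# Median and interquartile range (Q1-Q3) per AI model.
desc = ratings.groupby("ai_model")["validity_mean"].quantile([0.25, 0.5, 0.75]).unstack()
print(desc)

# Independent-samples Kruskal-Wallis test across the three AI models.
groups = [g["validity_mean"].values for _, g in ratings.groupby("ai_model")]
h_stat, p_value = kruskal(*groups)
print("Kruskal-Wallis p =", p_value)

# Post-hoc pairwise comparisons, Bonferroni-adjusted.
pairs = list(combinations(ratings["ai_model"].unique(), 2))
for a, b in pairs:
    x = ratings.loc[ratings["ai_model"] == a, "validity_mean"]
    y = ratings.loc[ratings["ai_model"] == b, "validity_mean"]
    _, p = mannwhitneyu(x, y)
    print(a, "vs", b, "adjusted p =", min(p * len(pairs), 1.0))

# Inter-rater agreement between the two physiologists (Cohen's kappa).
kappa = cohen_kappa_score(ratings["validity_rater1"], ratings["validity_rater2"])
print("Cohen's kappa =", kappa)
```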

Results

The study evaluated the performance of ChatGPT, Bard, and Bing in generating multiple-choice questions (MCQs) for the NMC CBME curriculum. ChatGPT and Bard each produced 110 MCQs covering 22 competencies, while Bing provided only 100, failing to generate questions for two competencies; Bing received ratings of 0 for validity, difficulty, and reasoning ability for those two competencies. Overall, Bing generated the least valid MCQs, and ChatGPT generated the least difficult ones. Turnitin similarity indices were 39% for ChatGPT, 49% for Bard, and 52% for Bing. Inter-rater reliability was strong (Cohen's kappa ≥ 0.8) for all AIs and all parameters. These results highlight differences among the AI models in generating MCQs and underscore the need for further development to improve the validity and difficulty of the questions.

Analysis of the generated MCQs also revealed characteristic patterns. ChatGPT appeared to be the slowest at generating MCQs, although exact timings were not recorded. Of Bard's 110 questions, 47 had stems containing the phrase "Which of the following is the most important," and 54 offered "all of the above" as an option; Bard was also the only AI that supplied answers to its MCQs along with explanations. Every MCQ generated by Bing, by contrast, used a stem of the form "Which of the following is not," a negatively worded construction. These observations give useful insight into how the AI models behave when generating MCQs, with implications for their application in fields such as healthcare and biomedical research.
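For readers who want to reproduce such stem-pattern counts on their own set of generated questions, here is a small hypothetical sketch; the data layout (dictionaries with "stem" and "options" keys) is an assumption, not the authors' format.

```python
# Count the stem and option patterns described above in a list of MCQs.
from typing import Dict, List


def count_patterns(mcqs: List[Dict]) -> Dict[str, int]:
    counts = {
        "most_important_stem": 0,  # "Which of the following is the most important"
        "all_of_the_above": 0,     # "all of the above" offered as an option
        "negative_stem": 0,        # "Which of the following is not"
    }
    for q in mcqs:
        stem = q["stem"].lower()
        options = [o.lower() for o in q["options"]]
        if "which of the following is the most important" in stem:
            counts["most_important_stem"] += 1
        if any("all of the above" in o for o in options):
            counts["all_of_the_above"] += 1
        if "which of the following is not" in stem:
            counts["negative_stem"] += 1
    return counts


# Example usage with a single Bing-style negatively worded MCQ (invented):
sample = [{
    "stem": "Which of the following is not a function of the kidney?",
    "options": ["Erythropoietin secretion", "Glucose storage",
                "Acid-base regulation", "All of the above"],
}]
print(count_patterns(sample))
```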

Discussion

The study evaluated the ability of ChatGPT, Bard, and Bing to create valid, difficult, reasoning-based MCQs in medical physiology. ChatGPT generated the most valid MCQs but the least difficult ones, and all three AIs struggled to generate high-level, reasoning-based questions. Shortcomings in MCQ construction were identified, such as negatively worded stems and other non-ideal question structures, and performance varied across competency modules, as did the text similarity indices. Previous studies have demonstrated successful use of ChatGPT in medical education and its capacity to provide accurate responses, yet its ability to create reasoning-based MCQs still lags behind human intelligence. The integration of AI into medical education is seen as inevitable and could revolutionize the learning experience, although ensuring the accuracy and reliability of the information AI systems provide remains a significant challenge.

Limitations

The study had several limitations. It assessed only the ability of the AI systems to generate MCQs in medical physiology, which may limit how well the findings generalize to other subjects or domains. It also relied on a single user for the conversational interactions with ChatGPT, so responses might differ with other users or at other times, and paraphrasing the questions could likewise introduce variation in ChatGPT's responses, affecting its overall performance evaluation. Scoring of the AI-generated responses was subjective and relied on human evaluators, potentially introducing evaluation bias despite efforts to mitigate it. Lastly, MBBS students were not included in the MCQ item analysis, leaving a gap in the evaluation process.

This article was summarized by an AI tool that uses natural language processing. The tool is not perfect and may make mistakes or produce inaccurate or irrelevant information, but its output is reviewed by the post’s author prior to publishing. If you want to learn more, please refer to the original source cited at the end of the article.