Zero-Shot Medical Reasoning with LLMs

Working together improves LLM performance

MEDAGENTS: Large Language Models as Collaborators for Zero-shot Medical Reasoning

Xiangru Tang, Anni Zou, Zhuosheng Zhang, Yilun Zhao, Xingyao Zhang, Arman Cohan, Mark Gerstein

This paper discusses the challenges that large language models (LLMs) face in adapting to the medical field and proposes a multi-disciplinary collaboration (MC) framework to address them. The framework leverages role-playing LLM-based agents that participate in collaborative multi-round discussions, enhancing the LLM's domain proficiency and reasoning capabilities. The framework is training-free and interpretable, and comprises five stages: gathering domain experts, proposing individual analyses, summarizing these analyses into a report, iterating over discussions until a consensus is reached, and making a final decision.
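The five stages above can be sketched as a simple loop around an LLM call. This is an illustrative reconstruction, not the authors' actual code: the function `mc_framework`, the prompt wording, and the AGREE/REVISE voting convention are all assumptions made here for clarity.

```python
from typing import Callable

def mc_framework(question: str,
                 call_llm: Callable[[str], str],
                 num_experts: int = 3,
                 max_rounds: int = 5) -> str:
    """Hypothetical sketch of the MC pipeline: expert gathering, individual
    analysis, report summarization, consensus discussion, final decision."""
    # Stage 1: ask the LLM to propose relevant medical specialties.
    domains_raw = call_llm(
        f"List {num_experts} medical specialties relevant to: {question}")
    domains = [d.strip() for d in domains_raw.split(",")[:num_experts]]

    # Stage 2: each role-played expert produces an individual analysis.
    analyses = [call_llm(f"As a {d} expert, analyze: {question}")
                for d in domains]

    # Stage 3: summarize the analyses into a shared report.
    report = call_llm("Summarize into one report:\n" + "\n".join(analyses))

    # Stage 4: iterate discussion rounds until every expert approves
    # (or a round limit is hit), revising the report on disagreement.
    for _ in range(max_rounds):
        votes = [call_llm(f"As a {d} expert, reply AGREE or REVISE "
                          f"for this report:\n{report}")
                 for d in domains]
        if all("AGREE" in v.upper() for v in votes):
            break  # consensus reached
        report = call_llm("Revise the report using this feedback:\n"
                          + "\n".join(votes))

    # Stage 5: make the final decision from the consensus report.
    return call_llm(f"Based on this report, answer {question}:\n{report}")


# Demo with a canned stub in place of a real LLM, so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("List"):
        return "cardiology, neurology, oncology"
    if "AGREE or REVISE" in prompt:
        return "AGREE"
    if prompt.startswith("Based on"):
        return "Answer: B"
    return "individual analysis"

answer = mc_framework("Which drug is first-line for condition X?", fake_llm)
```

In a real deployment, `call_llm` would wrap an actual model API; keeping it as an injected callable makes the discussion protocol itself testable independently of any provider.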

Results of the MC Framework

The paper focuses on the zero-shot scenario and shows that the proposed MC framework excels at mining and harnessing the medical expertise embedded in LLMs, extending their reasoning abilities. The research evaluates the framework's performance on medical question-answering tasks and identifies four common error types through human evaluation. The proposed MC framework outperforms zero-shot baselines and performs comparably to a strong few-shot baseline.

Motivation for Multi-Agent Collaboration

The paper also highlights the challenges of obtaining high-quality instruction-tuning data in the medical domain and the limitations of existing instruction-tuning methods. It emphasizes the success of LLM-based agents in multi-agent collaboration, which brings the model's embedded expertise to the fore and enhances its reasoning capabilities over multi-round interactions.


In conclusion, the paper's major contributions include proposing a novel multi-disciplinary collaboration framework for question-answering tasks in the medical domain, presenting experimental results demonstrating the effectiveness of the MC framework, identifying and categorizing common error types through human evaluation, and shedding light on potential future studies to address the identified limitations and enhance the framework's proficiency and reliability.

This article was summarized by an AI tool that uses natural language processing. The tool is not perfect and may produce inaccurate or irrelevant information, but its output is reviewed by the post’s author prior to publishing. If you want to learn more, please refer to the original source cited at the end of the article.