The debate between open-source and closed-source generative artificial intelligence (AI) has been a hot topic. Open-source large language models (LLMs) are being produced at a steady pace, led by Meta's widely used Llama 2. On the closed-source side are commercial offerings such as OpenAI's GPT-4 and Anthropic's Claude 2.
Researchers at Pepperdine University and the University of California, Los Angeles recently published a study comparing these programs on their ability to answer nephrology questions. According to the study, GPT-4 far outperformed Llama 2 and the other open-source models in this domain.
The comparison used so-called "zero-shot" tasks, in which a language model is tested as-is, with no fine-tuning and no examples of right and wrong answers in the prompt. The models were given 858 nephrology questions, which required significant data preparation: the questions had to be converted into prompts the models could consume, and automated techniques had to be developed to compare each model's answers against the correct answers and score the results.
The authors suspect that one reason for GPT-4's superior performance is that it was trained on third-party medical data that is not publicly available. They conclude that access to high-quality medical training data will likely remain a key factor determining model performance in the future. However, efforts are underway that may help close the gap for open-source models, such as federated training, in which models are trained locally on private data and the results are contributed to an aggregate effort in the public cloud. Other endeavors, such as Google DeepMind's Med-PaLM and "retrieval-augmented generation," also offer paths to improving the performance of AI language models.