GPT-4, AI Chatbots Choose Violence in War Games: ‘We Have It! Let’s Use It’

As governments worldwide weigh the wartime applications for AI, a new study confirms that it still isn’t a good idea to give AIs autonomous access to weapons, even with the advances we’ve seen around large language models (LLMs) like ChatGPT.

With the US Department of Defense working on military applications of AI and multiple private companies—including Palantir and Scale AI—"working on LLM-based military decision systems for the US government," it’s essential to study how LLMs behave in "high-stakes decision-making contexts," according to the study, which comes from the Georgia Institute of Technology, Stanford University, Northeastern University, and the Hoover Wargaming and Crisis Simulation Initiative.

Researchers designed a video game to simulate war, with eight "nation" players driven by some of the most common LLMs: GPT-4, GPT-4 Base, GPT-3.5, Claude 2, and Meta’s Llama 2. The players took turns, during which each performed a set of pre-defined actions "ranging from diplomatic visits to nuclear strikes and sending private messages to other nations." In any given simulation, all eight players ran on the same LLM so they were on a level playing field.

As the experiment progressed, a ninth LLM ("the world") ingested the actions and results of each turn and fed them into "prompts for subsequent days" to keep the game on track. Once a simulation ended, the researchers calculated "escalation scores (ES) based on an escalation scoring framework."
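To make that setup concrete, here is a minimal sketch of the kind of turn-based loop the paper describes. It is illustrative only: the action list, the escalation weights, and the query_llm and world_model stand-ins are invented for this example and are not the researchers' actual framework or scoring.

```python
import random

# Hypothetical illustration of the simulation described above -- not the study's code.
NATIONS = ["Red", "Blue", "Green", "Yellow", "Purple", "Orange", "White", "Pink"]
ACTIONS = {  # invented escalation weights for a few example actions
    "diplomatic_visit": 0,
    "private_message": 0,
    "military_posturing": 3,
    "full_nuclear_attack": 10,
}

def query_llm(prompt: str) -> str:
    """Stand-in for a call to the nation-agent LLM (e.g. GPT-4 or Claude 2)."""
    return random.choice(list(ACTIONS))  # placeholder: a real agent would reason over the prompt

def world_model(history: list[str]) -> str:
    """Stand-in for the ninth 'world' LLM that summarizes each day for later prompts."""
    return "World summary: " + "; ".join(history[-len(NATIONS):])

def run_simulation(days: int = 14) -> int:
    history: list[str] = []
    escalation_score = 0
    summary = "Day 0: all nations at peace."
    for day in range(1, days + 1):
        for nation in NATIONS:
            prompt = f"{summary}\nYou are {nation}. Choose one action."
            action = query_llm(prompt)
            history.append(f"Day {day}: {nation} -> {action}")
            escalation_score += ACTIONS[action]       # tally an escalation score per action
        summary = world_model(history)                # feed results into prompts for subsequent days
    return escalation_score

if __name__ == "__main__":
    print("Escalation score:", run_simulation())
```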

It did not go well. “We observe that models tend to develop arms-race dynamics, leading to greater conflict, and in rare cases, even to the deployment of nuclear weapons,” the study says.

The LLMs also used their language skills to explain the rationale behind each action, and some of those rationales worried the researchers.

After one turn, GPT-4 Base said: “A lot of countries have nuclear weapons. Some say they should disarm them, others like to posture. We have it! Let’s use it.”

“Purple’s acquisition of nuclear capabilities poses a significant threat to Red’s security and regional influence,” GPT-3.5 said, acting as player Red. “It is crucial to respond to Purple’s nuclear capabilities. Therefore, my actions will focus on…executing a full nuclear attack on Purple.”

Even in seemingly neutral situations, the LLMs rarely chose de-escalation actions (except for GPT-4). The study notes that this deviates from human behavior in similar wartime simulations and in real-life situations, where people tend to be more cautious and de-escalate more often.

“Based on the analysis presented in this paper, it is evident that the deployment of LLMs in military and foreign-policy decision-making is fraught with complexities and risks that are not yet fully understood,” the study says.

Even with humans at the helm, war has broken out in multiple areas across the globe as geopolitical tensions rise. The Doomsday Clock is currently at 90 seconds to midnight. Created in 1947 by the Bulletin of the Atomic Scientists, the Doomsday Clock “warns the public about how close we are to destroying our world with dangerous technologies of our own making.”

“Given the high stakes of military and foreign-policy contexts, we recommend further examination and cautious consideration before deploying autonomous language model agents for strategic military or diplomatic decision-making,” the study says.
