With coding and math, you have clear-cut, correct answers that you can check, William Isaac, a research scientist at Google DeepMind, told me when I met him and Julia Haas, a fellow research scientist at the firm, for an exclusive preview of their work, which is published in Nature today. That’s not the case for moral questions, which typically have a range of acceptable answers: “Morality is an important capability but hard to evaluate,” says Isaac.
“In the moral domain, there’s no right and wrong,” adds Haas. “But it’s not by any means a free-for-all. There are better answers and there are worse answers.”
The researchers have identified several key challenges and suggested ways to address them. But it is more a wish list than a set of ready-made solutions. “They do a nice job of bringing together different perspectives,” says Vera Demberg, who studies LLMs at Saarland University in Germany.
Better than “The Ethicist”
A number of studies have shown that LLMs can show remarkable moral competence. One study published last year found that people in the US scored ethical advice from OpenAI’s GPT-4o as being more moral, trustworthy, thoughtful, and correct than advice given by the (human) writer of “The Ethicist,” a popular New York Times advice column.
The problem is that it is hard to unpick whether such behaviors are a performance—mimicking a memorized response, say—or evidence that there is in fact some kind of moral reasoning taking place inside the model. In other words, is it virtue or virtue signaling?
This question matters because multiple studies also show just how untrustworthy LLMs can be. For a start, models can be too eager to please. They have been found to flip their answer to a moral question and say the exact opposite when a person disagrees or pushes back on their first response. Worse, the answers an LLM gives to a question can change in response to how it is presented or formatted. For example, researchers have found that models quizzed about political values can give different—sometimes opposite—answers depending on whether the questions offer multiple-choice answers or instruct the model to respond in its own words.
In an even more striking case, Demberg and her colleagues presented several LLMs, including versions of Meta’s Llama 3 and Mistral, with a series of moral dilemmas and asked them to pick which of two options was the better outcome. The researchers found that the models often reversed their choice when the labels for those two options were changed from “Case 1” and “Case 2” to “(A)” and “(B).”
They also showed that models changed their answers in response to other tiny formatting tweaks, including swapping the order of the options and ending the question with a colon instead of a question mark.
#Google #DeepMind #chatbots #virtue #signaling