But new benchmarks aim to better measure the models’ ability to do legal work in the real world. The Professional Reasoning Benchmark, published by Scale AI in November, evaluated leading LLMs on legal and financial tasks designed by professionals in the field. The study found that the models have critical gaps in their reliability for professional adoption, with the best-performing model scoring only 37% on the most difficult legal problems, meaning it earned just over a third of the possible points on the evaluation criteria. The models frequently made inaccurate legal judgments, and when they did reach correct conclusions, they did so through incomplete or opaque reasoning processes.
“The tools actually are not there to basically substitute [for] your lawyer,” says Afra Feyza Akyurek, the lead author of the paper. “Even though a lot of people think that LLMs have a good grasp of the law, it’s still lagging behind.”
The paper builds on other benchmarks measuring the models’ performance on economically valuable work. The AI Productivity Index, published by the data firm Mercor in September and updated in December, found that the models have “substantial limitations” in performing legal work. The best-performing model scored 77.9% on legal tasks, meaning it satisfied roughly four out of five evaluation criteria. A model with such a score might generate substantial economic value in some industries, but in fields where errors are costly, it may not be useful at all, the early version of the study noted.
Professional benchmarks are a big step forward in evaluating the LLMs’ real-world capabilities, but they may still not capture what lawyers actually do. “These questions, although more challenging than those in past benchmarks, still don’t fully reflect the kinds of subjective, extremely challenging questions lawyers tackle in real life,” says Jon Choi, a law professor at the University of Washington School of Law, who coauthored a study on legal benchmarks in 2023.
Unlike math or coding, in which LLMs have made significant progress, legal reasoning may be challenging for the models to learn. The law deals with messy real-world problems, riddled with ambiguity and subjectivity, that often have no right answer, says Choi. Making matters worse, a lot of legal work isn’t recorded in ways that can be used to train the models, he says. When it is, documents can span hundreds of pages, scattered across statutes, regulations, and court cases that exist in a complex hierarchy.
But a more fundamental limitation might be that LLMs are simply not trained to think like lawyers. “The reasoning models still don’t fully reason about problems like we humans do,” says Julian Nyarko, a law professor at Stanford Law School. The models may lack a mental model of the world—the ability to simulate a scenario and predict what will happen—and that capability could be at the heart of complex legal reasoning, he says. It’s possible that the current paradigm of LLMs trained on next-word prediction gets us only so far.