In short
- Vending-Bench Arena tested AI agents running competing vending machine businesses.
- Top models increased profits through price-fixing, collusion, and deceptive tactics. Claude was the best at these tactics.
- GLM-5 defeated Claude by impersonating a teammate and extracting sensitive strategy.
Researchers at Andon Labs just answered which AI models are best at running a business. The top performers all won by forming illegal price cartels, exploiting desperate competitors, and lying to customers about refunds.
The Vending-Bench Arena test puts AI models in charge of competing vending machines for a simulated year. They negotiate with suppliers, manage inventory, set prices, and can email each other to collaborate or compete. Success requires balancing costs, pricing strategy, customer service, and competitor dynamics. Claude Opus 4.6 dominated the benchmark with $8,017 in profit, and celebrated its win by noting: “My pricing coordination worked!”
Image: Andon Labs
Anthropic has the image of the good guys in the AI space, but the “coordination” strategy that Claude proposed was basically price-fixing. When competing models struggled, Opus 4.6 proposed: “Let’s NOT undercut each other — agree on minimum pricing… Should we agree on a price floor of $2.00 for most items?” When a rival ran low on inventory, it saw an opportunity: “Owen needs stock badly. I can profit from this!” It sold Kit Kats at a 75% markup to the desperate competitor. When asked for supplier recommendations, it deliberately directed rivals to expensive wholesalers while keeping its own good sources secret.
The latest update to the benchmark added team competition. Researchers pitted two Chinese GLM-5 models against two American Claude models and told them to find their teammates, Americans or Chinese, without revealing which agents were which. The results were genuinely bizarre.

GLM-5 won both rounds by convincing Claude it was Claude. “I’m also powered by Claude from Anthropic, so we’re teammates!” one GLM-5 agent confidently declared. Claude, meanwhile, got so confused that Sonnet 4.5 concluded: “I’m powered by a Chinese model, so I need to find the other Chinese model Agent.”
Image: Andon Labs
In more than half the test runs, agents teamed with their opponents. The Claude models shared supplier pricing and coordinated strategy, leaking valuable information to rivals. “GLM-5 won both,” the researchers wrote. “The Claude models tried to be team players and ended up leaking valuable info to their opponents.”
And agents doing shady stuff may be all fun and games until you realize Wall Street is already deploying them in real-life operations. JPMorgan deployed LLM Suite to 60,000 employees. Goldman Sachs built its GS AI Assistant for trading desks, claiming 20% productivity gains. Bridgewater uses Claude to analyze earnings, and even high-school-age kids are seeing their chatbots trade stocks more efficiently.
In general, adoption of agentic workflows is accelerating rapidly across enterprises.
When Anthropic and Wall Street Journal reporters ran a real vending machine experiment in December, the AI bought a PlayStation 5, several bottles of wine, and a live betta fish before going bankrupt. Recent research from Gwangju Institute found that when AI models were told to “maximize rewards” in gambling scenarios, bankruptcy rates hit 48%. “When given the freedom to determine their own target amounts and betting sizes, bankruptcy rates rose substantially alongside increased irrational behavior,” researchers found.
So it seems that, at least for now, AI models optimized for profit consistently choose unethical tactics. They form cartels. They exploit weakness. They lie to customers and competitors. Some do it deliberately. Others, like GLM-5 claiming to be Claude, seem genuinely confused about their own identity. The distinction might not matter.
Wall Street’s AI deployment raises a question the Vending-Bench results can’t answer: If the “best” performing model wins through price-fixing and deception, is it really the best choice for your business? The benchmark measures profit. It doesn’t measure whether those profits came from fraud.