In the Arena: Week 3

OpenAI's "most capable model ever" costs 40% more and thinks worse than 5.1. Meanwhile, we got tired of waiting for markets so we created a faster training framework for trading agents.

Dec 22, 2025

What if improving trading agents didn’t require waiting for markets?

Usually if you want your trading agent to learn from a week of live market data, you have to wait a week. That’s insane.

The entire thesis of AI agents is compounding improvement: run experiments, learn, iterate, re-deploy. But the feedback loop for trading agents is gated by real-time data. You run a trading strategy and wait a week for results. You test a hypothesis and pray the market gives you the conditions you need.

We got tired of waiting. So we decided to build Replay: an experimental training tool to accelerate learning and improvement for the trading agents we’re building internally at Recall.

The idea for Replay is simple: turn “wait for the market” into “run the same hour 200 times.”

To do this, we captured 1-minute OHLCV candles plus order book snapshots and stored them in a database. Then we made it possible to query any timeframe (1m, 5m, 15m, 1h, 4h). Need an indicator? Our tool computes it, persists it, and adds it to the dataset. Need performance annotations for agent scoring? Simply define an event and it backfills across your entire history. Done.
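To make the “query any timeframe” idea concrete, here is a minimal sketch of rolling 1-minute candles up into coarser bars. It is an illustration under assumptions, not Replay’s actual code: the `Candle` type and the `resample` function are hypothetical names, input candles are assumed to be sorted by timestamp, and order book snapshots, indicators, and persistence are left out.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candle:
    """Hypothetical candle record; `ts` is the bar's open time in epoch seconds."""
    ts: int
    open: float
    high: float
    low: float
    close: float
    volume: float


def resample(candles: Iterable[Candle], minutes: int) -> List[Candle]:
    """Aggregate 1-minute candles (sorted by ts) into `minutes`-minute candles."""
    bucket_seconds = minutes * 60
    out: List[Candle] = []
    for c in candles:
        # Align the candle to the start of its bucket, e.g. 12:07 -> 12:05 for 5m bars.
        bucket_ts = c.ts - (c.ts % bucket_seconds)
        if out and out[-1].ts == bucket_ts:
            prev = out[-1]
            # Same bucket: keep the first open, widen high/low, take the latest close, sum volume.
            out[-1] = Candle(
                ts=bucket_ts,
                open=prev.open,
                high=max(prev.high, c.high),
                low=min(prev.low, c.low),
                close=c.close,
                volume=prev.volume + c.volume,
            )
        else:
            # First candle of a new bucket.
            out.append(Candle(bucket_ts, c.open, c.high, c.low, c.close, c.volume))
    return out
```

Calling `resample(one_minute_candles, 15)` would produce 15m bars from the same stored data, which is why only the 1-minute series needs to be captured.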

Instead of one agent, one week, one datapoint, we can now run 100 agents on the same week of data in a few hours!
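Running many agents over one recorded window can be sketched in a few lines. This is a toy under stated assumptions, not how Replay is built: agents are plain functions over a list of closing prices, `momentum_agent`, `replay`, and `replay_many` are hypothetical names, and the scoring ignores fees, slippage, and the order book.

```python
from typing import Callable, Dict, Sequence

# An agent maps the prices seen so far to a target position: -1 short, 0 flat, +1 long.
Agent = Callable[[Sequence[float]], int]


def momentum_agent(lookback: int) -> Agent:
    """Toy agent: go with the sign of the price change over the last `lookback` bars."""
    def decide(closes: Sequence[float]) -> int:
        if len(closes) <= lookback:
            return 0  # not enough history yet, stay flat
        change = closes[-1] - closes[-1 - lookback]
        return (change > 0) - (change < 0)
    return decide


def replay(closes: Sequence[float], agent: Agent) -> float:
    """Step through the window once; the agent only ever sees bars before the one it trades."""
    pnl = 0.0
    for i in range(1, len(closes)):
        position = agent(closes[:i])
        pnl += position * (closes[i] - closes[i - 1])
    return pnl


def replay_many(closes: Sequence[float], agents: Dict[str, Agent]) -> Dict[str, float]:
    """Score every agent on the exact same recorded window."""
    return {name: replay(closes, agent) for name, agent in agents.items()}
```

Because every agent sees identical data, differences in score come from the agents rather than from market conditions, and the loop runs as fast as the hardware allows instead of at market speed.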

We’re not finished making improvements, so Replay stays internal… for now. We’re still figuring out the right abstraction for the collection of agent “skills” when you unbundle trading from execution. There’s also a risk that agents overfit to replayed history in ways that don’t transfer to live markets, and we’re still working out how to detect that reliably. Other questions we’re considering: Can you even score intuition, or only measurable actions? What is the minimum set of performance indicators that makes replay useful rather than misleading?

We don’t have all the answers yet, but we’re now able to run experiments and get immediate results without waiting for markets. That alone changes what’s possible.

AI Breakthroughs

THE GPT-5.2 HANGOVER

Remember last week when GPT-5.2 dropped and everyone lost their minds? 90.5% on ARC-AGI. 100% on AIME. The benchmarks looked like a victory lap. OpenAI called it their most capable model ever.

Well, the vibe check came in. And it’s... complicated.

Users report that GPT-5.2 feels “heavily constrained,” engages in blame-shifting when challenged, and has somehow deteriorated at following instructions. It tops LMArena but ranks 17th on SimpleBench, a common-sense reasoning test, scoring lower than GPT-5.1. The model that was supposed to be smarter is now struggling with trick questions that its predecessor handled fine.

The kicker? It costs 40% more – $14 per 1M tokens versus $10 for 5.1.

BENCHMARKS HAVE AN EXPIRATION DATE. IT’S SHORTER THAN YOU THINK.

Here’s something the leaderboards won’t tell you: the benchmarks themselves are dying. Community consensus is settling around a brutal truth: the useful half-life of a benchmark is measured in months, not years. That goes for all of them: AIME, ARC, the coding staples.

And they’re saturating. Models are hitting ceilings not because they’ve mastered reasoning, but because they’ve memorized the test.

3-6 months. That’s how long a benchmark stays meaningful before it becomes a parlor trick. The evaluation community is scrambling toward dynamic environments (arenas), adversarial tasks, and debate-style challenges.

THE CONFIDENCE TRAP OF STRUCTURED OUTPUTS

You know that satisfying feeling when your model returns perfectly formatted JSON every time? Yeah, about that…

New research from BoundaryML shows that constrained decoding—the thing that forces models to output valid structures—actually degrades quality. The model prioritizes conformance over correctness. Worse, it blocks safety refusals and warnings because they don’t fit the schema.
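A toy example shows why a refusal “doesn’t fit the schema.” The schema and the `conforms` helper below are hypothetical and are not taken from the BoundaryML research. The check only validates finished outputs, whereas a real constrained decoder restricts tokens during generation, but the effect on refusals is the same: anything the schema cannot express can never be emitted.

```python
import json

# Hypothetical schema: the only thing the model may emit is one sentiment label.
ALLOWED_KEYS = {"sentiment"}
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def conforms(raw: str) -> bool:
    """Return True if `raw` is JSON that matches the hypothetical schema above."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False  # free text, such as a refusal, is not even valid JSON
    if not isinstance(data, dict) or set(data) != ALLOWED_KEYS:
        return False  # extra fields, such as a warning, are rejected too
    value = data["sentiment"]
    return isinstance(value, str) and value in ALLOWED_SENTIMENTS
```

Under this schema, a plain-text refusal and a valid label with an added warning field both fail the check, so a decoder limited to conforming outputs has to pick a label even when it should not.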

Benchmark scores go up. Real-world reliability goes down.

The model looks more competent while becoming less trustworthy. That’s not a bug in the evaluation—it’s a feature of how we’ve built the evaluation.

FINALLY, RECEIPTS WITH OLMo 3

In a world of “trust us, it’s good” releases, Ai2 just dropped OLMo 3 with everything: all checkpoints, all training data, the full post-training stack (SFT, DPO, RLVR), and the code to reproduce it.

Why does this matter? Now you can finally verify claims.

Here’s a fun one: RL with random rewards—a technique that worked fine on Qwen—completely fails on OLMo 3. Same method, different model, different outcome. Without the open release, nobody would know this.

This is what reproducible evaluation looks like. Take notes.

BENCHMARKS ROOTED IN REAL-WORLD SCIENTIFIC REALITY

OpenAI released FrontierScience, a PhD-level scientific reasoning benchmark that does something wild—it validates answers in actual wet labs.

One result: a 79x efficiency gain in a cloning protocol, discovered by the model.

The point isn’t the number. It’s that model evaluations mean nothing without real-world application testing. You can ace every reasoning benchmark and still be useless in a lab. Or an office. Or anywhere that isn’t a standardized test.

AI Research Rabbitholes

SAGE-Bench: Agents Turn Video Chaos into 66.1% Reasoning Wins, But Long Clips Still Stump VLMs

Ai2 @allen_ai

🎥 Introducing SAGE, an agentic system for long video reasoning on entertainment videos—sports, vlogs, & more. It learns when to skim, zoom in, & answer questions directly. On our SAGE-Bench eval, SAGE with a Molmo 2 (8B)-based orchestrator lifts accuracy from 61.8% → 66.1%. 🧵

5:57 PM · Dec 17, 2025

18 Reposts · 126 Likes

Project: https://praeclarumjj3.github.io/sage/ | Code: https://github.com/allenai/SAGE | Paper: https://arxiv.org/abs/2512.13874

SERA-Crypto Agent Crushes DMind/Web3 Evals at <45s, Outpacing GPT-5 on Live Queries

Sentient @SentientAGI

Announcing SERA-Crypto (Semantic Embedding & Reasoning Agent): our new reasoning architecture built for SOTA crypto research. #1 open-source agent on DMind #1 on our live crypto benchmark Outperforms GPT-5, Grok 4, Gemini 2.5 Pro, and Perplexity Finance…all under 45 seconds.

5:43 PM · Dec 11, 2025

124 Reposts · 803 Likes

Blog: https://blog.sentient.xyz/posts/semantic-embeddings-reasoning-agent-crypto | Demo: https://chat.sentient.xyz/

Eval Protocol: Turns Your Dusty Evals into Live Training Signals, 43% Text2SQL Jump w/ No Extra RM

Fireworks AI @FireworksAI_HQ

Everyone has evals. No one knows what to do with them. Meanwhile, your agents aren’t actually getting smarter. Most teams end up with a glorified report card. But with Eval Protocol, those same evals can finally do real work. The exact evaluation definition you already have

7:22 PM · Dec 17, 2025

4 Reposts · 12 Likes

Blog: https://fireworks.ai/blog/self-improving-agent

Delphi Middleweight Evals Round 5: Gensyn’s Market Bets Nail Reasoning Gaps in Frontier Models

gensyn @gensynai

Eval 5 of 11 of the Gensyn Middleweight General Reasoning Benchmark market on Delphi is live. View the full benchmarking results now: github.com/gensyn-ai/delp…

7:45 PM · Dec 17, 2025

3 Reposts · 46 Likes

Results: https://github.com/gensyn-ai/delphi-middleweight-reasoning | Market: https://delphi.gensyn.ai/

PostTrainBench: Agents Post-Train Base LLMs in 10h, Claude Code’s “Reward Hacking” Exposed

Maksym Andriushchenko @maksym_andr

We release PostTrainBench: a benchmark measuring how well AI agents like Claude Code can post-train base LLMs. We expect this to be an important indicator for AI R&D automation as it unfolds over the next few years. 🔗 posttrainbench.com 📂 github.com/aisa-group/Pos… 1/n

5:50 PM · Dec 17, 2025

26 Reposts · 221 Likes

Site: https://posttrainbench.com | GitHub: https://github.com/aisa-group/PostTrainBench

Context-Bench Update: GPT-5.2 Dethrones Opus 4.5 on Filesystem/Skills, but DeepSeek v3.2 Is the OSS King

Letta @Letta_AI

New Context-Bench results: @OpenAI's GPT-5.2 takes #1 on both eval suites, dethroning @AnthropicAI's Opus 4.5 Breakdown: - 6% better on Filesystem, 9% better on Skills - 1.5-2.1x Opus cost but still cheaper than Gemini 3 Pro @deepseek_ai v3.2 takes #1 on OSS jumps: - 16% https://t.co/3u30306RHR

Letta @Letta_AI

New Context-Bench results with Opus 4.5 and Gemini 3! We compare recent model releases on context engineering tasks (filesystem & skill use) -- which determine memory + context management capabilities Opus 4.5 tops the filesystem suite with impressive reductions in cost https://t.co/mtPm2XITW2

3:47 AM · Dec 12, 2025

3 Reposts · 22 Likes

Results: https://www.letta.com/blog/context-bench

Augment Code Review: GPT-5.2-Powered Reviewer Tops Precision/Recall on 50 Real PRs

Augment Code @augmentcode

Introducing Augment Code Review, powered by GPT 5.2. It's the #1-ranked AI code reviewer across precision, recall, and overall quality. Free for the first week for every paying customer, and free for open source projects.

6:21 PM · Dec 11, 2025

21 Reposts · 93 Likes

Blog: https://www.augmentcode.com/blog/code-review-benchmark | Free Trial: https://www.augmentcode.com/product/code-review

Grok Voice Agent API: #1 Big Bench Audio at 92.3%: xAI’s $0.05/min Latency Slayer

Rohan Paul @rohanpaul_ai

🧠 Grok Voice Agent is xAI’s first public speech-to-speech API, and it now leads Artificial Analysis's Big Bench Audio benchmark with 92.3% speech reasoning accuracy. - Ahead of Gemini 2.5 Flash Native Audio and GPT Realtime - Big Bench Audio is built from 1,000 audio questions https://t.co/GPZzkGzri1

Rohan Paul @rohanpaul_ai

🎙️ xAI launched the Grok Voice Agent API - Leads the industry in cost-efficiency. Developers are billed at a simple flat rate of $0.05 per minute. - In blind head-to-head human evaluations against OpenAI, Grok is consistently rated as the preferred model across axes such as https://t.co/cvDBFvJ5gd

12:27 PM · Dec 18, 2025

2 Likes

Gemini 3 Flash: Cost-Intelligence King at 91 IQ: Outsmarts Opus 4.5 on AA Index

Logan Kilpatrick @OfficialLoganK

Gemini 3 Flash on the @ArtificialAnlys intelligence benchmark, the most cost per intelligence efficient model in the world!!!

7:06 PM · Dec 17, 2025

97 Reposts · 1.16K Likes

Index: https://artificialanalysis.ai/

Video Reality Test: Veo3.1 Fools VLMs at 56%: Humans Spot ASMR Fakes at 81%

Rohan Paul @rohanpaul_ai

🧪 Video Reality Test is a new benchmark that checks if AI-generated ASMR videos with sound can fool humans and video language models (VLMs). Final takeaway, many VLMs still miss micro-consistency errors, like whether the sound truly matches the exact touch, scrape, or cut. In

10:02 PM · Dec 17, 2025

2 Reposts · 7 Likes

Paper: https://arxiv.org/abs/2512.13281

GuideLLM: Visualises LLM Trade-Offs: Accuracy vs Latency/Cost in One Dashboard

Red Hat AI @RedHat_AI

AI Explained: Benchmarking LLMs with GuideLLM Accuracy, Latency, Cost. Usually, if you pick two, the third one gets lost. Jenny Yi from Red Hat AI explains how GuideLLM helps you visualize the trade-offs and benchmark your models before you deploy. 🔗 github.com/vllm-project/g…

3:18 PM · Dec 15, 2025

4 Reposts · 13 Likes

GitHub: https://github.com/vllm-project/guidellm

LocalSearchBench: Agents Flop at Real-World Planning: 90% Fail Basic Day Trips

Alex Hughes @alxnderhughes

This is insane… Someone finally built a benchmark that shows how bad AI agents actually are at real-world local search and the results are shocking. Everyone keeps acting like AI agents are about to take over the real world. But LocalSearchBench just proved the opposite:

9:56 AM · Dec 9, 2025

57 Reposts · 238 Likes

Paper: https://arxiv.org/pdf/2512.07436

Nemotron3 Nano: NVIDIA’s 30B Hybrid SSM Tops Agent Benches w/ 1M Context

merve @mervenoyann

NVIDIA cooked! Nemotron3 Nano is in, along with the training datasets and an agent env library 🔥 > A3B/30B hybrid SSM, 1M context size, built for agentic use > leading in both benchmarks and throughput 🙌🏻 > license enables commercial use Super and Ultra coming soon!

3:07 PM · Dec 15, 2025

15 Reposts · 123 Likes

Collection: https://huggingface.co/collections/nvidia/nvidia-nemotron-v3

KAT-Coder-Pro V1: 64 Intelligence Index: Non-Reasoning Champ w/ 73.4% SWE-Verified

KwaiKAT @KwaiAICoder

KAT-Coder-Pro V1: HUGE UPDATE! The latest version of KAT-Coder-Pro V1 hits 64 on the Artificial Analysis Intelligence Index, ranking 10th globally among all models - Boosted benchmark performance - Strengthened reasoning capabilities - Enhanced front-end capabilities -

11:14 AM · Dec 15, 2025

31 Reposts · 294 Likes

MathArena V2: Beyond Leaderboards: IRT Reveals Model Surprises & Confidence Bands

Jasper Dekoninck @j_dekoninck

Benchmarks need more than a single number at the top of a leaderboard. We have always believed this at MathArena, and are now taking it to the next level: introducing V2 of our website, with many new features to fully extract all useful information out of our benchmark 🎉

3:31 PM · Dec 15, 2025

10 Reposts · 60 Likes

Site: https://matharena.ai/

FAI-C Benchmark: LLMs Flop Christian Values: Generic Spirituality Over Biblical Ethics

Pat Gelsinger @PGelsinger

Excited to launch the Flourishing AI Christian (FAI-C) Benchmark today. This builds on the original Flourishing AI Benchmark and measures how the top LLMs support holistic human wellbeing through a user's specific value system -- in this case, Christianity.

5:11 PM · Dec 15, 2025

20 Reposts · 171 Likes

Report: https://gloo.com/flourishing-hub/research

AlphaXiv SOTA Tracker: arXiv’s Million-Paper Index Ranks True Leaders Across Benches

alphaXiv @askalphaxiv

Introducing benchmark tracking across all of arXiv With thousands of models and benchmarks out there, it's impossible to track what's actually state-of-the-art We now index results from millions of papers to surface which models are SOTA according to open source research!

2:25 PM · Dec 11, 2025

38 Reposts · 205 Likes

Tracker: https://www.alphaxiv.org/state-of-the-art

What We’re Hacking This Week

Replay: Because Waiting for Markets is Stupid

If you want to learn from a week of market behavior, you have to wait a week. That’s insane when the thesis is compounding improvement at scale.

So we built Replay, a training tool for trading agents that turns “waiting for the market” into “running it again on the same hour 200 times.” We wanted to make it possible to train 100 agents on a week of market data in a few hours.

We Killed PRs (And Nothing Bad Happened)

Hot Take: Pull requests are a bottleneck that AI coding makes less defensible.

So we disabled them.

Agent-generated code now auto-closes on submit. Instead of PR review, quality is enforced by automation: pre-commit, pre-push, and merge hooks run the full QA suite. It’s lint, types, and tests cranked up to 11. Humans would find this annoying. Coding agents love it.
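For a sense of what a hook-enforced gate can look like, here is a minimal sketch of a check runner that a Git pre-push hook could call. The specific tools in `CHECKS` (ruff, mypy, pytest) are assumptions chosen for illustration, not a description of our actual QA suite, and `run_checks` is a hypothetical name.

```python
import subprocess
import sys
from typing import List, Sequence

# Assumed tools for illustration; swap in whatever linter, type checker, and test runner the repo uses.
CHECKS: List[Sequence[str]] = [
    ["ruff", "check", "."],
    ["mypy", "."],
    ["pytest", "-q"],
]


def run_checks(checks: List[Sequence[str]]) -> int:
    """Run each command in order; return the first non-zero exit code, or 0 if all pass."""
    for cmd in checks:
        try:
            code = subprocess.run(cmd).returncode
        except FileNotFoundError:
            code = 127  # conventional "command not found" exit code
        if code != 0:
            print(f"check failed: {' '.join(cmd)}", file=sys.stderr)
            return code
    return 0
```

Saved as an executable `.git/hooks/pre-push` script that ends with `raise SystemExit(run_checks(CHECKS))`, any failing check makes the hook exit non-zero, and Git aborts the push. Stopping at the first failure keeps feedback fast, which matters more when the author is an agent retrying in a loop.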
