In the Arena: Week 5
DeepSeek matches GPT-5 for pennies. Open-source closes the gap. Benchmarks get exposed. The smartest model doesn't win anymore—the most reliable system does.
Less data, smarter trades?
We expected more information to mean better LLM trading decisions. We were wrong.
When we stripped four frontier models down to nothing but a price chart – no order books, volume data or technical indicators – and compared their $ETH trading performance against counterparts drowning in every kind of market intelligence, the chart-readers won. Gemini 3 Pro Chart-Only came out on top, returning +9.09% over a few trading days, nearly double the returns of its data-rich twin. Of all the models tested, three of the top four spots went to agents trading on pixels alone.
This wasn’t supposed to happen. Conventional wisdom says more signal means better decisions. But when we dug into their trading behaviours, we noticed something in the data. Chart-only vision models weren’t just profitable—they were disciplined. They made fewer trades, extracted more alpha per position, and avoided the churn that burned through the returns of their data-rich counterparts. The models with complete market feeds kept second-guessing themselves, reversing positions, and reacting to noise that the vision agents couldn’t even see.
However, here's where it gets interesting: the data-rich models outperformed on risk-adjusted returns. Notably, DeepSeek's data-fed variant posted a Sortino ratio of 2.53, nearly twice that of the best vision performer. So what's actually happening? Are chart-only models fundamentally blind to risk? Are the data-rich ones so paralysed by information that they hedge themselves out of their best trades, even as they cut their losers early? One week of data can't answer that. Market conditions shift, luck plays a role, and a single run proves nothing.
So this week we're running the same arena again. Same models. Same asset. Same setup. This week's arena will confirm whether vision-only trading is a genuine edge or a statistical ghost.
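For anyone who wants to sanity-check the metric at home, here's a minimal sketch of how a Sortino ratio can be computed from per-period returns, assuming a zero target return and no annualisation (our arena's exact methodology may differ):

```python
import numpy as np

def sortino_ratio(returns, target=0.0):
    """Sortino ratio: mean excess return divided by downside deviation.

    Only returns below `target` count as risk, which is why an agent that
    cuts its losers early can score well here even with modest total returns.
    """
    r = np.asarray(returns, dtype=float)
    excess = r - target
    downside = np.minimum(excess, 0.0)            # keep only the losing periods
    downside_dev = np.sqrt(np.mean(downside ** 2))
    if downside_dev == 0:
        return float("inf")                       # no losing periods observed
    return excess.mean() / downside_dev

# e.g. a week of daily P&L for one agent
print(sortino_ratio([0.01, -0.002, 0.004, -0.001, 0.008]))
```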
AI Breakthroughs
DeepSeek keeps cooking
DeepSeek dropped a paper on Manifold-Constrained Hyper-Connections that basically says, “what if we just... made residual paths not explode at scale?”
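We haven't re-derived the paper, but the broader hyper-connections idea it builds on swaps the single residual stream for several streams mixed by learnable weights. A rough PyTorch sketch of that general idea, not DeepSeek's manifold-constrained formulation:

```python
import torch
import torch.nn as nn

class HyperConnectionBlock(nn.Module):
    """Toy illustration: k parallel residual streams with learnable read,
    write and mixing weights, instead of a single x + f(x) path.
    Purely schematic; the paper's actual math and constraints differ."""

    def __init__(self, layer: nn.Module, k: int = 4):
        super().__init__()
        self.layer = layer                               # e.g. an attention or MLP sub-layer
        self.read = nn.Parameter(torch.ones(k) / k)      # how streams feed the sub-layer
        self.write = nn.Parameter(torch.ones(k) / k)     # how its output is spread back
        self.mix = nn.Parameter(torch.eye(k))            # stream-to-stream mixing

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (k, batch, seq, dim)
        h = torch.einsum("k,kbsd->bsd", self.read, streams)     # weighted view of the streams
        out = self.layer(h)                                      # run the sub-layer once
        mixed = torch.einsum("ij,jbsd->ibsd", self.mix, streams) # recombine residual streams
        return mixed + self.write.view(-1, 1, 1, 1) * out        # write the output back to all streams

# usage: block = HyperConnectionBlock(nn.Linear(512, 512))
```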
DeepSeek V3.2 is now performing on par with GPT-5 across most standard tests, trailing only slightly on extremely long context retrieval. The Speciale variant secured gold-medal scores on IMO 2025 and IOI 2025. A model trained for a fraction of the cost, with open weights, is now beating the best closed models on math olympiad problems.
The open-source situation
Zuckerberg announced Llama 4 Scout and Maverick in April 2025. The experimental version scored 1417 Elo on LMArena. The vanilla release? Ranked 32nd. Llama 4 Behemoth is still training and supposedly beats GPT-4.5 and Gemini 2.0 Pro on STEM benchmarks. We’ll see.
Qwen 3 from Alibaba is now powering 90,000+ enterprises on Alibaba Cloud. GLM-4.7 Thinking from ZAI hit 89% on LiveCodeBench—matching GPT-5.2. Open models have genuinely closed the gap on coding tasks. The moat around proprietary models is getting thinner every quarter.
Benchmark theatre continues
IQuest dropped a 40B model claiming 81.4% on SWE-Bench Verified. SOTA! Revolutionary! Then people actually used it and discovered it loops endlessly and might not even beat a well-tuned 20B. Guess it optimized for the benchmark. Smh.
Meanwhile, a researcher published reproducible evidence that LLM-as-judge systems have systematic bias. If you’re using LLM-as-judge for evals, your leaderboard might just be measuring which model the judge likes most.
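One cheap, reproducible check you can run on your own evals is position bias: swap the order of the two candidate answers and count how often the verdict flips. The `judge` callable below is a stand-in for whatever LLM-as-judge call you already use; the flip-rate bookkeeping is the point:

```python
import random

def flip_rate(judge, pairs, trials=None):
    """Fraction of comparisons where the verdict changes when A and B swap places.

    `judge(prompt, answer_a, answer_b)` should return "A" or "B"; it is a
    placeholder for your existing LLM-as-judge call. A high flip rate means the
    judge is rewarding position, not quality."""
    sample = pairs if trials is None else random.sample(pairs, trials)
    flips = 0
    for prompt, a, b in sample:
        forward = judge(prompt, a, b)    # answer a shown first
        backward = judge(prompt, b, a)   # answer b shown first
        # An unbiased judge picks the same underlying answer, so the label should flip.
        if (forward == "A") != (backward == "B"):
            flips += 1
    return flips / len(sample)
```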
Anthropic’s quiet efficiency flex
Claude’s new “effort parameter” lets developers control how much computational work the model applies to each task. At medium effort, Opus 4.5 matches Sonnet 4.5’s best SWE-bench score while using 76% fewer tokens.
Google’s multimodal dominance
Gemini 3 Pro remains the ceiling for multimodal understanding. 81% on MMMU-Pro. 87.6% on Video-MMMU. 72.7% on ScreenSpot-Pro, where Claude 3.5 Sonnet scored 36.2% and GPT-5.1 just 3.5%. If your workflow involves complex visual inputs, Google is still king. (We also saw Gemini 3 Pro Vision outperform in our trading arena.)
The real signal this week
Agent harness design is becoming the actual bottleneck, not model capability: the hard work is now logging decisions, grading outcomes, capturing real-world feedback, managing skill libraries, and maintaining memory modules. The model is the engine. The harness is everything else that determines whether the car actually drives.
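For concreteness, here's a rough sketch of what that scaffolding tends to look like; the names and shapes are ours for illustration, not any particular framework's API:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Decision:
    task_id: str
    action: str
    rationale: str
    outcome: Any = None
    grade: float | None = None

@dataclass
class Harness:
    """Minimal harness: the model proposes an action, everything else is bookkeeping."""
    model: Callable[[str, list[str]], Decision]        # (task, memory) -> decision
    skills: dict[str, Callable] = field(default_factory=dict)
    memory: list[str] = field(default_factory=list)
    log: list[Decision] = field(default_factory=list)

    def run(self, task: str, grade: Callable[[Any], float]) -> Decision:
        decision = self.model(task, self.memory)            # the engine
        decision.outcome = self.skills[decision.action]()   # execute a known skill
        decision.grade = grade(decision.outcome)            # capture real-world feedback
        self.log.append(decision)                           # audit trail for later analysis
        self.memory.append(f"{task}: {decision.action} -> {decision.grade:.2f}")
        return decision
```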
Prime Intellect’s Recursive Language Models are pointing in the same direction—training models to manage their own context by delegating work to sub-models and tools rather than just cramming more tokens into the window.
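The gist, as we read it: when the context doesn't fit, delegate chunks to cheaper sub-calls and reason over their condensed notes instead of one giant prompt. A toy sketch, with `llm` as a placeholder for whichever model you'd call:

```python
def recursive_answer(llm, question: str, context: str, max_chars: int = 8000) -> str:
    """If the context fits, answer directly; otherwise split it, delegate each
    chunk to a sub-call, and answer over the condensed notes instead."""
    if len(context) <= max_chars:
        return llm(f"Context:\n{context}\n\nQuestion: {question}")
    mid = len(context) // 2
    notes = [
        recursive_answer(llm, f"Extract anything relevant to: {question}", part, max_chars)
        for part in (context[:mid], context[mid:])
    ]
    return llm("Notes from sub-calls:\n" + "\n---\n".join(notes) + f"\n\nQuestion: {question}")
```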
2026 is shaping up to be the year everyone realizes verification infrastructure matters more than the next 10% benchmark improvement. Exciting times if you’re building for reliability. Exhausting times if you’re still chasing leaderboard headlines.
AI Research Rabbit Holes
GLM-4.7 emerges as frontier coding model, topping unbenchmaxxable benchmarks
NVIDIA DGX Spark updates deliver 2.6x performance boost with NVFP4, enabling local 235B-parameter models
Miro Thinker 1.5 released: Open-source research agent with interleaved thinking and multi-step analysis
https://huggingface.co/collections/miromind-ai/mirothinker-v15
Falcon H1R-7B launched: Mamba-Transformer hybrid outperforming larger models in math and coding
https://huggingface.co/blog/tiiuae/falcon-h1r-7b
Agent Harness introduced: Infrastructure for managing long-running AI agent tasks and evaluations
https://www.philschmid.de/agent-harness-2026
AntAngelMed released: Efficient 6.1B MoE medical LLM with 128K context and strong benchmarks
https://huggingface.co/MedAIBase/AntAngelMed
InternVLA-A1 open-sourced: Unified VLA model excelling in dynamic environments and benchmarks
https://modelscope.cn/models/internrobotics/InternVLA-A1
LFM2.5 family released: Tiny on-device models with improved quality and multimodal support
https://www.liquid.ai/blog/introducing-lfm2-5-the-next-generation-of-on-device-ai
https://huggingface.co/collections/LiquidAI/lfm25
ThinkGen: Game-theoretic controller for diffusion models improving generation benchmarks
https://github.com/jiaosiyuu/ThinkGen
https://arxiv.org/abs/2512.23568
Mi:dm K 2.5 Pro launched: Proprietary reasoning model strong in agentic tool-use benchmarks
https://artificialanalysis.ai/
Opus 4.5 leads re-vamped LiveBench leaderboard reflecting real-world LLM performance
SWE-EVO benchmark reveals gaps in long-horizon coding agent performance
https://arxiv.org/abs/2512.16553
TeleChat3-105B-A4.7B-Thinking open-sourced: Sparse MoE leading in coding benchmarks
https://modelscope.cn/models/TeleChat/TeleChat3-105B-A4.7B-Thinking
NousCoder-14b: Competitive olympiad programming model, scores Pass@1 accuracy of 67.87%, +7.08% over Qwen's baseline
https://nousresearch.com/nouscoder-14b-a-competitive-olympiad-programming-model/
AI Agent Systems: Comprehensive survey on architectures, applications, and evaluation methods
https://arxiv.org/abs/2601.01743v1
Elon Musk claims Grok 5 nears a perfect score on Humanity’s Last Exam benchmark
What We’re Hacking This Week
You can’t improve what you can’t repeatedly measure. Live trading arenas generate insights, but market conditions never repeat—so isolating why an agent won or lost becomes guesswork. We’ve been building infrastructure these past few weeks to fix that.
Replay Lab
Microservice for generating labelled training data for models that predict financial market events
Hybrid storage: eager for raw OHLCV candle data, lazy computation for indicators and annotations
Ground truth annotations for agent scoring, dump events, pump events, local extremes, sweep reclaims
On-demand chart generation with technical indicators and annotation overlays
Agents can self-score against historical events instead of waiting for live validation (see the sketch below)
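To give a flavour of the labelled output, here's a simplified, illustrative schema for a ground-truth event record plus the most basic self-scoring check; the real service's fields and scoring are richer than this:

```python
from dataclasses import dataclass
from enum import Enum

class EventType(Enum):
    PUMP = "pump"
    DUMP = "dump"
    LOCAL_EXTREME = "local_extreme"
    SWEEP_RECLAIM = "sweep_reclaim"

@dataclass
class GroundTruthEvent:
    """Illustrative annotation record: one labelled event over a candle window."""
    symbol: str            # e.g. "ETH-USD"
    start_ts: int          # unix seconds, first candle of the event
    end_ts: int            # unix seconds, last candle of the event
    event: EventType
    magnitude_pct: float   # size of the move the label refers to

def score_call(prediction: EventType, truth: GroundTruthEvent) -> float:
    """Self-scoring in its simplest form: did the agent name the right event?"""
    return 1.0 if prediction == truth.event else 0.0
```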
Backtest Lab
Download historical market data (via Replay Lab), backtest strategies with realistic fees and slippage
Scan all strategies against a dataset to find top performers, rank and compare across multiple backtests
Architecture exposes typed inputs/outputs for MCP servers, REST APIs, or direct programmatic use
Together, they form a single pipeline: Replay Lab generates the labelled market data and ground truth, then Backtest Lab runs agent strategies against it systematically. Stress-test against known edge cases, isolate failure modes, iterate, and validate to train the best trading agents.
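Wired together, the loop looks roughly like this; client names, methods, and the fee/slippage numbers are placeholders for illustration, not the labs' real interfaces or defaults:

```python
FEE_RATE = 0.001      # 10 bps per fill -- assumed for illustration
SLIPPAGE = 0.0005     # 5 bps adverse slippage per fill -- assumed for illustration

def fill_price(mid: float, side: str) -> float:
    """Effective fill price with slippage and fees applied, as a backtest would model it."""
    px = mid * (1 + SLIPPAGE) if side == "buy" else mid * (1 - SLIPPAGE)
    return px * (1 + FEE_RATE) if side == "buy" else px * (1 - FEE_RATE)

def run_pipeline(replay_lab, backtest_lab, strategy, symbol="ETH-USD"):
    """Glue code: labelled candles out of Replay Lab, strategy results out of Backtest Lab.
    Both clients and their methods are stand-ins, not the services' actual APIs."""
    dataset = replay_lab.get_labelled_window(symbol=symbol, days=7)   # candles + ground truth
    result = backtest_lab.run(strategy=strategy, candles=dataset.candles,
                              fee_rate=FEE_RATE, slippage=SLIPPAGE)
    # Self-score: compare the strategy's calls against Replay Lab's ground-truth events.
    hits = sum(call.event == dataset.truth_at(call.ts) for call in result.calls)
    return result.pnl, hits / max(len(result.calls), 1)
```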