In the Arena: Week 5
DeepSeek matches GPT-5 for pennies. Open-source closes the gap. Benchmarks get exposed. The smartest model doesn't win anymore—the most reliable system does.
Less data, smarter trades?
We expected more information to mean better LLM trading decisions. We were wrong.
When we stripped four frontier models down to nothing but a price chart – no order books, volume data or technical indicators – and compared their $ETH trading performance against counterparts drowning in every kind of market intelligence, the chart-readers won. Gemini 3 Pro Chart-Only came out on top, returning +9.09% over a few trading days, nearly double the returns of its data-rich twin. Of all the models tested, three of the top four spots went to agents trading on pixels alone.
This wasn’t supposed to happen. Conventional wisdom says more signal means better decisions. But when we dug into their trading behaviours, we noticed something in the data. Chart-only vision models weren’t just profitable—they were disciplined. They made fewer trades, extracted more alpha per position, and avoided the churn that burned through the returns of their data-rich counterparts. The models with complete market feeds kept second-guessing themselves, reversing positions, and reacting to noise that the vision agents couldn’t even see.
However, here's where it gets interesting: the data-rich models outperformed on risk-adjusted returns. Notably, DeepSeek's data-fed variant posted a Sortino ratio of 2.53, nearly twice that of the best vision performer. So what's actually happening? Are chart-only models fundamentally blind to risk? Are the data-rich ones so paralysed by information that they hedge themselves out of their best trades, even as they cut their losers early? One week of data can't answer that. Market conditions shift, luck plays a role, and a single run proves nothing.
So this week we're running the same arena again. Same models. Same asset. Same setup. This week's arena will confirm whether vision-only trading is a genuine edge or a statistical ghost.
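For anyone who wants to sanity-check the metric at home, here's a minimal sketch of how a Sortino ratio can be computed from per-period returns, assuming a zero target return and no annualisation (our arena's exact methodology may differ):

```python
import numpy as np

def sortino_ratio(returns, target=0.0):
    """Sortino ratio: mean excess return divided by downside deviation.

    Only returns below `target` count as risk, which is why an agent that
    cuts its losers early can score well here even with modest total returns.
    """
    r = np.asarray(returns, dtype=float)
    excess = r - target
    downside = np.minimum(excess, 0.0)            # keep only the losing periods
    downside_dev = np.sqrt(np.mean(downside ** 2))
    if downside_dev == 0:
        return float("inf")                       # no losing periods observed
    return excess.mean() / downside_dev

# e.g. a week of daily P&L for one agent
print(sortino_ratio([0.01, -0.002, 0.004, -0.001, 0.008]))
```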
AI Breakthroughs
DeepSeek keeps cooking
DeepSeek dropped a paper on Manifold-Constrained Hyper-Connections that basically says, “what if we just... made residual paths not explode at scale?”
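We haven't re-derived the paper, but the broader hyper-connections idea it builds on swaps the single residual stream for several streams mixed by learnable weights. A rough PyTorch sketch of that general idea, not DeepSeek's manifold-constrained formulation:

```python
import torch
import torch.nn as nn

class HyperConnectionBlock(nn.Module):
    """Toy illustration: k parallel residual streams with learnable read,
    write and mixing weights, instead of a single x + f(x) path.
    Purely schematic; the paper's actual math and constraints differ."""

    def __init__(self, layer: nn.Module, k: int = 4):
        super().__init__()
        self.layer = layer                               # e.g. an attention or MLP sub-layer
        self.read = nn.Parameter(torch.ones(k) / k)      # how streams feed the sub-layer
        self.write = nn.Parameter(torch.ones(k) / k)     # how its output is spread back
        self.mix = nn.Parameter(torch.eye(k))            # stream-to-stream mixing

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (k, batch, seq, dim)
        h = torch.einsum("k,kbsd->bsd", self.read, streams)     # weighted view of the streams
        out = self.layer(h)                                      # run the sub-layer once
        mixed = torch.einsum("ij,jbsd->ibsd", self.mix, streams) # recombine residual streams
        return mixed + self.write.view(-1, 1, 1, 1) * out        # write the output back to all streams

# usage: block = HyperConnectionBlock(nn.Linear(512, 512))
```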
DeepSeek V3.2 is now performing on par with GPT-5 across most standard tests, trailing only slightly on extremely long context retrieval. The Speciale variant secured gold-medal scores on IMO 2025 and IOI 2025. A model trained for a fraction of the cost, with open weights, is now beating the best closed models on math olympiad problems.
The open-source situation
Zuckerberg announced Llama 4 Scout and Maverick in April 2025. The experimental version scored 1417 Elo on LMArena. The vanilla release? Ranked 32nd. Llama 4 Behemoth is still training and supposedly beats GPT-4.5 and Gemini 2.0 Pro on STEM benchmarks. We’ll see.
Qwen 3 from Alibaba is now powering 90,000+ enterprises on Alibaba Cloud. GLM-4.7 Thinking from ZAI hit 89% on LiveCodeBench—matching GPT-5.2. Open models have genuinely closed the gap on coding tasks. The moat around proprietary models is getting thinner every quarter.
Benchmark theatre continues
IQuest dropped a 40B model claiming 81.4% on SWE-Bench Verified. SOTA! Revolutionary! Then people actually used it and discovered it loops endlessly and might not even beat a well-tuned 20B. Guess it optimized for the benchmark. Smh.
Meanwhile, a researcher published reproducible evidence that LLM-as-judge systems have systematic bias. If you’re using LLM-as-judge for evals, your leaderboard might just be measuring which model the judge likes most.
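One cheap, reproducible check you can run on your own evals is position bias: swap the order of the two candidate answers and count how often the verdict flips. The `judge` callable below is a stand-in for whatever LLM-as-judge call you already use; the flip-rate bookkeeping is the point:

```python
import random

def flip_rate(judge, pairs, trials=None):
    """Fraction of comparisons where the verdict changes when A and B swap places.

    `judge(prompt, answer_a, answer_b)` should return "A" or "B"; it is a
    placeholder for your existing LLM-as-judge call. A high flip rate means the
    judge is rewarding position, not quality."""
    sample = pairs if trials is None else random.sample(pairs, trials)
    flips = 0
    for prompt, a, b in sample:
        forward = judge(prompt, a, b)    # answer a shown first
        backward = judge(prompt, b, a)   # answer b shown first
        # An unbiased judge picks the same underlying answer, so the label should flip.
        if (forward == "A") != (backward == "B"):
            flips += 1
    return flips / len(sample)
```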
Anthropic’s quiet efficiency flex
Claude’s new “effort parameter” lets developers control how much computational work the model applies to each task. At medium effort, Opus 4.5 matches Sonnet 4.5’s best SWE-bench score while using 76% fewer tokens.
Google’s multimodal dominance
Gemini 3 Pro remains the ceiling for multimodal understanding. 81% on MMMU-Pro. 87.6% on Video-MMMU. 72.7% on ScreenSpot-Pro, where Claude 3.5 Sonnet scored 36.2% and GPT-5.1 just 3.5%. If your workflow involves complex visual inputs, Google is still king. (We also saw Gemini 3 Pro Vision outperform in our trading arena.)
The real signal this week
Agent harness design is becoming the actual bottleneck, not model capability: the hard work is now logging decisions, grading outcomes, capturing real-world feedback, managing skill libraries, and maintaining memory modules. The model is the engine. The harness is everything else that determines whether the car actually drives.
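For concreteness, here's a rough sketch of what that scaffolding tends to look like; the names and shapes are ours for illustration, not any particular framework's API:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Decision:
    task_id: str
    action: str
    rationale: str
    outcome: Any = None
    grade: float | None = None

@dataclass
class Harness:
    """Minimal harness: the model proposes an action, everything else is bookkeeping."""
    model: Callable[[str, list[str]], Decision]        # (task, memory) -> decision
    skills: dict[str, Callable] = field(default_factory=dict)
    memory: list[str] = field(default_factory=list)
    log: list[Decision] = field(default_factory=list)

    def run(self, task: str, grade: Callable[[Any], float]) -> Decision:
        decision = self.model(task, self.memory)            # the engine
        decision.outcome = self.skills[decision.action]()   # execute a known skill
        decision.grade = grade(decision.outcome)            # capture real-world feedback
        self.log.append(decision)                           # audit trail for later analysis
        self.memory.append(f"{task}: {decision.action} -> {decision.grade:.2f}")
        return decision
```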
Prime Intellect’s Recursive Language Models are pointing in the same direction—training models to manage their own context by delegating work to sub-models and tools rather than just cramming more tokens into the window.
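The gist, as we read it: when the context doesn't fit, delegate chunks to cheaper sub-calls and reason over their condensed notes instead of one giant prompt. A toy sketch, with `llm` as a placeholder for whichever model you'd call:

```python
def recursive_answer(llm, question: str, context: str, max_chars: int = 8000) -> str:
    """If the context fits, answer directly; otherwise split it, delegate each
    chunk to a sub-call, and answer over the condensed notes instead."""
    if len(context) <= max_chars:
        return llm(f"Context:\n{context}\n\nQuestion: {question}")
    mid = len(context) // 2
    notes = [
        recursive_answer(llm, f"Extract anything relevant to: {question}", part, max_chars)
        for part in (context[:mid], context[mid:])
    ]
    return llm("Notes from sub-calls:\n" + "\n---\n".join(notes) + f"\n\nQuestion: {question}")
```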
2026 is shaping up to be the year everyone realizes verification infrastructure matters more than the next 10% benchmark improvement. Exciting times if you’re building for reliability. Exhausting times if you’re still chasing leaderboard headlines.
AI Research Rabbit Holes
GLM-4.7 emerges as frontier coding model, topping unbenchmaxxable benchmarks
NVIDIA DGX Spark updates deliver 2.6x performance boost with NVFP4, enabling local 235B-parameter models
Miro Thinker 1.5 released: Open-source research agent with interleaved thinking and multi-step analysis
https://huggingface.co/collections/miromind-ai/mirothinker-v15
Falcon H1R-7B launched: Mamba-Transformer hybrid outperforming larger models in math and coding
https://huggingface.co/blog/tiiuae/falcon-h1r-7b
Agent Harness introduced: Infrastructure for managing long-running AI agent tasks and evaluations
https://www.philschmid.de/agent-harness-2026
AntAngelMed released: Efficient 6.1B MoE medical LLM with 128K context and strong benchmarks
https://huggingface.co/MedAIBase/AntAngelMed
InternVLA-A1 open-sourced: Unified VLA model excelling in dynamic environments and benchmarks
https://modelscope.cn/models/internrobotics/InternVLA-A1
LFM2.5 family released: Tiny on-device models with improved quality and multimodal support
https://www.liquid.ai/blog/introducing-lfm2-5-the-next-generation-of-on-device-ai
https://huggingface.co/collections/LiquidAI/lfm25
ThinkGen: Game-theoretic controller for diffusion models improving generation benchmarks
https://github.com/jiaosiyuu/ThinkGen
https://arxiv.org/abs/2512.23568
Mi:dm K 2.5 Pro launched: Proprietary reasoning model strong in agentic tool-use benchmarks
https://artificialanalysis.ai/
Opus 4.5 leads re-vamped LiveBench leaderboard reflecting real-world LLM performance
SWE-EVO benchmark reveals gaps in long-horizon coding agent performance
https://arxiv.org/abs/2512.16553
TeleChat3-105B-A4.7B-Thinking open-sourced: Sparse MoE leading in coding benchmarks
https://modelscope.cn/models/TeleChat/TeleChat3-105B-A4.7B-Thinking
NousCoder-14b: Competitive olympiad programming model, scores Pass@1 accuracy of 67.87%, +7.08% over Qwen's baseline
https://nousresearch.com/nouscoder-14b-a-competitive-olympiad-programming-model/
AI Agent Systems: Comprehensive survey on architectures, applications, and evaluation methods
https://arxiv.org/abs/2601.01743v1
Elon Musk claims Grok 5 nears a perfect score on Humanity’s Last Exam benchmark
What We’re Hacking This Week
You can’t improve what you can’t repeatedly measure. Live trading arenas generate insights, but market conditions never repeat—so isolating why an agent won or lost becomes guesswork. We’ve been building infrastructure these past few weeks to fix that.
Replay Lab
Microservice for generating labelled training data for models that predict financial market events
Hybrid storage: eager for raw OHLCV candle data, lazy computation for indicators and annotations
Ground truth annotations for agent scoring, dump events, pump events, local extremes, sweep reclaims
On-demand chart generation with technical indicators and annotation overlays
Agents can self-score against historical events instead of waiting for live validation (see the sketch below)
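To give a flavour of the labelled output, here's a simplified, illustrative schema for a ground-truth event record plus the most basic self-scoring check; the real service's fields and scoring are richer than this:

```python
from dataclasses import dataclass
from enum import Enum

class EventType(Enum):
    PUMP = "pump"
    DUMP = "dump"
    LOCAL_EXTREME = "local_extreme"
    SWEEP_RECLAIM = "sweep_reclaim"

@dataclass
class GroundTruthEvent:
    """Illustrative annotation record: one labelled event over a candle window."""
    symbol: str            # e.g. "ETH-USD"
    start_ts: int          # unix seconds, first candle of the event
    end_ts: int            # unix seconds, last candle of the event
    event: EventType
    magnitude_pct: float   # size of the move the label refers to

def score_call(prediction: EventType, truth: GroundTruthEvent) -> float:
    """Self-scoring in its simplest form: did the agent name the right event?"""
    return 1.0 if prediction == truth.event else 0.0
```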
Backtest Lab
Download historical market data (via Replay Lab), backtest strategies with realistic fees and slippage
Scan all strategies against a dataset to find top performers, rank and compare across multiple backtests
Architecture exposes typed inputs/outputs for MCP servers, REST APIs, or direct programmatic use
Together, they form a single pipeline: Replay Lab generates the labelled market data and ground truth, then Backtest Lab runs agent strategies against it systematically. Stress-test against known edge cases, isolate failure modes, iterate, and validate to train the best trading agents.
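Wired together, the loop looks roughly like this; client names, methods, and the fee/slippage numbers are placeholders for illustration, not the labs' real interfaces or defaults:

```python
FEE_RATE = 0.001      # 10 bps per fill -- assumed for illustration
SLIPPAGE = 0.0005     # 5 bps adverse slippage per fill -- assumed for illustration

def fill_price(mid: float, side: str) -> float:
    """Effective fill price with slippage and fees applied, as a backtest would model it."""
    px = mid * (1 + SLIPPAGE) if side == "buy" else mid * (1 - SLIPPAGE)
    return px * (1 + FEE_RATE) if side == "buy" else px * (1 - FEE_RATE)

def run_pipeline(replay_lab, backtest_lab, strategy, symbol="ETH-USD"):
    """Glue code: labelled candles out of Replay Lab, strategy results out of Backtest Lab.
    Both clients and their methods are stand-ins, not the services' actual APIs."""
    dataset = replay_lab.get_labelled_window(symbol=symbol, days=7)   # candles + ground truth
    result = backtest_lab.run(strategy=strategy, candles=dataset.candles,
                              fee_rate=FEE_RATE, slippage=SLIPPAGE)
    # Self-score: compare the strategy's calls against Replay Lab's ground-truth events.
    hits = sum(call.event == dataset.truth_at(call.ts) for call in result.calls)
    return result.pnl, hits / max(len(result.calls), 1)
```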