In the Arena: Week 4

Anthropic and OpenAI shipped major benchmarking frameworks, open models are closing the gap on coding evals, and we're testing whether AI can trade profitably by reading a chart alone.

Jan 02, 2026

After running fourteen LLM instances through a single-asset trading competition with real capital, we reached an uncomfortable conclusion: frontier models aren't ready to trade your money unsupervised.

But buried in the data was something interesting: some agents showed genuine crash detection skill. The models couldn't capture opportunity, but they could recognise danger. That finding shifted our focus: if LLMs possess narrow skills that surface under specific conditions, what other capabilities might we be missing by testing them as general-purpose traders?

We decided to explore granular skills, starting with something fundamental—chart reading. Traders have stared at candlesticks for decades, developing intuitions about momentum and reversal patterns, so we stripped four frontier models down to just the ETH chart and matched them against counterparts with full market data. Both groups are trading live on Aerodrome until January 6th.

Can models trade profitably by just looking at a chart? Track how vision-only agents stack up against counterparts with complete market data.

Single-Asset Trading On Aerodrome

AI Breakthroughs

ANTHROPIC SHIPPED AN OPEN-SOURCE TOOL TO CATCH MISALIGNED MODELS. FINALLY, RECEIPTS.

Anthropic dropped Bloom this week, an open-source framework for generating automated behavioural evaluations, including tests for self-preferential bias, sabotage tendencies, and delusional sycophancy.

According to Anthropic, the tool reliably separates aligned from misaligned models, and its scores correlate strongly with human judgment.

OPENAI’S FRONTIERSCIENCE: WHEN BENCHMARKS MEET WET LABS

OpenAI released FrontierScience, a PhD-level scientific reasoning benchmark that does something wild: it validates answers in actual laboratories.

Model evaluations mean nothing without real-world application testing. You can ace every reasoning benchmark and still be useless in a lab. Or an office. Or anywhere that isn’t a standardised test.

OPEN MODELS ARE CLOSING THE GAP (ON THE EVALS THAT MATTER)

GLM-4.7: 73.8% on SWE-bench Verified. DeepSeek-V3: 73.1%. Claude Sonnet 4.5: 77.2%.

That gap is shrinking fast. On coding tasks, where evaluation is most concrete and gameable tricks are hardest to hide, open models are now within striking distance. MiniMax M2.1 shipped across six developer surfaces in a week, and MiniMax claims roughly 1/10 of Opus pricing.

AI Research Rabbitholes

Liquid AI releases LFM2-2.6B-Exp, a 3B model outperforming much larger models on instruction, knowledge, and math benchmarks

Liquid AI@liquidai

Meet the strongest 3B model on the market. LFM2-2.6B-Exp is an experimental checkpoint built on LFM2-2.6B using pure reinforcement learning. > Consistent improvements in instruction following, knowledge, and math benchmarks > Outperforms other 3B models in these domains > Its

1:59 PM · Dec 25, 2025 · 645K Views

85 Replies · 237 Reposts · 2.15K Likes

Model Download: https://huggingface.co/liquidai

New benchmark for web search agents: Needle in the Web shows most systems fail on vague queries

Rohan Paul@rohanpaul_ai

New Tsinghua University paper builds a new test for web search agents and shows they still fail on vague queries. Needle in the Web is a new benchmark that exposes how badly search agents handle vague web queries. It proposes a test where an agent gets 3 fuzzy clues from a real

11:14 AM · Dec 31, 2025 · 1.47K Views

1 Repost · 11 Likes

Paper: https://arxiv.org/abs/2512.16553

WeDLM-8B diffusion language model launched with parallel decoding, beating Qwen3-8B on multiple benchmarks

Victor M@victormustar

WeDLM-8B: a diffusion language model with parallel decoding 👀 🔹Beats Qwen3-8B-Instruct on 5/6 benchmarks 🔹3-6× faster on math reasoning (vs vLLM Qwen3-8B) 🔹Native KV cache & FlashAttention support huggingface.co/tencent/WeDLM-…

3:08 PM · Dec 29, 2025 · 47.2K Views

11 Replies · 67 Reposts · 429 Likes

Model Download: https://huggingface.co/tencent/WeDLM-8B-Instruct

DeepMind’s parallel verification loops reportedly beat chain-of-thought reasoning by 37% on complex benchmarks and are 52% better at catching logical errors

Jainam Parmar@aiwithjainam

The results are insane: On complex reasoning benchmarks: - 37% improvement over standard chain-of-thought - 52% better at catching logical errors - 3x faster convergence to correct solutions This isn't incremental. It's architectural.

1:17 PM · Dec 29, 2025 · 10.5K Views

1 Reply · 6 Reposts · 71 Likes

Boogiebench introduced: New leaderboard for evaluating LLM music composition using Strudel JS

Nick Levine@status_effects

How well can language models like Claude Opus and GPT-5.2 write music? Introducing boogiebench: vote in anonymized LLM music composition battles. Unlike Suno, LLMs haven't been trained explicitly on this task, making it a nice generalization test (coding, aesthetics, temporal

7:58 PM · Dec 30, 2025 · 3.54K Views

5 Replies · 6 Reposts · 26 Likes

Project: https://www.boogiebench.com/

GPT-5.2 Pro achieves 29.2% on FrontierMath, a highly challenging math benchmark

Chubby♨️@kimmonismus

GPT-5.2 pro with astonishing 29,2% in frontier math. ngl thats impressive. Especially because this is one of the toughest benchmarks of all.

12:08 AM · Dec 31, 2025 · 23.2K Views

14 Replies · 33 Reposts · 356 Likes

Report: https://epoch.ai/frontiermath

SpatialBench launched: New benchmark for AI agents in spatial biology data analysis, revealing low base-model accuracy

Kenny Workman@kenbwork

2026 will be the year of agents in biology. But we need better benchmarks. We worked with scientists to turn real world analysis into verifiable problems. SpatialBench stratifies frontier models, shows harnesses matter, and reveals distinct failure modes between model families:

3:24 PM · Dec 26, 2025 · 92.3K Views

6 Replies · 21 Reposts · 140 Likes

Paper: https://arxiv.org/abs/spatialbench-paper | Github: https://github.com/latchbio/spatialbench

QwenLong-L1.5 open-sourced: 30B MoE model leading in long-context reasoning benchmarks

ModelScope@ModelScope2022

🚀 New on ModelScope: QwenLong-L1.5 is now fully open-source! A 30B model (3B active params) that matches GPT-5 & Gemini-2.5-Pro in long-context reasoning. 🔥 Key wins: ✅ +31.7 pts on OpenAI’s MRCR (128K context → SOTA across all models) ✅ Matches Gemini-2.5-Pro on 6 major

7:41 AM · Dec 23, 2025 · 42.6K Views

14 Replies · 86 Reposts · 610 Likes

Paper: https://arxiv.org/abs/2512.13313 | Model: https://modelscope.cn/models/qwen/QwenLong-L1.5

BENTO benchmark released for evaluating classical and AI docking tools in drug design

Biology+AI Daily@BiologyAIDaily

BENTO: Benchmarking Classical and AI Docking on Drug Design–Relevant Data 1. A new benchmark study, BENTO, evaluates 11 tools for predicting protein-ligand interactions, including classical docking methods, deep learning models, and co-folding algorithms. It highlights the

8:51 AM · Dec 31, 2025 · 1.12K Views

1 Reply · 4 Reposts · 32 Likes

Paper: https://arxiv.org/abs/bento-paper

MiniMax M2.1 matches frontier performance at low cost, highlighted in coding and reliability benchmarks

Kelsey@Kelsey_Asami

I just watched Theo’s live breakdown of MiniMax M2.1, and it’s genuinely impressive. This isn’t just another model release, @MiniMax__AI M2.1 matches Claude Opus 4.1 level performance at just 1/60th of the price, redefining what “competitive” even means. With strong reliability

11:17 AM · Dec 31, 2025 · 6.9K Views

12 Replies · 14 Reposts · 36 Likes

What We’re Hacking This Week

REPLAY LAB NOW GENERATES CHART IMAGES. THIS MATTERS MORE THAN IT SOUNDS.

Replay Lab can now generate candlestick charts for any asset, any timeframe, any time window.

You can overlay indicators. You can overlay annotations. Request “dump events” with the exact window, threshold, candle width, and volume adjustment you want. Define an event, visualise it, score agents against it. Repeat.
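To make "define an event, score agents against it" concrete, here is a minimal sketch of what a dump-event definition could look like. It is not Replay Lab's API: `Candle`, `DumpEvent`, `find_dump_events`, and `score_alerts` are hypothetical names, and the rule used here (a drop from the window's opening price to its lowest low, gated by volume relative to the series average) is an assumption about how the window, threshold, and volume adjustment might combine. Candle width is simply whatever granularity the input candles have.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candle:
    ts: int        # candle open time, in seconds
    open: float
    high: float
    low: float
    close: float
    volume: float


@dataclass
class DumpEvent:
    start_ts: int  # open time of the first candle in the window
    end_ts: int    # open time of the last candle in the window
    drop: float    # fractional fall from window open to window low


def find_dump_events(candles: Sequence[Candle], window: int = 4,
                     threshold: float = 0.05,
                     min_volume_ratio: float = 1.0) -> List[DumpEvent]:
    """Flag non-overlapping windows whose price drop and volume both qualify."""
    if window <= 0 or len(candles) < window:
        return []
    # Hypothetical volume adjustment: compare to the average over the series.
    baseline_vol = sum(c.volume for c in candles) / len(candles)
    events: List[DumpEvent] = []
    i = 0
    while i + window <= len(candles):
        span = candles[i:i + window]
        start_price = span[0].open
        trough = min(c.low for c in span)
        drop = (start_price - trough) / start_price
        avg_vol = sum(c.volume for c in span) / window
        if drop >= threshold and avg_vol >= min_volume_ratio * baseline_vol:
            events.append(DumpEvent(span[0].ts, span[-1].ts, drop))
            i += window  # skip ahead so events never overlap
        else:
            i += 1
    return events


def score_alerts(events: Sequence[DumpEvent], alert_ts: Sequence[int],
                 lead_seconds: int = 0) -> dict:
    """Recall: share of events with an alert. Precision: share of alerts on an event."""
    def covers(ev: DumpEvent, ts: int) -> bool:
        return ev.start_ts - lead_seconds <= ts <= ev.end_ts

    caught = sum(1 for ev in events if any(covers(ev, ts) for ts in alert_ts))
    useful = sum(1 for ts in alert_ts if any(covers(ev, ts) for ev in events))
    return {
        "recall": caught / len(events) if events else 0.0,
        "precision": useful / len(alert_ts) if alert_ts else 0.0,
    }
```

Keeping the event definition as plain parameters means the same candles can be rescored under a stricter threshold or a longer window without touching the agents.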

WE’RE TEACHING AI TO READ CHARTS LIKE TRADERS. IT’S GOING ABOUT AS WELL AS YOU’D EXPECT.

New experiment launched on December 30th: Can LLMs analyse chart images the way human traders do?

We’re running a baseline comparison alongside it, using instances of the same models with the same capital, but with full market data. By January 6th, we’ll know if vision-based trading is a real capability or just vibes with extra steps.

30 MODELS CALLING BITCOIN BOTTOMS. SCORED AUTOMATICALLY. RUNNING LIVE.

We pushed the benchmark harness to run ~30 models on a simple task: on historical BTC windows, call whether a local bottom has been seen recently. Every 15 minutes, it pulls Replay Lab state, runs the bottom callers, and updates current predictions, per-model track records, and performance by horizon.

Ground truth comes straight from the annotations API. The scoring loop is end-to-end and consistent.
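As a rough illustration of what a per-model track record and performance by horizon could mean, the sketch below scores boolean bottom calls against boolean labels. Everything in it is an assumption rather than the harness's actual logic: `score_bottom_calls` is a hypothetical name, labels stand in for whatever the annotations API returns, and "performance by horizon" is interpreted here as the mean forward return a fixed number of candles after each positive call.

```python
from typing import Dict, List, Optional, Sequence


def score_bottom_calls(calls: Dict[str, Dict[int, bool]],
                       labels: Dict[int, bool],
                       prices: Sequence[float],
                       horizons: Sequence[int] = (1, 4)) -> Dict[str, dict]:
    """Score each model's calls.

    calls:  model name -> {candle index: did the model say a bottom is in?}
    labels: candle index -> ground-truth answer (assumed to come from annotations)
    prices: close price per candle index
    """
    report: Dict[str, dict] = {}
    for model, model_calls in calls.items():
        # Only score calls that have a ground-truth label.
        scored = [(i, said) for i, said in sorted(model_calls.items()) if i in labels]
        hits = sum(1 for i, said in scored if said == labels[i])
        accuracy = hits / len(scored) if scored else 0.0

        by_horizon: Dict[int, Optional[float]] = {}
        for h in horizons:
            # Forward return h candles after each positive call, where data exists.
            rets: List[float] = [
                prices[i + h] / prices[i] - 1.0
                for i, said in scored
                if said and i + h < len(prices)
            ]
            by_horizon[h] = sum(rets) / len(rets) if rets else None

        report[model] = {
            "n": len(scored),
            "accuracy": accuracy,
            "mean_return_by_horizon": by_horizon,
        }
    return report
```

Reporting `None` for a horizon with no usable calls, rather than zero, keeps a model that never called a bottom from looking like one that called it and broke even.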

AGENT TEMPLATES: THE RAMP TO BENCHMARKS

We kept building out agent templates that show the progression we want for Replay Lab-based agents:

1. Guess Wheel of Fortune
2. Create Puzzle and Guess
3. Multi-round Guessing with Scoring
4. LLM Predict BTC Dumps on Replay Lab
5. Matrix of LLMs Predict Dumps and Are Scored
6. Matrix does 3-Leg EV Prediction
7. 30-Agent BTC Bottom Prediction Benchmark

Early templates are intentionally simple so people can understand the loop. Later templates call Replay Lab and turn into a benchmark harness where you can run multiple models and score them in one run.
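The loop the early templates teach can be sketched in a few lines. This is an illustration, not the templates' code: agents are stand-in Python callables rather than LLM calls, and `mask`, `play`, `run_matrix`, and the two toy agents are hypothetical names. The shape is the part that carries over: show the agent some state, take its answer, score it, then run the same loop across a matrix of agents and tasks.

```python
from typing import Callable, Dict, List, Optional, Set

# An agent sees the masked puzzle and the letters guessed so far, and returns one letter.
Agent = Callable[[str, Set[str]], str]


def mask(puzzle: str, guessed: Set[str]) -> str:
    """Hide every letter that has not been guessed yet."""
    return "".join(
        ch if (not ch.isalpha() or ch.lower() in guessed) else "_"
        for ch in puzzle
    )


def play(agent: Agent, puzzle: str, max_rounds: int = 26) -> Optional[int]:
    """Return the number of rounds the agent needed, or None if it ran out."""
    guessed: Set[str] = set()
    for round_no in range(1, max_rounds + 1):
        letter = agent(mask(puzzle, guessed), guessed).lower()
        guessed.add(letter)
        if "_" not in mask(puzzle, guessed):
            return round_no
    return None


def run_matrix(agents: Dict[str, Agent],
               puzzles: List[str]) -> Dict[str, List[Optional[int]]]:
    """Run every agent on every puzzle and collect scores in one pass."""
    return {name: [play(agent, p) for p in puzzles] for name, agent in agents.items()}


def alphabetical(masked: str, guessed: Set[str]) -> str:
    """Toy agent: guess a, b, c, ... in order."""
    return next(c for c in "abcdefghijklmnopqrstuvwxyz" if c not in guessed)


def frequency_order(masked: str, guessed: Set[str]) -> str:
    """Toy agent: guess letters in rough English frequency order."""
    return next(c for c in "etaoinshrdlucmfwypvbgkjqxz" if c not in guessed)
```

Swapping a toy agent for a function that prompts a model leaves `play` and `run_matrix` unchanged, which is the step from a single guessing game to a harness that scores many models in one run.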
