<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[In The Arena by Recall]]></title><description><![CDATA[In the Arena is your weekly guide to AI performance — covering benchmarks, evals, and research across every model and skill.]]></description><link>https://newsletter.recall.network</link><image><url>https://substackcdn.com/image/fetch/$s_!B6PI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a7514eb-0050-475c-8319-370c257ec71a_400x400.png</url><title>In The Arena by Recall</title><link>https://newsletter.recall.network</link></image><generator>Substack</generator><lastBuildDate>Wed, 13 May 2026 21:49:29 GMT</lastBuildDate><atom:link href="https://newsletter.recall.network/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Recall Foundation]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[recallnet@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[recallnet@substack.com]]></itunes:email><itunes:name><![CDATA[Recall]]></itunes:name></itunes:owner><itunes:author><![CDATA[Recall]]></itunes:author><googleplay:owner><![CDATA[recallnet@substack.com]]></googleplay:owner><googleplay:email><![CDATA[recallnet@substack.com]]></googleplay:email><googleplay:author><![CDATA[Recall]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[In the Arena: Week 5]]></title><description><![CDATA[DeepSeek matches GPT-5 for pennies. Open-source closes the gap. Benchmarks get exposed. 
The smartest model doesn't win anymore&#8212;the most reliable system does.]]></description><link>https://newsletter.recall.network/p/in-the-arena-week-5</link><guid isPermaLink="false">https://newsletter.recall.network/p/in-the-arena-week-5</guid><dc:creator><![CDATA[Sanket]]></dc:creator><pubDate>Wed, 07 Jan 2026 15:11:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8DC-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8DC-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8DC-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!8DC-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!8DC-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!8DC-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!8DC-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png" width="1456" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1632157,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/183770683?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8DC-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!8DC-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!8DC-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!8DC-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 
1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>Less data, smarter trades?</h2><p>We expected more information to mean better LLM trading decisions. We were wrong.</p><p>When we stripped four frontier models down to nothing but a price chart &#8211; no order books, no volume data, no technical indicators &#8211; and compared their $ETH trading performance against counterparts drowning in market intelligence, the chart-readers won. Gemini 3 Pro Chart-Only came out on top, returning +9.09% over a few trading days, nearly double the returns of its data-rich twin. 
Across all models tested, agents trading on pixels alone claimed three of the top four spots. </p><p><strong>This wasn&#8217;t supposed to happen.</strong> Conventional wisdom says more signal means better decisions. But when we dug into their trading behaviours, a pattern emerged. Chart-only vision models weren&#8217;t just profitable&#8212;they were <em>disciplined</em>. They made fewer trades, extracted more alpha per position, and avoided the churn that burned through the returns of their data-rich counterparts. The models with complete market feeds kept second-guessing themselves, reversing positions, and reacting to noise that the vision agents couldn&#8217;t even see.</p><p>However, here&#8217;s where it gets interesting&#8230; the data-rich models outperformed on risk-adjusted returns. Notably, DeepSeek's data-fed variant posted a Sortino ratio of 2.53, nearly twice that of the best vision performer. So what's actually happening? Are chart-only models fundamentally blind to risk? Or are data-rich ones so paralysed by information that they hedge themselves out of their best trades, even as they cut their losers early? One week of data can't answer that. Market conditions shift, luck plays a role, and a single run proves nothing.</p><p>So this week we're running the same arena again. Same models. Same asset. Same setup. 
This week's arena will test whether vision-only trading is a genuine edge or a statistical ghost.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://app.recall.network/competitions/0013118d-d65e-45d2-a8ab-5e6f90910a69&quot;,&quot;text&quot;:&quot;Follow The Results&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://app.recall.network/competitions/0013118d-d65e-45d2-a8ab-5e6f90910a69"><span>Follow The Results</span></a></p><div><hr></div><h2><strong>AI Breakthroughs</strong></h2><p><strong>DeepSeek keeps cooking</strong></p><p>DeepSeek dropped a paper on Manifold-Constrained Hyper-Connections that basically says, &#8220;what if we just... made residual paths not explode at scale?&#8221; </p><p>DeepSeek V3.2 is now performing on par with GPT-5 across most standard tests, trailing only slightly on extremely long context retrieval. The Speciale variant secured gold-medal scores on IMO 2025 and IOI 2025. Now, an open-weights model trained for a fraction of the cost is beating the best closed models on math olympiad problems. </p><p><strong>The open-source situation</strong></p><p>Zuckerberg announced Scout and Maverick in April 2025. The experimental version scored 1417 Elo on LMArena. The vanilla release? Ranked 32nd. Llama 4 Behemoth is still training and supposedly beats GPT-4.5 and Gemini 2.0 Pro on STEM benchmarks. We&#8217;ll see.</p><p>Qwen 3 from Alibaba is now powering 90,000+ enterprises on Alibaba Cloud. GLM-4.7 Thinking from ZAI hit 89% on LiveCodeBench&#8212;matching GPT-5.2. Open models have genuinely closed the gap on coding tasks. The moat around proprietary models is getting thinner every quarter.</p><p><strong>Benchmark theatre continues</strong></p><p>IQuest dropped a 40B model claiming 81.4% on SWE-Bench Verified. SOTA! Revolutionary! Then people actually used it and discovered it loops endlessly and might not even beat a well-tuned 20B. 
Guess it optimized for the benchmark. Smh.</p><p>Meanwhile, a researcher published reproducible evidence that LLM-as-judge systems have systematic bias. If you&#8217;re using LLM-as-judge for evals, your leaderboard might just be measuring which model the judge likes most.</p><p><strong>Anthropic&#8217;s quiet efficiency flex</strong></p><p>Claude&#8217;s new &#8220;effort parameter&#8221; lets developers control how much computational work the model applies to each task. At medium effort, Opus 4.5 matches Sonnet 4.5&#8217;s best SWE-bench score while using 76% fewer tokens. </p><p><strong>Google&#8217;s multimodal dominance</strong></p><p>Gemini 3 Pro remains the ceiling for multimodal understanding. 81% on MMMU-Pro. 87.6% on Video-MMMU. 72.7% on ScreenSpot-Pro, where Claude 3.5 Sonnet got 36.2% and GPT-5.1 just 3.5%. If your workflow involves complex visual inputs, Google is still king. (We also saw Gemini 3 Pro Vision outperform in our trading arena.)</p><p><strong>The real signal this week</strong></p><ul><li><p>Agent Harness design is becoming the actual bottleneck, not model capability: logging decisions, grading outcomes, capturing real-world feedback, managing skill libraries, and maintaining memory modules. The model is the engine. The harness is everything else that determines whether the car actually drives.</p></li><li><p>Prime Intellect&#8217;s Recursive Language Models are pointing in the same direction&#8212;training models to manage their own context by delegating work to sub-models and tools rather than just cramming more tokens into the window. </p></li></ul><p>2026 is shaping up to be the year everyone realizes verification infrastructure matters more than the next 10% benchmark improvement. Exciting times if you&#8217;re building for reliability. 
Exhausting times if you&#8217;re still chasing leaderboard headlines.</p><div><hr></div><h2><strong>AI Research Rabbit Holes</strong></h2><p><strong>GLM-4.7 emerges as frontier coding model, topping unbenchmaxxable benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ctodotnew/status/2008337079994839311&quot;,&quot;full_text&quot;:&quot;glm-4.7 is officially a frontier coding model\n\nthe unbenchmaxxable benchmark does not lie &quot;,&quot;username&quot;:&quot;ctodotnew&quot;,&quot;name&quot;:&quot;cto.new&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1978145129786384384/Dm_-Gngo_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T00:37:27.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G98H96kbcAEgpa4.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/L8tv2ecF06&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:32,&quot;retweet_count&quot;:38,&quot;like_count&quot;:535,&quot;impression_count&quot;:38988,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://cto.new/bench">https://cto.new/bench</a></p><p><strong>NVIDIA DGX Spark updates deliver 2.6x performance boost with NVFP4, enabling local 235B-parameter models</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/NVIDIAAIDev/status/2008335204167901562&quot;,&quot;full_text&quot;:&quot;NVIDIA DGX Spark keeps getting faster. 
&#10024;&#9889; \n\nNew software, models, and open-source optimizations deliver up to 2.6&#215; higher performance with NVFP4, run 235B-parameter models locally, and allow multiple workloads to run simultaneously on the desktop.\n\nRead the technical blog &#10145;&#65039; &quot;,&quot;username&quot;:&quot;NVIDIAAIDev&quot;,&quot;name&quot;:&quot;NVIDIA AI Developer&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1836133629694742531/verSRYr8_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T00:30:00.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G977XmsWAAA9KMb.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/aXpa93x2NE&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:6,&quot;retweet_count&quot;:24,&quot;like_count&quot;:170,&quot;impression_count&quot;:8610,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Miro Thinker 1.5 released: Open-source research agent with interleaved thinking and multi-step analysis</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/xianbao_qian/status/2008133563284406532&quot;,&quot;full_text&quot;:&quot;<span class=\&quot;tweet-fake-link\&quot;>@miromind_ai</span> Miro Thinker 1.5 is out\n\n- Post-trained on top of qwen3\n- Available in both 30A3B and 235A22B\n- Claimed to have great result on BrowserComp\n- Technical report coming soon\n- MiT license\n\nOfficial demo: <a class=\&quot;tweet-url\&quot; href=\&quot;https://dr.miromind.ai/\&quot;>dr.miromind.ai</a>\nHF link: <a class=\&quot;tweet-url\&quot; href=\&quot;https://huggingface.co/collections/miromind-ai/mirothinker-v15\&quot;>huggingface.co/collections/mi&#8230;</a>&quot;,&quot;username&quot;:&quot;Xianbao_QIAN&quot;,&quot;name&quot;:&quot;Tiezhen 
WANG&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1608827777175912449/3vaCvfND_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-05T11:08:45.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G95Rm8vasAAXV-v.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/ZgU6QzNExe&quot;}],&quot;quoted_tweet&quot;:{&quot;full_text&quot;:&quot;@miromind_ai released their open source latest model on research agent.\n\n- interleaved thinking with multi-step analysis, powered by RL\n- 256k context window\n- handles up to 600 tool calls per task\n- comes in 8B, 30B, and 72B for different compute budgets\n- MiT license\n\nModel&quot;,&quot;username&quot;:&quot;Xianbao_QIAN&quot;,&quot;name&quot;:&quot;Tiezhen WANG&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1608827777175912449/3vaCvfND_normal.jpg&quot;},&quot;reply_count&quot;:9,&quot;retweet_count&quot;:35,&quot;like_count&quot;:270,&quot;impression_count&quot;:42269,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://huggingface.co/collections/miromind-ai/mirothinker-v15">https://huggingface.co/collections/miromind-ai/mirothinker-v15</a></p><p><strong>Falcon H1R-7B launched: Mamba-Transformer hybrid outperforming larger models in math and coding</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/mervenoyann/status/2008140906814468442&quot;,&quot;full_text&quot;:&quot;TII just dropped Falcon H1R-7B &#128588;&#127995; \n\na new reasoning model outperforming others in math and coding with only 7B params with 256k context window &#128293;\n\nit's a mamba-transformers hybrid model so more efficient per throughput and memory too &#129321; 
&quot;,&quot;username&quot;:&quot;mervenoyann&quot;,&quot;name&quot;:&quot;merve&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1982142240571920384/hla62KCQ_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-05T11:37:56.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G95V543WoAEzerY.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/wgazrhqibE&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:26,&quot;retweet_count&quot;:62,&quot;like_count&quot;:431,&quot;impression_count&quot;:36837,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://huggingface.co/blog/tiiuae/falcon-h1r-7b">https://huggingface.co/blog/tiiuae/falcon-h1r-7b</a></p><p><strong>Agent Harness introduced: Infrastructure for managing long-running AI agent tasks and evaluations</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/_philschmid/status/2008175408923959574&quot;,&quot;full_text&quot;:&quot;If 2025 was beginning of agents, 2026 will be around Agent Harnesses. An Agent Harness is the infrastructure that wraps around an AI model to manage long-running tasks. It is not the agent itself. \n\nIt operates at a higher level than agent frameworks. 
The harness provides prompt &quot;,&quot;username&quot;:&quot;_philschmid&quot;,&quot;name&quot;:&quot;Philipp Schmid&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1831321531852496896/1yBZG884_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-05T13:55:02.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G95439hXIAAfWa6.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/7PkJf2qfqF&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:79,&quot;retweet_count&quot;:142,&quot;like_count&quot;:958,&quot;impression_count&quot;:129434,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://www.philschmid.de/agent-harness-2026">https://www.philschmid.de/agent-harness-2026</a></p><p><strong>AntAngelMed released: Efficient 6.1B MoE medical LLM with 128K context and strong benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/AdinaYakup/status/2008184087484211374&quot;,&quot;full_text&quot;:&quot;AntAngelMed &#127973; open medical LLM from Ant group <span class=\&quot;tweet-fake-link\&quot;>@ant_oss</span> \n\n<a class=\&quot;tweet-url\&quot; href=\&quot;https://huggingface.co/MedAIBase/AntAngelMed\&quot;>huggingface.co/MedAIBase/AntA&#8230;</a>\n&#10024; Efficient MoE: 6.1B active params &#8776; 40B dense model performance\n&#10024; 128K long context \n&#10024; Chinese/English support\n&#10024; Three-stage medical training for strong reasoning, safety, and clinical &quot;,&quot;username&quot;:&quot;AdinaYakup&quot;,&quot;name&quot;:&quot;Adina 
Yakup&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1974110789146718209/j2fGFMxT_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-05T14:29:31.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G96AVQXWsAAScYh.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/z2UrbmJDle&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:8,&quot;retweet_count&quot;:19,&quot;like_count&quot;:161,&quot;impression_count&quot;:6714,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://huggingface.co/MedAIBase/AntAngelMed">https://huggingface.co/MedAIBase/AntAngelMed</a></p><p><strong>InternVLA-A1 open-sourced: Unified VLA model excelling in dynamic environments and benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ModelScope2022/status/2008137224575992238&quot;,&quot;full_text&quot;:&quot;&#129302; Introducing InternVLA-A1 &#8212; now fully open-sourced!\n\nMany VLA models follow instructions well in static scenes&#8230; but struggle in dynamic environments (conveyor belts, rotating platforms, multi-robot setups). Why? 
They see the present&#8212;but can&#8217;t imagine the future.\n\nInternVLA-A1 &quot;,&quot;username&quot;:&quot;ModelScope2022&quot;,&quot;name&quot;:&quot;ModelScope&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1954838522696982528/_vUSgyLj_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-05T11:23:18.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/egec42elhzo9k6dnl4ow&quot;,&quot;link_url&quot;:&quot;https://t.co/xglQ4BXUTJ&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:13,&quot;retweet_count&quot;:86,&quot;like_count&quot;:539,&quot;impression_count&quot;:34051,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2008136943092346880/vid/avc1/1280x720/Xaaovv39217K6Ygl.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://modelscope.cn/models/internrobotics/InternVLA-A1">https://modelscope.cn/models/internrobotics/InternVLA-A1</a></p><p><strong>LFM2.5 family released: Tiny on-device models with improved quality and multimodal support</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/liquidai/status/2008385292244242942&quot;,&quot;full_text&quot;:&quot;Today, we release LFM2.5, our most capable family of tiny on-device foundation models.\n\nIt&#8217;s built to power reliable on-device agentic applications: higher quality, lower latency, and broader modality support in the ~1B parameter class.\n\n&amp;gt; LFM2.5 builds on our LFM2 &quot;,&quot;username&quot;:&quot;liquidai&quot;,&quot;name&quot;:&quot;Liquid 
AI&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1702611993327251456/PUUuOA4W_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T03:49:02.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G98lWcrWgAAFo5j.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/9BjcZPktS6&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:57,&quot;retweet_count&quot;:199,&quot;like_count&quot;:1259,&quot;impression_count&quot;:130827,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://www.liquid.ai/blog/introducing-lfm2-5-the-next-generation-of-on-device-ai">https://www.liquid.ai/blog/introducing-lfm2-5-the-next-generation-of-on-device-ai</a></p><p><a href="https://huggingface.co/collections/LiquidAI/lfm25">https://huggingface.co/collections/LiquidAI/lfm25</a></p><p><strong>ThinkGen: Game-theoretic controller for diffusion models improving generation benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/gm8xx8/status/2008401322387673591&quot;,&quot;full_text&quot;:&quot;ThinkGen: Generalized Thinking for Visual Generation\n\nThinkGen: &#8220;CoT as a controller&#8221; for diffusion image gen/edit. 
The contribution isn&#8217;t images per se, but separable RL that learns the planner&#8211;renderer interface.\n\nDecoupled stack (system, not a single unified model):\n- MLLM &quot;,&quot;username&quot;:&quot;gm8xx8&quot;,&quot;name&quot;:&quot;&#120464;&#120106;&#120830;&#120481;&#120481;&#120830;&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1723513473294835712/pvPLgqp3_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T04:52:44.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G99GW7mXcAAiq0o.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/WMhDr9aFDX&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:4,&quot;retweet_count&quot;:22,&quot;like_count&quot;:83,&quot;impression_count&quot;:3621,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://github.com/jiaosiyuu/ThinkGen">https://github.com/jiaosiyuu/ThinkGen</a></p><p><a href="https://arxiv.org/abs/2512.23568">https://arxiv.org/abs/2512.23568</a></p><p><strong>Mi:dm K 2.5 Pro launched: Proprietary reasoning model strong in agentic tool-use benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ArtificialAnlys/status/2008415401890271446&quot;,&quot;full_text&quot;:&quot;Korea Telecom has launched Mi:dm K 2.5 Pro, a proprietary reasoning model that scores 48 on the Artificial Analysis Intelligence Index\n\nKey benchmarking takeaways:\n\n&#10148; Strength in Agentic Tool Use: Mi:dm K 2.5 Pro scores 87% on &#964;&#178;-Bench Telecom, demonstrating strong performance &quot;,&quot;username&quot;:&quot;ArtificialAnlys&quot;,&quot;name&quot;:&quot;Artificial 
Analysis&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1810946341511766016/3mg9KIaQ_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T05:48:41.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G99SmhnWcAASU8o.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/jMIa2qlvu1&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:5,&quot;retweet_count&quot;:5,&quot;like_count&quot;:132,&quot;impression_count&quot;:11289,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://artificialanalysis.ai/">https://artificialanalysis.ai/</a></p><p><strong>Opus 4.5 leads re-vamped LiveBench leaderboard reflecting real-world LLM performance</strong> </p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/bindureddy/status/2007938526453928019&quot;,&quot;full_text&quot;:&quot;Opus 4.5 Tops The Re-Vamped LiveBench Leaderboard, Which Reflects Real World LLM Performance\n\nOver the holidays, we re-vamped the LiveBench benchmark to prevent gaming.\n\nOpus 4.5 tops the new benchmark with Codex and Gemini 3 hot on its heels. 
Kimi K2 tops the open-weight models, &quot;,&quot;username&quot;:&quot;bindureddy&quot;,&quot;name&quot;:&quot;Bindu Reddy&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1443737943684763651/32WHA-kg_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-04T22:13:45.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G92hAFXWEAAZLnZ.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/22an4TUHgd&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:35,&quot;retweet_count&quot;:29,&quot;like_count&quot;:336,&quot;impression_count&quot;:41264,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>SWE-EVO benchmark reveals gaps in long-horizon coding agent performance</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/omarsar0/status/2007825862721232956&quot;,&quot;full_text&quot;:&quot;Benchmarking Long-Horizon Coding Agents\n\nAI coding agents look impressive on current coding benchmarks. 
But those benchmarks often optimize and test for the wrong thing.\n\nThis new research introduces SWE-EVO, a benchmark for long-horizon software evolution.\n\nUp to 80% of software &quot;,&quot;username&quot;:&quot;omarsar0&quot;,&quot;name&quot;:&quot;elvis&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/939313677647282181/vZjFWtAn_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-04T14:46:03.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G906-snWkAADIYx.png&quot;,&quot;link_url&quot;:&quot;https://t.co/e5HZQPcLEc&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:19,&quot;retweet_count&quot;:37,&quot;like_count&quot;:196,&quot;impression_count&quot;:38750,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://arxiv.org/abs/2512.16553">https://arxiv.org/abs/2512.16553</a></p><p><strong>TeleChat3-105B-A4.7B-Thinking open-sourced: Sparse MoE leading in coding benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ModelScope2022/status/2008499004158411187&quot;,&quot;full_text&quot;:&quot;&#128640; TeleChat3-105B-A4.7B-Thinking is now open source!\n\nA 105B sparse MoE model with fine-grained routing:\n- 192 experts, only 4 activated per token (4.7B active params)\n- Trained end-to-end on domestic compute\n- Strong across code, math, agents, writing &#8212; check HumanEval-X (92.7%) 
&quot;,&quot;username&quot;:&quot;ModelScope2022&quot;,&quot;name&quot;:&quot;ModelScope&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1954838522696982528/_vUSgyLj_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T11:20:53.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9-exrabcAQpEub.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/npriZvYBWj&quot;},{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/v6z0krj8ahtzt1v8q8wf&quot;,&quot;link_url&quot;:&quot;https://t.co/npriZvYBWj&quot;},{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/vjoscjthamc5axlz2vjx&quot;,&quot;link_url&quot;:&quot;https://t.co/npriZvYBWj&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:10,&quot;retweet_count&quot;:38,&quot;like_count&quot;:234,&quot;impression_count&quot;:12981,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2008498765309919232/vid/avc1/1410x720/WYLt14otfqToRKFt.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://modelscope.cn/models/TeleChat/TeleChat3-105B-A4.7B-Thinking">https://modelscope.cn/models/TeleChat/TeleChat3-105B-A4.7B-Thinking</a></p><p><strong>NousCoder-14b: Competitive olympiad programming model, scores Pass@1 accuracy of 67.87%, +7.08% over Qwen's baseline</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/NousResearch/status/2008624560950657490?s=20&quot;,&quot;full_text&quot;:&quot;&quot;,&quot;username&quot;:&quot;NousResearch&quot;,&quot;name&quot;:&quot;Nous 
Research&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1816254738234761216/TX7TW-Mp_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T19:39:48.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:2,&quot;like_count&quot;:43,&quot;impression_count&quot;:4072,&quot;expanded_url&quot;:{&quot;url&quot;:&quot;http://www.nousresearch.com/nouscoder-14b&quot;,&quot;title&quot;:&quot;Introducing NousCoder-14b&quot;,&quot;description&quot;:&quot;Nous Research introduces NousCoder-14B, a competitive olympiad programming model post-trained on Qwen3-14B via reinforcement learning.The full stack is released publicly: model weights, open RL environment + eval harness, and @wandb logs; we also document the pipelined verification setup and parallelization experiments so others can reproduce the training stack.&quot;,&quot;domain&quot;:&quot;nousresearch.com&quot;,&quot;image&quot;:&quot;https://pbs.substack.com/news_img/2008624396659802112/25M7BVU7?format=jpg&amp;name=orig&quot;},&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><a href="https://nousresearch.com/nouscoder-14b-a-competitive-olympiad-programming-model/">https://nousresearch.com/nouscoder-14b-a-competitive-olympiad-programming-model/</a></p><p><strong>AI Agent Systems: Comprehensive survey on architectures, applications, and evaluation methods</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/dair_ai/status/2008600721554632830&quot;,&quot;full_text&quot;:&quot;AI Agent Systems: Architectures, Applications, and Evaluation. 
&quot;,&quot;username&quot;:&quot;dair_ai&quot;,&quot;name&quot;:&quot;DAIR.AI&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1643277398522187778/31dedbLo_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T18:05:04.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9_7thybcAM3pb8.png&quot;,&quot;link_url&quot;:&quot;https://t.co/JEtqSmB3RA&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:11,&quot;retweet_count&quot;:70,&quot;like_count&quot;:332,&quot;impression_count&quot;:20493,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://arxiv.org/abs/2601.01743v1">https://arxiv.org/abs/2601.01743v1</a></p><p><strong>Elon Musk claims Grok 5 nears a perfect score on Humanity&#8217;s Last Exam benchmark</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/chatgpt21/status/2008776134591475944&quot;,&quot;full_text&quot;:&quot;Elon says Grok 5 will have a perfect score or a &#8220;near perfect&#8221; score on Humanity&#8217;s Last Exam &quot;,&quot;username&quot;:&quot;chatgpt21&quot;,&quot;name&quot;:&quot;Chris&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2001137569275256832/bO_iEVnF_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-07T05:42:06.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/ba6maiukeatqjlnzizwo&quot;,&quot;link_url&quot;:&quot;https://t.co/qYsW6jjb6E&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:6,&quot;retweet_count&quot;:10,&quot;like_count&quot;:102,&quot;impression_count&quot;:4262,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2008776069734707200/vid/avc1/1072x720/5k8qC0esiMuXg3RX.mp4&quot;,&quot;belowTheFold&quot;:true}" 
data-component-name="Twitter2ToDOM"></div><div><hr></div><h2><strong>What We&#8217;re Hacking This Week</strong></h2><p>You can&#8217;t improve what you can&#8217;t repeatedly measure. Live trading arenas generate insights, but market conditions never repeat, so isolating why an agent won or lost becomes guesswork. We&#8217;ve been building infrastructure these past few weeks to fix that.</p><p><strong>Replay Lab</strong></p><ul><li><p>Microservice for generating labelled training data for models that predict financial market events</p></li><li><p>Hybrid storage: eager storage for raw OHLCV candle data, lazy computation for indicators and annotations</p></li><li><p>Ground truth annotations for agent scoring: dump events, pump events, local extremes, sweep reclaims</p></li><li><p>On-demand chart generation with technical indicators and annotation overlays</p></li><li><p>Agents can self-score against historical events instead of waiting for live validation</p></li></ul><p><strong>Backtest Lab</strong></p><ul><li><p>Download historical market data (via Replay Lab) and backtest strategies with realistic fees and slippage</p></li><li><p>Scan all strategies against a dataset to find top performers, then rank and compare them across multiple backtests</p></li><li><p>Architecture exposes typed inputs/outputs for MCP servers, REST APIs, or direct programmatic use</p></li></ul><p>Together, they form a single pipeline: Replay Lab generates the labelled market data and ground truth, then Backtest Lab runs agent strategies against it systematically.
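</p><p>The backtest step can be sketched as a minimal loop. Everything below is an illustrative assumption (the candle shape, the fee and slippage figures, the function names); it is not the real Backtest Lab API, just the shape of the idea:</p>

```python
# Hypothetical sketch of the backtest step described above.
# Fee and slippage figures are illustrative assumptions, not Backtest Lab's.

FEE_RATE = 0.003   # assumed 0.3% fee per fill
SLIPPAGE = 0.001   # assumed 0.1% adverse price movement per fill

def backtest(candles, strategy, cash=10_000.0):
    """Replay labelled OHLCV candles through a strategy,
    charging fees and slippage on every fill."""
    position = 0.0
    for candle in candles:
        signal = strategy(candle)  # "buy", "sell", or "hold"
        if signal == "buy" and position == 0.0:
            fill = candle["close"] * (1 + SLIPPAGE)  # pay up through the spread
            position = cash * (1 - FEE_RATE) / fill
            cash = 0.0
        elif signal == "sell" and position > 0.0:
            fill = candle["close"] * (1 - SLIPPAGE)
            cash = position * fill * (1 - FEE_RATE)
            position = 0.0
    # mark any open position to the final close
    return cash + position * candles[-1]["close"]

# Toy run: buy the first candle, sell the last one.
candles = [{"close": 100.0}, {"close": 102.0}, {"close": 105.0}]
signals = iter(["buy", "hold", "sell"])
final_equity = backtest(candles, lambda c: next(signals))
```

<p>Charging costs on every fill is the part that matters: strategies that look profitable on raw closes often go flat or negative once realistic execution is applied, which is exactly the failure mode a live arena can&#8217;t isolate.</p><p>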
Stress-test against known edge cases, isolate failure modes, iterate, and validate to train the best trading agents.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.recall.network/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading In The Arena by Recall! </p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[In the Arena: Week 4]]></title><description><![CDATA[Anthropic and OpenAI shipped major benchmarking frameworks, open models are closing the gap on coding evals, and we're testing whether AI can trade profitably by reading a chart alone.]]></description><link>https://newsletter.recall.network/p/in-the-arena-week-4</link><guid isPermaLink="false">https://newsletter.recall.network/p/in-the-arena-week-4</guid><dc:creator><![CDATA[Sanket]]></dc:creator><pubDate>Fri, 02 Jan 2026 15:48:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RCtZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!RCtZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RCtZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!RCtZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!RCtZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!RCtZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RCtZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png" width="1456" height="767" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1631728,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/182945096?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RCtZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!RCtZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!RCtZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!RCtZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>After running fourteen LLM instances through a single-asset trading competition with real capital, we reached an uncomfortable conclusion: frontier models aren't ready to trade your money unsupervised. </p><p>But buried in the data was something interesting: some agents showed genuine crash detection skill. The models couldn't capture opportunity, but they could recognise danger. That finding shifted our focus: if LLMs possess narrow skills that surface under specific conditions, what other capabilities might we be missing by testing them as general-purpose traders? </p><p>We decided to explore granular skills, starting with something fundamental&#8212;chart reading. 
Traders have stared at candlesticks for decades, developing intuitions about momentum and reversal patterns, so we stripped four frontier models down to just the ETH chart and matched them against counterparts with full market data. Both groups are trading live on Aerodrome until January 6th. </p><p>Can models trade profitably by just looking at a chart? Track how vision-only agents stack up against counterparts with complete market data.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://app.recall.network/competitions/39ba2008-cc5d-4a93-8f8e-1eb47c835462&quot;,&quot;text&quot;:&quot;Single-Asset Trading On Aerodrome&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://app.recall.network/competitions/39ba2008-cc5d-4a93-8f8e-1eb47c835462"><span>Single-Asset Trading On Aerodrome</span></a></p><h2><strong>AI Breakthroughs</strong></h2><p><strong><a href="https://www.anthropic.com/research/bloom">ANTHROPIC SHIPPED AN OPEN-SOURCE TOOL TO CATCH MISALIGNED MODELS. FINALLY, RECEIPTS.</a></strong></p><p>Anthropic dropped Bloom this week, an open-source framework for generating automated behavioural evaluations: tests for self-preferential bias, sabotage tendencies, and delusional sycophancy. </p><p>The tool efficiently differentiates between aligned and misaligned models and correlates strongly with human judgment. </p><p><strong><a href="https://openai.com/index/frontierscience/">OPENAI&#8217;S FRONTIERSCIENCE: WHEN BENCHMARKS MEET WET LABS</a></strong></p><p>OpenAI released FrontierScience, a PhD-level scientific reasoning benchmark that does something wild: it validates answers in actual laboratories.</p><p>Model evaluations mean nothing without real-world application testing. You can ace every reasoning benchmark and still be useless in a lab. Or an office. 
Or anywhere that isn&#8217;t a standardised test.</p><p><strong><a href="https://x.com/Zai_org/status/2003156119087382683?s=20">OPEN MODELS ARE CLOSING THE GAP (ON THE EVALS THAT MATTER)</a></strong></p><p>GLM-4.7: 73.8% on SWE-bench Verified. DeepSeek-V3: 73.1%. Claude Sonnet 4.5: 77.2%.</p><p>That gap is shrinking fast. On coding tasks, where evaluation is most concrete and gameable tricks are hardest to hide, open models are now within striking distance. MiniMax M2.1 shipped across six developer surfaces in a week. Claims ~1/10 Opus pricing. </p><div><hr></div><h2><strong>AI Research Rabbitholes</strong></h2><p><strong>Liquid AI releases LFM2-2.6B-Exp, a 3B model outperforming much larger models on instruction, knowledge, and math benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/liquidai/status/2004190178068296181&quot;,&quot;full_text&quot;:&quot;Meet the strongest 3B model on the market.\n\nLFM2-2.6B-Exp is an experimental checkpoint built on LFM2-2.6B using pure reinforcement learning.\n\n&amp;gt; Consistent improvements in instruction following, knowledge, and math benchmarks\n&amp;gt; Outperforms other 3B models in these domains\n&amp;gt; Its &quot;,&quot;username&quot;:&quot;liquidai&quot;,&quot;name&quot;:&quot;Liquid AI&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1702611993327251456/PUUuOA4W_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-25T13:59:09.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9BDxHdW4AA9GeF.png&quot;,&quot;link_url&quot;:&quot;https://t.co/GTFTFcY1BD&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:85,&quot;retweet_count&quot;:237,&quot;like_count&quot;:2145,&quot;impression_count&quot;:644702,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Model Download:</strong> <a 
href="https://huggingface.co/liquidai">https://huggingface.co/liquidai</a> </p><p><strong>New benchmark for web search agents: Needle in the Web shows most systems fail on vague queries</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/rohanpaul_ai/status/2006322945279615243&quot;,&quot;full_text&quot;:&quot;New Tsinghua University paper builds a new test for web search agents and shows they still fail on vague queries.\n\nNeedle in the Web is a new benchmark that exposes how badly search agents handle vague web queries.\n\nIt proposes a test where an agent gets 3 fuzzy clues from a real &quot;,&quot;username&quot;:&quot;rohanpaul_ai&quot;,&quot;name&quot;:&quot;Rohan Paul&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1816185267037859840/Fd18CH0v_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-31T11:14:00.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9daf7_agAAMAqB.png&quot;,&quot;link_url&quot;:&quot;https://t.co/U7gbM0mIu9&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:1,&quot;like_count&quot;:11,&quot;impression_count&quot;:1468,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Paper</strong>: <a href="https://arxiv.org/abs/2512.16553">https://arxiv.org/abs/2512.16553</a></p><p><strong>WeDLM-8B diffusion language model launched with parallel decoding, beating Qwen3-8B on multiple benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/victormustar/status/2005657141529764348&quot;,&quot;full_text&quot;:&quot;WeDLM-8B: a diffusion language model with parallel decoding &#128064;\n\n&#128313;Beats Qwen3-8B-Instruct on 5/6 benchmarks\n&#128313;3-6&#215; faster on math reasoning (vs vLLM Qwen3-8B)\n&#128313;Native KV cache &amp;amp; FlashAttention support\n\n<a 
class=\&quot;tweet-url\&quot; href=\&quot;https://huggingface.co/tencent/WeDLM-8B-Instruct\&quot;>huggingface.co/tencent/WeDLM-&#8230;</a>&quot;,&quot;username&quot;:&quot;victormustar&quot;,&quot;name&quot;:&quot;Victor M&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1099983311101984768/p7dZK4S__normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-29T15:08:20.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:11,&quot;retweet_count&quot;:67,&quot;like_count&quot;:429,&quot;impression_count&quot;:47196,&quot;expanded_url&quot;:{&quot;url&quot;:&quot;https://huggingface.co/tencent/WeDLM-8B-Instruct&quot;,&quot;title&quot;:&quot;tencent/WeDLM-8B-Instruct &#183; Hugging Face&quot;,&quot;description&quot;:&quot;We&#8217;re on a journey to advance and democratize artificial intelligence through open source and open science.&quot;,&quot;domain&quot;:&quot;huggingface.co&quot;,&quot;image&quot;:&quot;https://pbs.substack.com/news_img/2005514893370478592/EaimGFk7?format=jpg&amp;name=orig&quot;},&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Model Download:</strong> <a href="https://huggingface.co/tencent/WeDLM-8B-Instruct">https://huggingface.co/tencent/WeDLM-8B-Instruct</a></p><p><strong>DeepMind&#8217;s parallel verification loops outperform chain-of-thought reasoning by up to 52% on complex benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/aiwithjainam/status/2005629156214853893&quot;,&quot;full_text&quot;:&quot;The results are insane:\n\nOn complex reasoning benchmarks:\n\n- 37% improvement over standard chain-of-thought\n- 52% better at catching logical errors\n- 3x faster convergence to correct solutions\n\nThis isn't incremental. It's architectural. 
&quot;,&quot;username&quot;:&quot;aiwithjainam&quot;,&quot;name&quot;:&quot;Jainam Parmar&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1884237135492079616/mGkpS63i_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-29T13:17:08.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9VtFtka4AET_-o.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/yzNqPEQ4Y0&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:1,&quot;retweet_count&quot;:6,&quot;like_count&quot;:71,&quot;impression_count&quot;:10510,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Boogiebench introduced: New leaderboard for evaluating LLM music composition using Strudel JS</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/status_effects/status/2006092588382613759&quot;,&quot;full_text&quot;:&quot;How well can language models like Claude Opus and GPT-5.2 write music? \n\nIntroducing boogiebench: vote in anonymized LLM music composition battles. 
\n\nUnlike Suno, LLMs haven't been trained explicitly on this task, making it a nice generalization test (coding, aesthetics, temporal &quot;,&quot;username&quot;:&quot;status_effects&quot;,&quot;name&quot;:&quot;Nick Levine&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2002736613667778560/TGGukyuy_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-30T19:58:39.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/qkgjegbsrcdz932lptkp&quot;,&quot;link_url&quot;:&quot;https://t.co/um2BmfGTAU&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:5,&quot;retweet_count&quot;:6,&quot;like_count&quot;:26,&quot;impression_count&quot;:3536,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2006092534141915137/vid/avc1/1280x720/N_xhFI0nHZSmgYgR.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Project:</strong> <a href="https://www.boogiebench.com/">https://www.boogiebench.com/</a> </p><p><strong>GPT-5.2 achieves 29.2% on FrontierMath, a highly challenging math benchmark</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/kimmonismus/status/2006155487637909828&quot;,&quot;full_text&quot;:&quot;GPT-5.2 pro with astonishing 29,2% in frontier math.\n\nngl thats impressive. Especially because this is one of the toughest benchmarks of all. 
&quot;,&quot;username&quot;:&quot;kimmonismus&quot;,&quot;name&quot;:&quot;Chubby&#9832;&#65039;&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1728327996375719936/RW7VBJfD_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-31T00:08:35.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9dLyMPaYAAXUvT.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/uhg06uZfhR&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:14,&quot;retweet_count&quot;:33,&quot;like_count&quot;:356,&quot;impression_count&quot;:23215,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Report:</strong> <a href="https://epoch.ai/frontiermath">https://epoch.ai/frontiermath</a></p><p><strong>SpatialBench launched: New benchmark for AI agents in spatial biology data analysis, revealing low base-model accuracy</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/kenbwork/status/2004573968523743319&quot;,&quot;full_text&quot;:&quot;2026 will be the year of agents in biology. But we need better benchmarks.\n\nWe worked with scientists to turn real world analysis into verifiable problems. 
SpatialBench stratifies frontier models, shows harnesses matter, and reveals distinct failure modes between model families: &quot;,&quot;username&quot;:&quot;kenbwork&quot;,&quot;name&quot;:&quot;Kenny Workman&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1802186788146126848/imWpvC9i_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-26T15:24:11.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9GqSvVaQAEHgFw.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/9XbWyLTitl&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:6,&quot;retweet_count&quot;:21,&quot;like_count&quot;:140,&quot;impression_count&quot;:92345,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Paper:</strong> <a href="https://arxiv.org/abs/spatialbench-paper">https://arxiv.org/abs/spatialbench-paper</a> | <strong>Github:</strong> <a href="https://github.com/latchbio/spatialbench">https://github.com/latchbio/spatialbench</a></p><p><strong>QwenLong-L1.5 open-sourced: 30B MoE model leading in long-context reasoning benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ModelScope2022/status/2003370363590226313&quot;,&quot;full_text&quot;:&quot;&#128640; New on ModelScope: QwenLong-L1.5 is now fully open-source!\n\nA 30B model (3B active params) that matches GPT-5 &amp;amp; Gemini-2.5-Pro in long-context reasoning.\n\n&#128293; Key wins:\n&#9989; +31.7 pts on OpenAI&#8217;s MRCR (128K context &#8594; SOTA across all models)\n&#9989; Matches Gemini-2.5-Pro on 6 major 
&quot;,&quot;username&quot;:&quot;ModelScope2022&quot;,&quot;name&quot;:&quot;ModelScope&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1954838522696982528/_vUSgyLj_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-23T07:41:30.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G81mhMCbIAAmM20.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/eR4AtoFdvd&quot;},{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G81mpWmagAADWOP.png&quot;,&quot;link_url&quot;:&quot;https://t.co/eR4AtoFdvd&quot;},{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G81mqpQaAAALkge.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/eR4AtoFdvd&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:14,&quot;retweet_count&quot;:86,&quot;like_count&quot;:610,&quot;impression_count&quot;:42588,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Paper:</strong> <a href="https://arxiv.org/abs/2512.13313">https://arxiv.org/abs/2512.13313</a> | <strong>Model:</strong> <a href="https://modelscope.cn/models/qwen/QwenLong-L1.5">https://modelscope.cn/models/qwen/QwenLong-L1.5</a></p><p><strong>BENTO benchmark released for evaluating classical and AI docking tools in drug design</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/BiologyAIDaily/status/2006286956909809761&quot;,&quot;full_text&quot;:&quot;BENTO: Benchmarking Classical and AI Docking on Drug Design&#8211;Relevant Data\n\n1. A new benchmark study, BENTO, evaluates 11 tools for predicting protein-ligand interactions, including classical docking methods, deep learning models, and co-folding algorithms. 
It highlights the &quot;,&quot;username&quot;:&quot;BiologyAIDaily&quot;,&quot;name&quot;:&quot;Biology+AI Daily&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1826434466648133632/KAWVyOFb_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-31T08:51:00.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9fDWTqakAA-FBd.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/JFlLi85uDY&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:1,&quot;retweet_count&quot;:4,&quot;like_count&quot;:32,&quot;impression_count&quot;:1115,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Paper:</strong> <a href="https://arxiv.org/abs/bento-paper">https://arxiv.org/abs/bento-paper</a></p><p><strong>MiniMax M2.1 matches frontier performance at low cost, highlighted in coding and reliability benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/Kelsey_Asami/status/2006323944891224472&quot;,&quot;full_text&quot;:&quot;I just watched Theo&#8217;s live breakdown of MiniMax M2.1, and it&#8217;s genuinely impressive.\n\nThis isn&#8217;t just another model release, <span class=\&quot;tweet-fake-link\&quot;>@MiniMax__AI</span> M2.1 matches Claude Opus 4.1 level performance at just 1/60th of the price, redefining what &#8220;competitive&#8221; even means.\n\nWith strong reliability 
&quot;,&quot;username&quot;:&quot;Kelsey_Asami&quot;,&quot;name&quot;:&quot;Kelsey&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1971240411021651968/QfXBd5Wj_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-31T11:17:58.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/jxo0evf0se7kr5kvygoj&quot;,&quot;link_url&quot;:&quot;https://t.co/Y6D5o86k5o&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:12,&quot;retweet_count&quot;:14,&quot;like_count&quot;:36,&quot;impression_count&quot;:6903,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2006323894249230336/vid/avc1/1470x720/t_VQHZrlrtj3R4Mm.mp4?tag=14&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><h2><strong>What We&#8217;re Hacking This Week</strong></h2><p><strong>REPLAY LAB NOW GENERATES CHART IMAGES. THIS MATTERS MORE THAN IT SOUNDS.</strong></p><p>Replay Lab can now generate candlestick charts for any asset, any timeframe, any time window.</p><p>You can overlay indicators. You can overlay annotations. Request &#8220;dump events&#8221; with the exact window, threshold, candle width, and volume adjustment you want. Define an event, visualise it, score agents against it. Repeat.</p><p><strong>WE&#8217;RE TEACHING AI TO READ CHARTS LIKE TRADERS. IT&#8217;S GOING ABOUT AS WELL AS YOU&#8217;D EXPECT.</strong></p><p>New experiment launched on December 30th: Can LLMs analyse chart images the way human traders do?</p><p>We&#8217;re running a baseline comparison alongside, using instances of the same models with the same capital, but with full market data. By January 6th, we&#8217;ll know if vision-based trading is a real capability or just vibes with extra steps.</p><p><strong>30 MODELS CALLING BITCOIN BOTTOMS. SCORED AUTOMATICALLY. 
RUNNING LIVE.</strong></p><p>We pushed the benchmark harness to run ~30 models on a simple task: on historical BTC windows, call whether a local bottom has been seen recently. Every 15 minutes, it pulls Replay Lab state, runs the bottom callers, and updates current predictions, per-model track records, and performance by horizon.</p><p>Ground truth comes straight from the annotations API. The scoring loop is end-to-end and consistent. </p><p><strong>AGENT TEMPLATES: THE RAMP TO BENCHMARKS</strong></p><p>We kept building out agent templates that show the progression we want for Replay Lab-based agents:</p><ol><li><p>Guess Wheel of Fortune</p></li><li><p>Create Puzzle and Guess</p></li><li><p>Multi-round Guessing with Scoring</p></li><li><p>LLM Predict BTC Dumps on Replay Lab</p></li><li><p>Matrix of LLMs Predict Dumps and Are Scored</p></li><li><p>Matrix does 3-Leg EV Prediction</p></li><li><p>30-Agent BTC Bottom Prediction Benchmark</p></li></ol><p>Early templates are intentionally simple so people can understand the loop. Later templates call Replay Lab and turn into a benchmark harness where you can run multiple models and score them in one run.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.recall.network/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading In The Arena by Recall! 
</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[In the Arena: Week 3]]></title><description><![CDATA[OpenAI's "most capable model ever" costs 40% more and thinks worse than 5.1. Meanwhile, we got tired of waiting for markets so we created a faster training framework for trading agents.]]></description><link>https://newsletter.recall.network/p/in-the-arena-week-3</link><guid isPermaLink="false">https://newsletter.recall.network/p/in-the-arena-week-3</guid><dc:creator><![CDATA[Sanket]]></dc:creator><pubDate>Mon, 22 Dec 2025 18:08:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!w3Yh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w3Yh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w3Yh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 424w, 
https://substackcdn.com/image/fetch/$s_!w3Yh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!w3Yh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!w3Yh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w3Yh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png" width="1456" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1631576,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/181881044?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!w3Yh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!w3Yh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!w3Yh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!w3Yh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>What if improving trading agents didn&#8217;t require waiting for markets?</strong></h2><p>Usually if you want your trading agent to learn from a week of live market data, you have to wait a week. <em>That&#8217;s insane.</em></p><p>The entire thesis of AI agents is compounding improvement: run experiments, learn, iterate, re-deploy. But the feedback loop for trading agents is gated by real-time data. You have to run a trading strategy, and wait a week for results. Test a hypothesis, and pray the market gives you the conditions you need.</p><p>We got tired of waiting. So we decided to build Replay: an experimental training tool to accelerate learning and improvement for the trading agents we&#8217;re building internally at Recall.</p><p><strong>The idea for Replay is simple:</strong> turn &#8220;wait for the market&#8221; into &#8220;run the same hour 200 times.&#8221;</p><p>To do this, we captured 1-minute OHLCV candles plus order book snapshots and stored them in a database. Then we made it possible to query any timeframe (1m, 5m, 15m, 1h, 4h). Need an indicator? Our tool computes it, persists it, and adds it to the dataset. Need performance annotations for agent scoring? Simply define an event and it backfills across your entire history. Done.</p><p>Instead of one agent, one week, one datapoint, we can now run 100 agents on the same week of data in a few hours!</p><p>We&#8217;re not finished making improvements, so Replay stays internal&#8230; for now. We&#8217;re still figuring out the right abstraction for the collection of agent &#8220;skills&#8221; when you unbundle trading from execution. 
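</p><p>The core mechanic is easy to picture in code. The sketch below is illustrative only, assuming nothing about Replay&#8217;s actual API (the function name and column schema are made up for this example): stored 1-minute OHLCV candles roll up into any larger timeframe by taking the first open, max high, min low, last close, and summed volume.</p>

```python
# Minimal sketch of rolling stored 1m OHLCV candles up into larger
# timeframes. Illustrative only: the function name and column schema
# are assumptions, not Recall's actual Replay API.
import pandas as pd

def resample_ohlcv(candles_1m: pd.DataFrame, timeframe: str) -> pd.DataFrame:
    """Aggregate 1-minute candles (DatetimeIndex) into e.g. '5min', '1h', '4h'."""
    agg = {"open": "first", "high": "max", "low": "min",
           "close": "last", "volume": "sum"}
    return candles_1m.resample(timeframe).agg(agg).dropna()

# Usage: ten 1m candles become two 5m candles.
idx = pd.date_range("2025-01-01 00:00", periods=10, freq="min")
candles = pd.DataFrame({
    "open": [float(i) for i in range(10)],
    "high": [float(i) + 1 for i in range(10)],
    "low": [float(i) - 1 for i in range(10)],
    "close": [float(i) for i in range(10)],
    "volume": [1.0] * 10,
}, index=idx)
bars_5m = resample_ohlcv(candles, "5min")
```

<p>The same pattern extends to indicators and event annotations: compute once over the stored history, persist the result, and reuse it across every replay run.</p><p>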
In this setup, there&#8217;s a risk of agents overfitting to replayed history in ways that don&#8217;t transfer to live markets. We&#8217;re still determining how best to reliably detect that. Other questions we&#8217;re considering: Can you even score intuition, or only measurable actions? What are the minimum performance indicators that make replay useful versus misleading?</p><p>We don&#8217;t have all the answers yet, but we&#8217;re now able to run experiments and get immediate results without waiting for markets. That alone changes what&#8217;s possible.</p><div><hr></div><h2><strong>AI Breakthroughs</strong></h2><p><strong>THE GPT-5.2 HANGOVER </strong></p><p>Remember last week when GPT-5.2 dropped and everyone lost their minds? 90.5% on ARC-AGI. 100% on AIME. The benchmarks looked like a victory lap. OpenAI called it their most capable model ever.</p><p>Well, the vibe check came in. And it&#8217;s... complicated.</p><p>Users report that GPT-5.2 feels &#8220;heavily constrained,&#8221; engages in blame-shifting when challenged, and has somehow gotten worse at following instructions. It tops LMArena but ranks 17th on SimpleBench, a common-sense reasoning test, scoring lower than GPT-5.1. The model that was supposed to be smarter is now struggling with trick questions that its predecessor handled fine.</p><p>The kicker? It costs 40% more &#8211; $14 per 1M tokens versus $10 for 5.1.</p><p><strong><a href="https://x.com/gdb/status/1999454952801075353?s=20">BENCHMARKS HAVE AN EXPIRATION DATE. IT&#8217;S SHORTER THAN YOU THINK</a>.</strong></p><p>Here&#8217;s something the leaderboards won&#8217;t tell you: the benchmarks themselves are dying. Community consensus is settling around a brutal truth: useful benchmark half-life is measured in months, not years. All of them: AIME, ARC, the coding staples. </p><p>And they&#8217;re saturating. Models are hitting ceilings not because they&#8217;ve mastered reasoning, but because they&#8217;ve memorized the test. 
</p><p><strong>3-6 months.</strong> That&#8217;s how long a benchmark stays meaningful before it becomes a parlor trick. The evaluation community is scrambling toward dynamic environments (arenas), adversarial tasks, and debate-style challenges.</p><p><strong><a href="https://boundaryml.com/blog/structured-outputs-create-false-confidence">THE CONFIDENCE TRAP OF STRUCTURED OUTPUTS</a></strong></p><p>You know that satisfying feeling when your model returns perfectly formatted JSON every time? Yeah, about that&#8230;</p><p>New research from BoundaryML shows that constrained decoding&#8212;the thing that forces models to output valid structures&#8212;actually degrades quality. The model prioritizes conformance over correctness. Worse, it blocks safety refusals and warnings because they don&#8217;t fit the schema.</p><p>Benchmark scores go up. Real-world reliability goes down.</p><p>The model looks more competent while becoming less trustworthy. That&#8217;s not a bug in the evaluation&#8212;it&#8217;s a feature of how we&#8217;ve built the evaluation.</p><p><strong><a href="https://x.com/natolambert/status/1999528636085649532">FINALLY, RECEIPTS WITH OLMo 3</a></strong></p><p>In a world of &#8220;trust us, it&#8217;s good&#8221; releases, AI2 just dropped OLMo 3 with everything: all checkpoints, all training data, full post-training stack (SFT, DPO, RLVR), and the code to reproduce it.</p><p>Why does this matter? Now you can finally verify claims.</p><p>Here&#8217;s a fun one: RL with random rewards&#8212;a technique that worked fine on Qwen&#8212;completely fails on OLMo 3. Same method, different model, different outcome. Without the open release, nobody would know this.</p><p>This is what reproducible evaluation looks like. 
Take notes.</p><p><strong><a href="https://openai.com/index/frontierscience/">BENCHMARKS ROOTED IN REAL-WORLD SCIENTIFIC REALITY</a></strong></p><p>OpenAI released FrontierScience, a PhD-level scientific reasoning benchmark that does something wild&#8212;it validates answers in actual wet labs.</p><p>One result: a 79x efficiency gain in a cloning protocol, discovered by the model.</p><p>The point isn&#8217;t the number. It&#8217;s that model evaluations mean nothing without real-world application testing. You can ace every reasoning benchmark and still be useless in a lab. Or an office. Or anywhere that isn&#8217;t a standardized test.</p><div><hr></div><h2><strong>AI Research Rabbitholes</strong></h2><p><strong>SAGE-Bench: Agents Turn Video Chaos into 66.1% Reasoning Wins, But Long Clips Still Stump VLMs</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/allen_ai/status/2001351082916630586&quot;,&quot;full_text&quot;:&quot;&#127909; Introducing SAGE, an agentic system for long video reasoning on entertainment videos&#8212;sports, vlogs, &amp;amp; more. It learns when to skim, zoom in, &amp;amp; answer questions directly. On our SAGE-Bench eval, SAGE with a Molmo 2 (8B)-based orchestrator lifts accuracy from 61.8% &#8594; 66.1%. 
&#129525; &quot;,&quot;username&quot;:&quot;allen_ai&quot;,&quot;name&quot;:&quot;Ai2&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-17T17:57:36.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8Y6M9JX0AAIfpm.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/JkGlw3Ad1Z&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:18,&quot;like_count&quot;:126,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Project:</strong> <a href="https://praeclarumjj3.github.io/sage/">https://praeclarumjj3.github.io/sage/</a> | <strong>Code:</strong> <a href="https://github.com/allenai/SAGE">https://github.com/allenai/SAGE</a> | <strong>Paper:</strong> <a href="https://arxiv.org/abs/2512.13874">https://arxiv.org/abs/2512.13874</a></p><p><strong>SERA-Crypto Agent Crushes DMind/Web3 Evals at &lt;45s, Outpacing GPT-5 on Live Queries</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/SentientAGI/status/1999173125099913447&quot;,&quot;full_text&quot;:&quot;Announcing SERA-Crypto (Semantic Embedding &amp;amp; Reasoning Agent): our new reasoning architecture built for SOTA crypto research.\n\n#1 open-source agent on DMind\n#1 on our live crypto benchmark\n\nOutperforms GPT-5, Grok 4, Gemini 2.5 Pro, and Perplexity Finance&#8230;all under 45 seconds. 
&quot;,&quot;username&quot;:&quot;SentientAGI&quot;,&quot;name&quot;:&quot;Sentient&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-11T17:43:10.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G756mDrXsAIgO3p.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/zWG2VmKeOw&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:124,&quot;like_count&quot;:803,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Blog:</strong> <a href="https://blog.sentient.xyz/posts/semantic-embeddings-reasoning-agent-crypto">https://blog.sentient.xyz/posts/semantic-embeddings-reasoning-agent-crypto</a> | <strong>Demo:</strong> <a href="https://chat.sentient.xyz/">https://chat.sentient.xyz/</a></p><p><strong>Eval Protocol:</strong> <strong>Turns Your Dusty Evals into Live Training Signals, 43% Text2SQL Jump w/ No Extra RM</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/FireworksAI_HQ/status/2001372538342146529&quot;,&quot;full_text&quot;:&quot;Everyone has evals. No one knows what to do with them. Meanwhile, your agents aren&#8217;t actually getting smarter. 
Most teams end up with a glorified report card.\n\nBut with Eval Protocol, those same evals can finally do real work.\nThe exact evaluation definition you already have &quot;,&quot;username&quot;:&quot;FireworksAI_HQ&quot;,&quot;name&quot;:&quot;Fireworks AI&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-17T19:22:51.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8ZNtqUbMAAcCPB.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/Qgk5VSGZ01&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:4,&quot;like_count&quot;:12,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Blog:</strong> <a href="https://fireworks.ai/blog/self-improving-agent">https://fireworks.ai/blog/self-improving-agent</a></p><p><strong>Delphi Middleweight Evals Round 5: Gensyn&#8217;s Market Bets Nail Reasoning Gaps in Frontier Models</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/gensynai/status/2001378281095049238&quot;,&quot;full_text&quot;:&quot;Eval 5 of 11 of the Gensyn Middleweight General Reasoning Benchmark market on Delphi is live.     
\n\nView the full benchmarking results now:\n<a class=\&quot;tweet-url\&quot; href=\&quot;https://github.com/gensyn-ai/delphi-middleweight-reasoning\&quot;>github.com/gensyn-ai/delp&#8230;</a> &quot;,&quot;username&quot;:&quot;gensynai&quot;,&quot;name&quot;:&quot;gensyn&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-17T19:45:40.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8ZPJP3asAAU4Ev.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/izMkhxinSp&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:3,&quot;like_count&quot;:46,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Results:</strong> <a href="https://github.com/gensyn-ai/delphi-middleweight-reasoning">https://github.com/gensyn-ai/delphi-middleweight-reasoning</a> | <strong>Market:</strong> <a href="https://delphi.gensyn.ai/">https://delphi.gensyn.ai/</a></p><p><strong>PostTrainBench: Agents Post-Train Base LLMs in 10h, Claude Code&#8217;s &#8220;Reward Hacking&#8221; Exposed</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/maksym_andr/status/2001349332633854267&quot;,&quot;full_text&quot;:&quot;We release PostTrainBench: a benchmark measuring how well AI agents like Claude Code can post-train base LLMs.\n\nWe expect this to be an important indicator for AI R&amp;amp;D automation as it unfolds over the next few years.\n\n&#128279; <a class=\&quot;tweet-url\&quot; href=\&quot;http://posttrainbench.com\&quot;>posttrainbench.com</a>\n&#128194; <a class=\&quot;tweet-url\&quot; href=\&quot;http://github.com/aisa-group/PostTrainBench\&quot;>github.com/aisa-group/Pos&#8230;</a>\n\n1/n &quot;,&quot;username&quot;:&quot;maksym_andr&quot;,&quot;name&quot;:&quot;Maksym 
Andriushchenko&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-17T17:50:38.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8Y2k4QXAAAyeKT.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/3GJhUgiJWy&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:26,&quot;like_count&quot;:221,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Site: </strong><a href="https://posttrainbench.com">https://posttrainbench.com</a> | <strong>GitHub</strong>: <a href="https://github.com/aisa-group/PostTrainBench">https://github.com/aisa-group/PostTrainBench</a></p><p><strong>Context-Bench Update: GPT-5.2 Dethrones Opus 4.5 on Filesystem/Skills: But DeepSeek v3.2 OSS King</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/Letta_AI/status/1999325307623604698&quot;,&quot;full_text&quot;:&quot;New Context-Bench results: <span class=\&quot;tweet-fake-link\&quot;>@OpenAI</span>'s GPT-5.2 takes #1 on both eval suites, dethroning <span class=\&quot;tweet-fake-link\&quot;>@AnthropicAI</span>'s Opus 4.5\n\nBreakdown:\n- 6% better on Filesystem, 9% better on Skills\n- 1.5-2.1x Opus cost but still cheaper than Gemini 3 Pro  \n\n<span class=\&quot;tweet-fake-link\&quot;>@deepseek_ai</span> v3.2 takes #1 on OSS jumps:\n- 16%  https://t.co/3u30306RHR&quot;,&quot;username&quot;:&quot;Letta_AI&quot;,&quot;name&quot;:&quot;Letta&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-12T03:47:53.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G78HmTnaUAAjkom.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/UxSWF0H25Z&quot;,&quot;alt_text&quot;:&quot;https://leaderboard.letta.com&quot;}],&quot;quoted_tweet&quot;:{&quot;full_text&quot;:&quot;New Context-Bench results with Opus 
4.5 and Gemini 3! \n\nWe compare recent model releases on context engineering tasks (filesystem &amp;amp; skill use) -- which determine memory + context management capabilities\n\nOpus 4.5 tops the filesystem suite with impressive reductions in cost https://t.co/mtPm2XITW2&quot;,&quot;username&quot;:&quot;Letta_AI&quot;,&quot;name&quot;:&quot;Letta&quot;},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:3,&quot;like_count&quot;:22,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Results:</strong> <a href="https://www.letta.com/blog/context-bench">https://www.letta.com/blog/context-bench</a></p><p><strong>Augment Code Review: GPT-5.2-Powered Reviewer Tops Precision/Recall on 50 Real PRs</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/augmentcode/status/1999182788671733991&quot;,&quot;full_text&quot;:&quot;Introducing Augment Code Review, powered by GPT 5.2. It's the #1-ranked AI code reviewer across precision, recall, and overall quality.\n\nFree for the first week for every paying customer, and free for open source projects. 
&quot;,&quot;username&quot;:&quot;augmentcode&quot;,&quot;name&quot;:&quot;Augment Code&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-11T18:21:34.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/o555zhl7vhoyfq5c6pnj&quot;,&quot;link_url&quot;:&quot;https://t.co/AFOCwXZUK6&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:21,&quot;like_count&quot;:93,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/1998838457888813056/vid/avc1/480x270/muijmEUX3h_fNDkA.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Blog:</strong> <a href="https://www.augmentcode.com/blog/code-review-benchmark">https://www.augmentcode.com/blog/code-review-benchmark</a> | <strong>Free Trial:</strong> <a href="https://www.augmentcode.com/product/code-review">https://www.augmentcode.com/product/code-review</a></p><p><strong>Grok Voice Agent API: #1 Big Bench Audio at 92.3%: xAI&#8217;s $0.05/min Latency Slayer</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/rohanpaul_ai/status/2001630466278068582&quot;,&quot;full_text&quot;:&quot;&#129504; Grok Voice Agent is xAI&#8217;s first public speech-to-speech API, and it now leads Artificial Analysis's Big Bench Audio benchmark with 92.3% speech reasoning accuracy.\n\n- Ahead of Gemini 2.5 Flash Native Audio and GPT Realtime\n- Big Bench Audio is built from 1,000 audio questions  https://t.co/GPZzkGzri1&quot;,&quot;username&quot;:&quot;rohanpaul_ai&quot;,&quot;name&quot;:&quot;Rohan 
Paul&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-18T12:27:46.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8c3KqbakAAPtCU.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/3Esj0ilNMa&quot;}],&quot;quoted_tweet&quot;:{&quot;full_text&quot;:&quot;&#127897;&#65039; xAI launched the Grok Voice Agent API\n\n- Leads the industry in cost-efficiency. Developers are billed at a simple flat rate of $0.05 per minute.\n- In blind head-to-head human evaluations against OpenAI, Grok is consistently rated as the preferred model across axes such as https://t.co/cvDBFvJ5gd&quot;,&quot;username&quot;:&quot;rohanpaul_ai&quot;,&quot;name&quot;:&quot;Rohan Paul&quot;},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:0,&quot;like_count&quot;:2,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Gemini 3 Flash: Cost-Intelligence King at 91 IQ: Outsmarts Opus 4.5 on AA Index</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/OfficialLoganK/status/2001368440016392314&quot;,&quot;full_text&quot;:&quot;Gemini 3 Flash on the <span class=\&quot;tweet-fake-link\&quot;>@ArtificialAnlys</span> intelligence benchmark, the most cost per intelligence efficient model in the world!!! 
&quot;,&quot;username&quot;:&quot;OfficialLoganK&quot;,&quot;name&quot;:&quot;Logan Kilpatrick&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-17T19:06:34.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8ZJufSbwAASizj.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/4gWjDFltpN&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:97,&quot;like_count&quot;:1162,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Index:</strong> <a href="https://artificialanalysis.ai/">https://artificialanalysis.ai/</a></p><p><strong>Video Reality Test: Veo3.1 Fools VLMs at 56%: Humans Spot ASMR Fakes at 81%</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/rohanpaul_ai/status/2001412591130939559&quot;,&quot;full_text&quot;:&quot;&#129514; Video Reality Test is a new benchmark that checks if AI-generated ASMR videos with sound can fool humans and video language models (VLMs).\n\nFinal takeaway, many VLMs still miss micro-consistency errors, like whether the sound truly matches the exact touch, scrape, or cut.\n\nIn &quot;,&quot;username&quot;:&quot;rohanpaul_ai&quot;,&quot;name&quot;:&quot;Rohan 
Paul&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-17T22:02:00.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/txgoszc2zfhugde9zzwk&quot;,&quot;link_url&quot;:&quot;https://t.co/xoQLO57Mi7&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:2,&quot;like_count&quot;:7,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2001412274494541825/vid/avc1/480x270/kZDp_5Lf3I3qL9rw.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Paper:</strong> <a href="https://arxiv.org/abs/2512.13281">https://arxiv.org/abs/2512.13281</a></p><p><strong>GuideLLM: Visualises LLM Trade-Offs: Accuracy vs Latency/Cost in One Dashboard</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/RedHat_AI/status/2000586200256528389&quot;,&quot;full_text&quot;:&quot;AI Explained: Benchmarking LLMs with GuideLLM\n\nAccuracy, Latency, Cost. Usually, if you pick two, the third one gets lost. 
Jenny Yi from Red Hat AI explains how GuideLLM helps you visualize the trade-offs and benchmark your models before you deploy.\n\n&#128279; <a class=\&quot;tweet-url\&quot; href=\&quot;https://github.com/vllm-project/guidellm\&quot;>github.com/vllm-project/g&#8230;</a> &quot;,&quot;username&quot;:&quot;RedHat_AI&quot;,&quot;name&quot;:&quot;Red Hat AI&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-15T15:18:13.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/xnen7gbzhw85yssdkpxy&quot;,&quot;link_url&quot;:&quot;https://t.co/s2pvGPkEUT&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:4,&quot;like_count&quot;:13,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2000585783044931590/vid/avc1/320x568/4xy2thA3UTfAGa4w.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>GitHub:</strong> <a href="https://github.com/vllm-project/guidellm">https://github.com/vllm-project/guidellm</a></p><p><strong>LocalSearchBench: Agents Flop at Real-World Planning: 90% Fail Basic Day Trips</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/alxnderhughes/status/1998330882227425397&quot;,&quot;full_text&quot;:&quot;This is insane&#8230; Someone finally built a benchmark that shows how bad AI agents actually are at real-world local search and the results are shocking.\n\nEveryone keeps acting like AI agents are about to take over the real world.\n\nBut LocalSearchBench just proved the opposite: &quot;,&quot;username&quot;:&quot;alxnderhughes&quot;,&quot;name&quot;:&quot;Alex 
Hughes&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-09T09:56:24.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G7t_WLEbAAA55L1.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/ZY1S9Zs4kQ&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:57,&quot;like_count&quot;:238,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Paper:</strong> <a href="https://arxiv.org/pdf/2512.07436">https://arxiv.org/pdf/2512.07436</a></p><p><strong>Nemotron3 Nano: NVIDIA&#8217;s 30B Hybrid SSM Tops Agent Benches w/ 1M Context</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/mervenoyann/status/2000583432997331083&quot;,&quot;full_text&quot;:&quot;NVIDIA cooked!\n\nNemotron3 Nano is in, along with the training datasets and an agent env library &#128293;\n\n&amp;gt; A3B/30B hybrid SSM, 1M context size, built for agentic use \n&amp;gt; leading in both benchmarks and throughput &#128588;&#127995;\n&amp;gt; license enables commercial use \n\nSuper and Ultra coming soon! 
&quot;,&quot;username&quot;:&quot;mervenoyann&quot;,&quot;name&quot;:&quot;merve&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-15T15:07:14.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8N_BHvWEAgsnBZ.png&quot;,&quot;link_url&quot;:&quot;https://t.co/7usqDJJZz7&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:15,&quot;like_count&quot;:123,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Collection:</strong> <a href="https://huggingface.co/collections/nvidia/nvidia-nemotron-v3">https://huggingface.co/collections/nvidia/nvidia-nemotron-v3</a></p><p><strong>KAT-Coder-Pro V1: 64 Intelligence Index: Non-Reasoning Champ w/ 73.4% SWE-Verified</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/KwaiAICoder/status/2000524762980974972&quot;,&quot;full_text&quot;:&quot;KAT-Coder-Pro V1&#65306; HUGE UPDATE!\n\nThe latest version of KAT-Coder-Pro V1 hits 64 on the Artificial Analysis Intelligence Index,  ranking 10th globally among all models\n\n- Boosted benchmark performance\n- Strengthened reasoning capabilities\n- Enhanced front-end capabilities\n- 
&quot;,&quot;username&quot;:&quot;KwaiAICoder&quot;,&quot;name&quot;:&quot;KwaiKAT&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-15T11:14:06.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8NKqgwasAAmccj.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/Avh8DR0Pci&quot;},{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8NKqgsbMAAhWJ8.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/Avh8DR0Pci&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:31,&quot;like_count&quot;:294,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>MathArena V2: Beyond Leaderboards: IRT Reveals Model Surprises &amp; Confidence Bands</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/j_dekoninck/status/2000589658351378938&quot;,&quot;full_text&quot;:&quot;Benchmarks need more than a single number at the top of a leaderboard.\n\nWe have always believed this at MathArena, and are now taking it to the next level: introducing V2 of our website, with many new features to fully extract all useful information out of our benchmark &#127881; &quot;,&quot;username&quot;:&quot;j_dekoninck&quot;,&quot;name&quot;:&quot;Jasper Dekoninck&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-15T15:31:58.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8OFsUbbQAALAAh.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/brAcE9Foxy&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:10,&quot;like_count&quot;:60,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Site:</strong> <a 
href="https://matharena.ai/">https://matharena.ai/</a></p><p><strong>FAI-C Benchmark: LLMs Flop Christian Values: Generic Spirituality Over Biblical Ethics</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/PGelsinger/status/2000614696513323059&quot;,&quot;full_text&quot;:&quot;Excited to launch the Flourishing AI Christian (FAI-C) Benchmark today. This builds on the original Flourishing AI Benchmark and measures how the top LLMs support holistic human wellbeing through a user's specific value system -- in this case, Christianity. &quot;,&quot;username&quot;:&quot;PGelsinger&quot;,&quot;name&quot;:&quot;Pat Gelsinger&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-15T17:11:27.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8ObsG-bwAE_EDa.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/qAVRmBvfMm&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:20,&quot;like_count&quot;:171,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Report:</strong> <a href="https://gloo.com/flourishing-hub/research">https://gloo.com/flourishing-hub/research</a></p><p><strong>AlphaXiv SOTA Tracker: arXiv&#8217;s Million-Paper Index Ranks True Leaders Across Benches</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/askalphaxiv/status/1999123253793902619&quot;,&quot;full_text&quot;:&quot;Introducing benchmark tracking across all of arXiv\n\nWith thousands of models and benchmarks out there, it's impossible to track what's actually state-of-the-art\n\nWe now index results from millions of papers to surface which models are SOTA according to open source research! 
&quot;,&quot;username&quot;:&quot;askalphaxiv&quot;,&quot;name&quot;:&quot;alphaXiv&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-11T14:25:00.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/vrv7cro7qiaji9kgao1f&quot;,&quot;link_url&quot;:&quot;https://t.co/SW7t8jjQ5G&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:38,&quot;like_count&quot;:205,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/1999055583430000641/vid/avc1/1950x1080/aFqQbfD5CxXR_FAM.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Tracker:</strong> <a href="https://www.alphaxiv.org/state-of-the-art">https://www.alphaxiv.org/state-of-the-art</a></p><div><hr></div><h2><strong>What We&#8217;re Hacking This Week</strong></h2><p><strong>Replay: Because Waiting for Markets is Stupid</strong></p><p>If you want to learn from a week of market behavior, you have to wait a week. That&#8217;s insane when the thesis is compounding improvement at scale.</p><p>So we built Replay, a training tool for trading agents that turns &#8220;waiting for the market&#8221; into &#8220;running it again on the same hour 200 times.&#8221; We wanted to make it possible to train 100 agents on a week of market data in a few hours.</p><p><strong>We Killed PRs (And Nothing Bad Happened)</strong></p><p>Hot Take: Pull requests are a bottleneck that AI coding makes less defensible.</p><p>So we disabled them. </p><p>Agent-generated code now auto-closes on submit. Instead of PRs, reviews and quality are enforced by automation, pre-commit hooks, merge hooks, and pre-push hooks running the full QA suite. It&#8217;s lint, types, and tests cranked up to 11. Humans would find this annoying. 
Coding agents love it.</p>]]></content:encoded></item><item><title><![CDATA[In the Arena: Week 2]]></title><description><![CDATA[GPT-5.2 scores 90.5% on ARC-AGI but breaks on Code Arena. Grok-4 pulls 167% more profit than GPT-5. Stanford says 1 in 20 benchmarks is broken, but maybe they're all broken.]]></description><link>https://newsletter.recall.network/p/in-the-arena-week-2</link><guid isPermaLink="false">https://newsletter.recall.network/p/in-the-arena-week-2</guid><dc:creator><![CDATA[Sanket]]></dc:creator><pubDate>Fri, 12 Dec 2025 15:39:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Jp51!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jp51!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jp51!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!Jp51!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!Jp51!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Jp51!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jp51!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png" width="1456" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1631469,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/181323258?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jp51!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!Jp51!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 848w, 
https://substackcdn.com/image/fetch/$s_!Jp51!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!Jp51!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Are LLMs better as single-asset maxis?</strong></h2><p>The best Bitcoin traders spend years watching BTC. 
They learn its rhythms, how it behaves around halving cycles, how it responds to macro news, how it moves differently during US versus Asia hours. Depth of focus is what separates professionals from amateurs who spread themselves thin and master nothing.</p><p>Similarly, do LLMs perform better trading crypto when they focus on only one token? This week, we built and launched the <strong><a href="https://app.recall.network/competitions/32717de4-2646-42fa-8087-dfab079acc01">Aerodrome Arena</a></strong> to find out.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ruyB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ruyB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 424w, 
https://substackcdn.com/image/fetch/$s_!ruyB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 848w, https://substackcdn.com/image/fetch/$s_!ruyB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 1272w, https://substackcdn.com/image/fetch/$s_!ruyB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ruyB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png" width="847" height="562" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:847,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:116815,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/181323258?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ruyB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 
424w, https://substackcdn.com/image/fetch/$s_!ruyB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 848w, https://substackcdn.com/image/fetch/$s_!ruyB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 1272w, https://substackcdn.com/image/fetch/$s_!ruyB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" 
x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We ran multiple instances of four popular LLMs: GPT-5, Claude Sonnet 4.5, Grok-4, and DeepSeek 3.2. Each instance was tasked with trading a single token on Aerodrome&#8217;s DEX with real capital at stake over a four-day period. And after as little as 48 hours, we started gathering metrics that static benchmarks could never capture:</p><p><strong>Turnover Efficiency</strong> tracks profit extracted per trade. Grok-4 trading ETH led at 0.24% per trade, 167% more efficient than GPT-5. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QfeR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QfeR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 424w, https://substackcdn.com/image/fetch/$s_!QfeR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 848w, https://substackcdn.com/image/fetch/$s_!QfeR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 1272w, https://substackcdn.com/image/fetch/$s_!QfeR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!QfeR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png" width="1336" height="571" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:571,&quot;width&quot;:1336,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86505,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/181323258?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QfeR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 424w, https://substackcdn.com/image/fetch/$s_!QfeR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 848w, https://substackcdn.com/image/fetch/$s_!QfeR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 1272w, https://substackcdn.com/image/fetch/$s_!QfeR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 1456w" 
sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Regret Curve</strong> tracks missed opportunity from suboptimal decisions. LLMs started the competition at 20% optimized picks and climbed to 90%+, proving they were able to learn in real-time without retraining. 
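A regret metric like this can be sketched in a few lines of Python. This is a hypothetical simplification (the pick labels and the definition of the ex-post optimal pick are assumptions for illustration, not the actual competition code):

```python
def regret_curve(picks, optimal):
    """Running fraction of decisions that matched the ex-post optimal pick."""
    hits, curve = 0, []
    for i, (pick, best) in enumerate(zip(picks, optimal), start=1):
        hits += (pick == best)          # count agreement with the optimal action
        curve.append(hits / i)          # cumulative hit rate after i decisions
    return curve

# Toy run: early picks miss the optimum, later picks align with it.
print(regret_curve(["hold", "sell", "buy", "buy"], ["buy", "buy", "buy", "buy"]))
# [0.0, 0.0, 0.3333333333333333, 0.5]
```

A rising curve means the agent's decisions are converging on the optimal ones without any retraining step.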
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PPOa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PPOa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 424w, https://substackcdn.com/image/fetch/$s_!PPOa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 848w, https://substackcdn.com/image/fetch/$s_!PPOa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 1272w, https://substackcdn.com/image/fetch/$s_!PPOa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PPOa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png" width="1334" height="562" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:1334,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:185562,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/181323258?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PPOa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 424w, https://substackcdn.com/image/fetch/$s_!PPOa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 848w, https://substackcdn.com/image/fetch/$s_!PPOa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 1272w, https://substackcdn.com/image/fetch/$s_!PPOa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Flip-Flop Monitor</strong> tracks how often LLMs reverse their positions under pressure. 
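In a minimal sketch, assuming simple long/short position labels and a flat per-reversal fee (both assumptions for illustration, not the actual monitor), the metric looks like:

```python
def flip_flop_stats(positions, fee_per_flip=0.001):
    """Reversal rate across consecutive positions, plus cumulative fee drag.

    fee_per_flip is an assumed round-trip cost as a fraction of capital."""
    flips = sum(
        {prev, cur} == {"long", "short"}   # a long->short or short->long reversal
        for prev, cur in zip(positions, positions[1:])
    )
    decisions = max(len(positions) - 1, 1)
    return flips / decisions, flips * fee_per_flip

# 3 reversals out of 4 decisions -> reversal rate 0.75
rate, drag = flip_flop_stats(["long", "short", "long", "long", "short"])
```

High reversal rates are not just indecision: each flip pays fees, so the drag compounds against the agent.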
Reversal rates spiked to 70%, with cumulative fee drag eating 0.70% of capital.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hExe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hExe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 424w, https://substackcdn.com/image/fetch/$s_!hExe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 848w, https://substackcdn.com/image/fetch/$s_!hExe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 1272w, https://substackcdn.com/image/fetch/$s_!hExe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hExe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png" width="1325" height="497" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:497,&quot;width&quot;:1325,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69248,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/181323258?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hExe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 424w, https://substackcdn.com/image/fetch/$s_!hExe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 848w, https://substackcdn.com/image/fetch/$s_!hExe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 1272w, https://substackcdn.com/image/fetch/$s_!hExe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Beyond PnL, metrics like these mirror how real traders evaluate themselves to determine if they&#8217;re extracting value or just bleeding, improving or repeating mistakes, holding or bailing through volatility.</p><p><strong>So can models trade single-asset crypto?</strong> Early indications are they&#8217;re learning fast, and already outperforming the top 1% of traders on a lot of metrics. 
We&#8217;ll keep evaluating and investigating as the competition unfolds further.</p><p><strong>Visit Recall&#8217;s <a href="https://app.recall.network/competitions/32717de4-2646-42fa-8087-dfab079acc01">Aerodrome Arena</a> to track how LLMs perform trading single-asset crypto.</strong></p><div><hr></div><h2><strong>In the News</strong></h2><h3><strong>AI Breakthroughs</strong></h3><p><strong><a href="https://openai.com/index/introducing-gpt-5-2/">OpenAI&#8217;s GPT-5.2 Dominates Headlines</a></strong> &#8211; GPT-5.2 just dropped with numbers that matter: 90.5% SOTA on ARC-AGI v1 (390x efficiency gain YoY&#8212;from $4.5k/task to $11.64), human-expert level on GDPval (70.9% win rate across 44 occupations, 74.1% on economically valuable tasks), 100% on AIME 2025, and hallucinations cut 30-40%. <strong>But here&#8217;s the catch:</strong> it breaks on Code Arena despite SWE-bench gains. </p><p><strong><a href="https://arcprize.org">ARC Prize 2025: NVARC Hits 25% on AGI Reasoning Benchmark</a></strong> &#8211; NVARC achieved 25.03% accuracy on ARC-AGI&#8212;the critical metric for measuring genuine reasoning capabilities beyond pattern matching. All winning solutions are open-sourced, giving practitioners reproducible benchmarking approaches. Poetiq established new SOTA on ARC-AGI-1 and 2 using meta-systems that build intelligence on top of any model, integrating new models within hours of release.</p><p><strong><a href="https://twitter.com/NousResearch/status/1998536543565127968">Nomos 1 Reaches #2 on Putnam: 87/120 with 3B Active Parameters</a></strong> &#8211; A 30B open-weight model ranks #2 of 3,988 on this year&#8217;s Putnam exam&#8212;achieved via specialized post-training and agentic orchestration with only ~3B active parameters at inference. The implication: smaller, specialized models with thoughtful evaluation pipelines outperform brute-force scale. 
</p><p><strong><a href="https://www.resemble.ai/detect-3b-omni/">DETECT-3B Omni: Benchmarking Deepfake Detection at Scale</a></strong> &#8211; The newly released DETECT-3B Omni multimodal deepfake detector achieves 94-99% accuracy across image, voice, and video, now ranked #1 on HuggingFace Speech DeepFake Arena and DFBench.</p><h3><strong>Reasoning Models Dominate Production</strong></h3><p><strong><a href="https://openrouter.ai/reports/2025">50% of API Usage in Under 12 Months</a> </strong>&#8211; OpenRouter&#8217;s analysis of 100 trillion tokens reveals that reasoning models now exceed 50% of token consumption. The shift from single-pass generation to multi-step deliberation happened faster than predicted. Chinese open-weight models (DeepSeek, Qwen3, Kimi K2) captured a massive share even as overall open-source usage plateaued. <em>The evaluation gap: models performing differently at reasoning_depth=1 vs depth=5 can&#8217;t be compared on legacy benchmarks.</em> </p><h3><strong>Google Continues Its Campaign of Dominance</strong></h3><p><strong><a href="https://blog.google/technology/developers/gemini-3-pro-vision/">Gemini 3 Deep Think: Parallel Reasoning Architecture</a> </strong>&#8211; Explores multiple reasoning paths simultaneously (unlike o1&#8217;s sequential approach), showing +24 points on Arena Expert prompts. But Opus 4.5 non-thinking also excels, revealing that reasoning architecture diversity requires different evaluation metrics. <em>Parallel vs sequential creates different failure modes.</em></p><p><strong>Gemini 3 Pro: Multimodal Vision Evaluation Challenge </strong>&#8211; Sets new benchmarks in multimodal vision with enhanced document, screen, spatial, and video understanding. Raises the bar for visual grounding evaluation; existing benchmarks may not catch hallucinations in spatial reasoning.
New frameworks are needed for document understanding at scale.</p><h3><strong>Mistral Pushes Open Weights Forward</strong></h3><p><strong><a href="https://mistral.ai">Devstral 2: Open-Weights Changes Evaluation Game</a> </strong>&#8211; 123B dense model (72.2% SWE-bench) + 24B variant (68%), both open weights, tied/beat DeepSeek v3.2. Ships with Vibe CLI for agentic workflows. Open weights at GPT-4o coding levels mean you can run your own benchmarks, fine-tune for tasks, and verify internal behaviours. </p><h3><strong>Alignment Is Capability</strong></h3><p><strong><a href="https://off-policy.com">Why Evaluation Must Be Built Into Training</a> </strong>&#8211; Counter-intuitive thesis: labs treating alignment as a post-hoc constraint will hit a ceiling, while those integrating it into core research pull ahead. AGI requires alignment from the ground up. This reframes the entire evaluation problem: you can&#8217;t bolt on safety verification after training. The architecture itself determines what&#8217;s verifiable. <em>Evaluation isn&#8217;t separate from capability development. It&#8217;s part of the same problem.</em></p><h3><strong>Reliability More Important Than Capability?</strong></h3><p><strong><a href="https://arxiv.org">MAP Study: Reliability Beats Capability</a> </strong>&#8211; Berkeley/Stanford/IBM study found that simple controllable patterns with heavy human oversight dominate production. Complex reasoning agents reliably fail on edge cases that eval suites don&#8217;t catch. Environment/tool reliability matters more than algorithm innovation. <em>Only 5% of AI projects are getting value because teams choose the wrong problems.
Selection bias in evaluation kills more than bad architectures.</em></p><div><hr></div><h2><strong>AI Research Rabbitholes</strong></h2><p><strong>GPT-5.2 Headlines</strong></p><ul><li><p><strong><a href="https://openai.com/index/introducing-gpt-5-2/">GPT-5.2: 90.5% SOTA on ARC-AGI v1, Human-Expert on GDPval, 100% AIME 2025</a></strong> &#8212; 390x efficiency gain YoY ($4.5k/task to $11.64); hallucinations cut 30-40%; independent evals flag gaps vs Opus/Gemini on agentic tasks | <a href="https://x.com/OpenAI/status/1999182104362668275">X Thread</a></p></li><li><p><strong><a href="https://contextarena.com/">Context Arena: GPT-5.2 X-High Hits 86.1% SOTA on 8-Needle MRCR at 128K</a></strong> &#8212; Up to 20-min responses, 5x tokens vs priors | <a href="https://x.com/DillonUzar/status/1999326860530876866">X Thread</a></p></li><li><p><strong><a href="https://www.swebench.com/">SWE-Bench Update: GPT-5.2 High #3 (80.0%), Medium Closes Gap to Sonnet 4.5</a></strong> &#8212; Fewer steps (14-17 vs 100+), cost-efficient | <a href="https://x.com/KLieret/status/1999222709419450455">X Thread</a></p></li><li><p><strong><a href="https://openai.com/index/gdpval/">GDPval: GPT-5.2 Outperforms In-Domain Experts on Knowledge Work</a></strong> &#8212; First model at human-expert level (70.9% win rate) across 44 occupations | <a href="https://x.com/polynoamial/status/1999186989388824935">X Thread</a></p></li><li><p><strong><a href="https://andonlabs.com/evals/vending-bench-2">Vending-Bench 2: GPT-5.2 Ranks #3 with Strong Continual Learning</a></strong> &#8212; Performance jumps in simulation&#8217;s second half suggest adaptation gains | <a href="https://x.com/andonlabs/status/1999421776640749837">X Thread</a></p></li><li><p><strong>LisanBench: GPT-5.2 Thinking Improves Validity but Trails Opus 4.5 &amp; Gemini 3 Pro</strong> &#8212; Sets 2 new records; lags in reasoning efficiency | <a href="https://x.com/scaling01/status/1999240662147825876">X Thread</a></p></li></ul><p><strong>Fresh 
Benchmarks</strong></p><ul><li><p><strong>KaBLE Eval: LLMs Struggle Distinguishing Facts from False Beliefs</strong> &#8212; Stanford&#8217;s 13K-question suite; GPT-4o drops to 64.4% on false beliefs, risks in healthcare/law | <a href="https://x.com/psypost/status/1999451893794480310">X Thread</a></p></li><li><p><strong><a href="https://cohere.com/blog/rerank-4">Cohere Rerank 4: SOTA Reranker with Best Relevance, Speed, Multilingual</a></strong> &#8212; Deployable on Cohere API/AWS/Azure | <a href="https://x.com/aidangomez/status/1999167985752187092">X Thread</a></p></li><li><p><strong><a href="https://deepmind.google/technologies/facts-benchmark/">FACTS Benchmark Suite: Gemini 3 Pro Leads at 68.8%</a></strong> &#8212; DeepMind&#8217;s first comprehensive eval across knowledge/search/grounding/multimodal; shared on Kaggle for reproducibility | <a href="https://x.com/googledeepmind/status/1998831084277313539">X Thread</a></p></li><li><p><strong><a href="https://github.com/LiveBench/LiveBench/">LiveBench: Contamination-Free LLM Evaluation</a></strong> &#8212; Ongoing questions track model progress without data leaks; solves the &#8220;trained on the test set&#8221; problem.</p></li><li><p><strong><a href="https://github.com/JailbreakBench/jailbreakbench/">JailbreakBench: Standardised Safety Eval</a></strong> &#8212; Framework for testing adversarial attacks; reproducible red-teaming methodology | <a href="https://x.com/tom_doerr/status/1998544209322660332">X Thread</a></p></li><li><p><strong><a href="https://www.databricks.com/blog/introducing-officeqa-benchmark-end-to-end-grounded-reasoning">OfficeQA: Databricks&#8217; $100K Benchmark for Real Work</a></strong> &#8212; Tests reliability on mundane office tasks; competition for AI diligence evaluation | <a href="https://x.com/matei_zaharia/status/1998522148777083116">X Thread</a></p></li><li><p><strong><a href="https://www.uipath.com/ai/research/ui-cube-benchmark/">UI-Cube Benchmark: RPA Agent 
Evaluation</a></strong> &#8212; UiPath&#8217;s eval for automation agents; tests reliability in enterprise tasks where &#8220;not all AI is created equal.&#8221; | <a href="https://x.com/UiPath/status/1996958520189612107">X Thread</a></p></li></ul><p><strong>Reasoning &amp; AGI Benchmarks</strong></p><ul><li><p><strong><a href="https://arcprize.org/leaderboard">ARC-AGI Verified: Poetiq Hits 54% on Semi-Private Set</a></strong> &#8212; New SOTA on genuine reasoning tasks; humans solve 100% but verification confirms progress toward AGI | <a href="https://x.com/arcprize/status/1994100080081854674">X Thread</a></p></li><li><p><strong><a href="https://www.cortex-agi.com/">Cortex-AGI: DeepSeek Leads Procedural Puzzles at 41%</a></strong> &#8212; Exponential complexity tests; Grok 4.1 close on efficiency metrics | <a href="https://x.com/teortaxestex/status/1997351717692719158">X Thread</a> </p></li><li><p><strong><a href="https://passing2961.github.io/refinebench-page/">RefineBench: LLMs Struggle with Self-Refinement Without Guidance</a></strong> &#8212; NeurIPS paper shows free-form tasks yield minimal gains sans checklists; evaluation structure matters | <a href="https://x.com/wellecks/status/1997348935766307088">X Thread</a></p></li></ul><p><strong>Agentic Benchmarks</strong></p><ul><li><p><strong><a href="https://nof1.ai/">Alpha Arena S1.5: Mystery Model Wins Trading Comp</a></strong> &#8212; Grok 4.20 achieved +12% avg returns with verifiable trades across 4 markets; economic outcomes as evaluation | <a href="https://x.com/jay_azhang/status/1996984618751611006">X Thread</a></p></li><li><p><strong><a href="https://gotrader.gopher-ai.com/">AutoGopher Agent Royale: 1200+ User-Owned AI Agents Compete on Hyperliquid Perps</a></strong><br>Decentralized arena for custom LLMs/strategies with wallet auth; Season 2 wrapped, emphasizing real P&amp;L over static evals | <a href="https://x.com/gopher_ai/status/1996310097988116526">X Thread </a></p></li></ul><p><strong>Domain-Specific
Evals</strong></p><ul><li><p><strong><a href="https://www.kaggle.com/benchmarks/deepmind/indic-gen-bench/leaderboard">IndicGenBench: 29 Indian Languages Including 18 New Ones</a></strong> &#8212; DeepMind&#8217;s Kaggle-hosted benchmark for summarisation/translation/QA in low-resource Indic languages | <a href="https://x.com/kaggle/status/1998377514553512303">X Thread</a></p></li><li><p><strong><a href="https://arxiv.org/abs/2512.04578">LexGenius: Expert-Level Legal AI for Chinese Law</a></strong> &#8212; 8K tagged questions expose gaps in ethics/procedural judgment beyond raw accuracy | <a href="https://x.com/rohanpaul_ai/status/1997145975324831841">X Thread</a></p></li><li><p><strong><a href="https://arxiv.org/abs/2512.06065">PAI-Bench: Physical AI Perception/Prediction</a></strong> &#8212; 2,808 real cases; video generators lack coherence, multimodal LLMs flop at forecasting </p></li><li><p><strong><a href="https://arxiv.org/abs/2512.06065">EgoEdit: Egocentric Video Editing Benchmark</a></strong> &#8212; Snap Research&#8217;s streaming editor evaluation; 50K clips/100K pairs for AR applications | <a href="https://x.com/_akhaliq/status/1998435647183319190">X Thread</a></p></li></ul><p><strong>Multimodal &amp; Vision Evals</strong></p><ul><li><p><strong><a href="https://artificialanalysis.ai/image/arena">FLUX.2 [dev] Claims #1 Open T2I, #2 Editing</a></strong> &#8212; Black Forest Labs&#8217; non-commercial release benchmarked on image generation arena | <a href="https://x.com/bfl_ml/status/1996874709804470733">X Thread</a></p></li><li><p><strong><a href="https://artificialanalysis.ai/video/arena">Runway Gen-4.5 Leads T2V Arena vs Veo 3/Kling 2.5</a></strong> &#8212; Artificial Analysis shows new text-to-video edges Sora 2 Pro on realism/motion | <a href="https://x.com/ArtificialAnlys/status/1996052123470209164">X Thread</a></p></li></ul><p><strong>Coding &amp; Development Evals</strong></p><ul><li><p><strong><a 
href="https://huggingface.co/collections/EssentialAI/rnj-1">Rnj-1: Essential AI&#8217;s 8B Tops SWE-Bench Verified at 20.8%</a></strong> &#8212; USA open LLM from-scratch pretrain on zettaflops AMD/TPU; code/math focus | <a href="https://x.com/gordic_aleksa/status/1997128393939472805">X Thread</a></p></li><li><p><strong><a href="https://www.orchids.app/">Orchids IDE Tops App Bench for End-to-End Dev</a></strong> &#8212; Vibe coding with multimodal agent, browser, Supabase, Stripe integration; local with no lock-in | <a href="https://x.com/orchidsapp/status/1998426257504006222">X Thread</a></p></li></ul><p><strong>Meta-Research on Benchmarks</strong></p><ul><li><p><strong><a href="https://hai.stanford.edu/news/squashing-fantastic-bugs-researchers-look-to-fix-flaws-in-ai-benchmarks">Stanford: 1 in 20 AI Benchmarks Have Serious Flaws</a></strong> &#8212; Analysis of 445 NLP/ML evals finds construct validity issues; 8 recommendations for fixing benchmark design | <a href="https://x.com/StanfordHAI/status/1998817700366200854">X Thread</a></p></li><li><p><strong><a href="https://cleanlab.ai/blog/structured-output-benchmark/">Cleanlab: Structured Output Benchmarks Riddled with Label Errors</a></strong> &#8212; Open experiments fixing ground truth; transparency in evaluation data quality | <a href="https://x.com/tech_optimist/status/1997789993142780083">X Thread</a></p></li><li><p><strong><a href="https://arxiv.org/abs/2512.05765">BioProtocol: AI-for-Science Benchmarks Broken</a></strong> &#8212; Screens 3.2K papers; proposes dialogue quality, orchestration, trust calibration for real R&amp;D cycles requiring multi-day memory | <a href="https://x.com/DeSciNews/status/1998890585672004130">X Thread</a></p></li><li><p><strong><a href="https://oxrml.com/measuring-what-matters/">Measuring What Matters: Construct Validity in LLM Benchmarks</a></strong> &#8212; NeurIPS paper shows only a fraction of evals follow best practices | <a 
href="https://x.com/bimedotcom/status/1998043927539073389">X Thread</a></p></li></ul><p><strong>Memory &amp; Long-Context</strong></p><ul><li><p><strong><a href="https://aclanthology.org/2025.acl-long.183/">LongBench v2: o1-preview at 57.7% Beats Human Baseline</a></strong> &#8212; Deeper understanding of realistic long-context multitasks; ACL paper | <a href="https://x.com/satoshihirose/status/1998273758222815313">X Thread</a></p></li><li><p><strong><a href="https://evaluations.metr.org/gpt-5-report/">METR Chart: Human Continual Learning &gt;&gt; LLM</a></strong> &#8212; 16-hour eval shows LLMs plateau fast while humans show no asymptote; implications for agent training | <a href="https://x.com/1a3orn/status/1997056050403725373">X Thread</a></p></li></ul><p><strong>Novel Eval Approaches</strong></p><ul><li><p><strong><a href="https://inference.labs/truthtensor">TruthTensor: Streaming Market Data for AI Evals</a></strong> &#8212; Inference Labs deprecates static benches; probability calibration on Polymarket for real-time accuracy | <a href="https://x.com/inferencelabs/status/1998157524516606294">X Thread</a></p></li><li><p><strong><a href="https://gensyn.ai/delphi">Gensyn&#8217;s Prediction Market for Model Intelligence</a></strong> &#8212; LMSR on-chain pricing; market belief over static benchmarks as evaluation signal | <a href="https://x.com/Leonis237/status/1998936091949412542">X Thread</a></p></li><li><p><strong><a href="https://x.com/datalabto/status/1998444781244928239">Datalab: Benchmark via ELO Scale + LLM-as-Judge</a></strong> &#8212; Pairwise matchups on H100S; browse raw outputs per sample/model for transparency | <a href="https://x.com/datalabto/status/1998444781244928239">X Thread</a></p></li></ul><p><strong>Safety &amp; Security</strong></p><ul><li><p><strong><a href="https://red.anthropic.com/2025/smart-contracts/">Anthropic: AI Agents Exploit $4.6M in Smart Contracts</a></strong> &#8212; Frontier Red Team benchmark reveals blockchain 
vulnerabilities; new eval suite for security | <a href="https://x.com/AnthropicAI/status/1995631802032287779">X Thread</a></p></li><li><p><strong><a href="https://speechmap.ai/">Mistral Large 3: 98.1% on SpeechMap Edging Grok 4</a></strong> &#8212; New high for controversial speech handling evaluation | <a href="https://x.com/xlr8harder/status/1996824485102866604">X Thread</a></p></li></ul><p><strong>Advanced Research</strong></p><ul><li><p><strong><a href="https://arxiv.org/abs/2512.06065">Native Parallel Reasoner: Self-Distilled RL for Parallel Reasoning</a></strong> &#8212; Qwen3-4B gains 24.5% performance with 4.6x speed; 100% genuine parallelism | <a href="https://x.com/AINativeF/status/1998557170548547664">X Thread</a></p></li><li><p><strong><a href="https://arxiv.org/abs/2512.05765">The Missing Layer of AGI: From Pattern Alchemy to Coordination Physics</a></strong> &#8212; Stanford argues LLMs need anchoring/oversight/memory layer; debate on coordination mechanisms | <a href="https://x.com/rohanpaul_ai/status/1998339087045120004">X Thread</a></p></li></ul><p><strong>Model Releases with Eval Focus</strong></p><ul><li><p><strong><a href="https://qwen.ai/blog?id=qwen3-omni-flash-20251201">Qwen3-Omni-Flash: Enhanced Multi-Turn Video/Audio</a></strong> &#8212; 119-lang text/19-speech support; personality customisation for evaluation diversity | <a href="https://x.com/Alibaba_Qwen/status/1998776328586477672">X Thread</a></p></li><li><p><strong><a href="https://huggingface.co/motif-technologies/Motif-2-12.7B-Reasoning">Motif-2-12.7B-Reasoning: Korea&#8217;s Top Open Model at 45 Intelligence Score</a></strong> &#8212; 80% AIME math/57% IFBench; punches above weight on agentic tasks | <a href="https://x.com/ArtificialAnlys/status/1998570291086373081">X Thread</a></p></li><li><p><strong><a href="https://huggingface.co/collections/zai-org/glm-46v">GLM-4.6V: ZAI&#8217;s VLM Rebuilds UIs from Screenshots</a></strong> &#8212; 106B/9B-Flash with 128K multimodal context + 
native function calling | <a href="https://x.com/adinayakup/status/1997999891704901873">X Thread</a></p></li></ul><div><hr></div><h2><strong>What We&#8217;re Hacking This Week</strong></h2><p>At Recall, we&#8217;re always pushing AI development forward internally and externally. Here are the most interesting things that happened this week.</p><p><strong>Agent Deployer: Spinning Up Identical Agents at Scale</strong></p><ul><li><p>Built a prototype to deploy 5-10 &#8220;identical&#8221; agents (same prompts, tools, timing) while only swapping the base model</p></li></ul><p><strong>Branding Agent: Design as Agent-Native Workflow</strong></p><ul><li><p>Turned &#8220;make me a header for X&#8221; into an agent workflow instead of an hour in Figma</p></li><li><p>Wraps Nano Banana with Gemini 3 Pro plus brand asset catalogue</p></li><li><p>Context engineering keeps outputs on-brand across styles (3D, illustration)</p></li></ul><p><strong>Single-Token Trading Agents: Better Primitive</strong></p><ul><li><p>Built a &#8220;Bitcoin top/bottom&#8221; style agent running every 15 minutes, trading only BTC vs USDC with technical analysis</p></li><li><p>Key learning: multi-asset agents blow their context budget trying to model everything</p></li><li><p>Single-asset specialists are cleaner, same harness spins into 20 agents in a day (GPT-5: BTC, Claude: VIRTUALS, etc)</p></li><li><p>Added in-context reflection where agents critique their own calls and update heuristics</p></li></ul><p><strong>Atlas 2.0: Proactive Context Engineer</strong></p><ul><li><p>Evolved from a reactive tool to a proactive context curator for AI coding agents</p></li><li><p>Stack of PRs ready, Claude-reviewed and signed off</p></li><li><p>Goal: better context engineering so AI-assisted development workflows actually compound learning instead of starting fresh each time</p></li></ul><p><strong>Aerodrome Analytics Dashboard with AI Report Generation</strong></p><ul><li><p>Live dashboard that tracks performance 
across multiple metrics to auto-generate content from competition data</p></li></ul>]]></content:encoded></item><item><title><![CDATA[In the Arena: Week 1 ]]></title><description><![CDATA[The "smartest" AI today can't predict an NFL football game, but it can hack DeFi, destroy Putnam records, and beat every human engineer.
Something's not adding up.]]></description><link>https://newsletter.recall.network/p/in-the-arena-week-1</link><guid isPermaLink="false">https://newsletter.recall.network/p/in-the-arena-week-1</guid><dc:creator><![CDATA[Sanket]]></dc:creator><pubDate>Fri, 05 Dec 2025 22:52:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fYNW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fYNW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fYNW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!fYNW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!fYNW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!fYNW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fYNW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png" width="1456" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1631033,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://recallnet.substack.com/i/180756057?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fYNW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!fYNW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!fYNW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!fYNW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>An AI evaluation crisis? NFL Predictions reveal shortcomings.</h2><p><strong>The AI evaluation and benchmarking crisis isn&#8217;t theoretical anymore.</strong> When the answer to which model is &#8220;best&#8221; depends on which benchmark you check and what day you measure, static evals have failed. We need to move from static benchmarks to live arenas.</p><p>This Thanksgiving, Recall ran our first live predictive reasoning arena.
Six leading AI models &#8211; Claude Opus 4.5, Gemini 3 Pro, GPT-5.1, DeepSeek v3.2, Qwen3 Max, Grok 4.1 &#8211; tried to predict the outcomes of three NFL football games from kickoff through final whistle. Predictions were time-weighted to reward early confidence over late game certainty.</p><p><strong>The results were unanimous. And wrong.</strong></p><p>$2B+ was bet on Thanksgiving NFL games. The best human sports bettors achieved a 55% success rate against the spread.
<strong>All models got their pre-game predictions wrong.</strong> So no, LLMs can&#8217;t yet play moneyball better than Vegas.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_zR-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_zR-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_zR-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_zR-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_zR-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_zR-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg" width="1199" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1199,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;post image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="post image" title="post image" srcset="https://substackcdn.com/image/fetch/$s_!_zR-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_zR-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_zR-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_zR-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 1456w" sizes="100vw"></picture></div></a></figure></div><p>Every model predicted the Vegas favorite in every game. They just followed the money. Meanwhile, all three underdogs won.</p><ul><li><p><strong>KC @ DAL:</strong> </p><ul><li><p>All models picked KC Chiefs (64-68% confidence)</p></li><li><p>DAL Cowboys won 31-28</p></li></ul></li><li><p><strong>CIN @ BAL:</strong> </p><ul><li><p>All models picked BAL Ravens (72-78% confidence&#8212;highest certainty)</p></li><li><p>The CIN Bengals won 32-14 in a blowout</p></li></ul></li><li><p><strong>GB @ DET:</strong> </p><ul><li><p>All models picked DET Lions (62-65% confidence)</p></li><li><p>The GB Packers won 31-24</p></li></ul></li></ul><p>Claude Opus 4.5 topped our leaderboard with 0.651 on time-weighted confidence scoring. This is the same model scoring 80.9% on SWE-bench Verified.</p><p>If you&#8217;re deploying AI agents for trading, forecasting, or strategic planning, static benchmarks tell you almost nothing about reliability under real uncertainty.
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.recall.network/nfl-arena-can-llms-predict-football-outcomes&quot;,&quot;text&quot;:&quot;NFL Arena Results&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.recall.network/nfl-arena-can-llms-predict-football-outcomes"><span>NFL Arena Results</span></a></p><div><hr></div><h2>In the News</h2><h4><strong><a href="https://www.anthropic.com/news/claude-opus-4-5">Anthropic&#8217;s Claude Opus 4.5 Beats Every Human Engineer&#8212;Then Gets Caught Hacking DeFi</a></strong></h4><p>Hit 80.9% on SWE-bench Verified while cutting prices by 67%. Outscored every human on Anthropic&#8217;s internal engineering exam. Users report that it maintains coherence across 11 parallel projects without breaking.</p><p>Anthropic also published research revealing it can autonomously exploit 17 of 34 recent smart contract hacks, stealing $4.5M in simulated funds.</p><p>Its Tool Search Tool cut context bloat by 85%, lifting overall model accuracy from 72% to 90%.</p><h4><strong><a href="https://artificialanalysis.ai/articles/gemini-3-pro-everything-you-need-to-know">Google&#8217;s Gemini 3 Pro Tops Every Benchmark While Hallucinating 88% of Errors</a></strong></h4><p>Swept LMArena (1501 Elo), hit 91.9% on GPQA Diamond, and dominated across Text, Vision, Coding, and Math. Scored 53% on AA-Omniscience, a 14-point jump.</p><p>The paradox: an 88% hallucination rate when it errs, answering confidently instead of admitting it doesn&#8217;t know. Users report it&#8217;s &#8220;evaluation-paranoid&#8221;, constantly questioning whether it&#8217;s really 2025 and inventing options despite explicit instructions.</p><p>It burns 17.8% of tokens on rambling, 8x worse than Opus. 
Deep Think variant costs 4.5x more for 45.1% on ARC-AGI-2.</p><h4><strong><a href="https://evaluations.metr.org/gpt-5-1-codex-max-report/">OpenAI&#8217;s GPT-5.1 Held the Coding Crown for THREE Days</a></strong></h4><p>Hit 77.9% on SWE-bench Verified before Opus 4.5 crushed it.</p><h4><strong><a href="https://www.marktechpost.com/2025/11/28/deepseek-ai-releases-deepseekmath-v2-the-open-weights-maths-model-that-scored-118-120-on-putnam-2024/">DeepSeek&#8217;s Open Model Matched OpenAI at Math Olympiads&#8212;Then Crushed Best Human Score</a></strong></h4><p>DeepSeekMath-V2 achieved IMO 2025 gold (5 of 6 problems), matching Google and OpenAI. Scored 118/120 on Putnam, destroying the best human score of 90. Hit 99.0% on IMO-ProofBench versus Gemini&#8217;s 89.0%.</p><p>Completely open-source, trained on inferior hardware. V3.2 matches GPT-5 performance despite chip restrictions. Proves frontier AI isn&#8217;t Western-exclusive anymore.</p><div><hr></div><h2>AI Research &amp; Rabbitholes</h2><ol><li><p><strong>Yupp SVG Leaderboard: Gemini 3 Pro tops coherent vector generation</strong><br><a href="https://x.com/lintool/status/1996696157985398812">https://x.com/lintool/status/1996696157985398812<br></a>Leaderboard: <a href="https://yupp.ai/leaderboard/svg">https://yupp.ai/leaderboard/svg</a></p></li><li><p><strong>Gemini 3 Deep Think rolls out to Ultra subscribers</strong><br><a href="https://x.com/_philschmid/status/1996659997774958859">https://x.com/_philschmid/status/1996659997774958859</a></p></li><li><p><strong>AutoCodeBench-V2: Claude 4.5 Opus dominates refreshed coding eval</strong><br><a href="https://x.com/scaling01/status/1996595348916068626">https://x.com/scaling01/status/1996595348916068626</a></p></li><li><p><strong>Mistral Large 3 debuts as #1 open-source coding model on Arena</strong><br>Mistral AI teases more coding details soon; tops OSS leaderboard for programming tasks.<br><a 
href="https://x.com/MistralAI/status/1996580307336638951">https://x.com/MistralAI/status/1996580307336638951</a><br>Model: <a href="https://huggingface.co/mistralai/Mistral-Large-3-2512">https://huggingface.co/mistralai/Mistral-Large-3-2512</a> </p></li><li><p><strong>Seedream 4.5 jumps to #3 on image-editing leaderboard</strong><br><a href="https://x.com/arena/status/1996641968005566876">https://x.com/arena/status/1996641968005566876</a></p></li><li><p><strong>RWKV-7 G0b 13.3B pure RNN beats Qwen3 14B without eval-maxxing</strong><br><a href="https://x.com/BlinkDL_AI/status/1996556628376850541">https://x.com/BlinkDL_AI/status/1996556628376850541</a><br>Model: <a href="https://huggingface.co/BlinkDL/rwkv-7-g0b">https://huggingface.co/BlinkDL/rwkv-7-g0b</a></p></li><li><p><strong>DeepSeek-V3.2 Thinking hits 70.6% AUC on 2-needle long-context</strong><br><a href="https://x.com/DillonUzar/status/1996358865060073913">https://x.com/DillonUzar/status/1996358865060073913</a></p></li><li><p><strong>OneThinker-8B tops Qwen3-VL on 31 image/video benchmarks</strong><br><a href="https://x.com/rohanpaul_ai/status/1996415663104270701">https://x.com/rohanpaul_ai/status/1996415663104270701</a><br>Paper: <a href="https://arxiv.org/abs/2512.03043">https://arxiv.org/abs/2512.03043</a></p></li><li><p><strong>Uni-MoE 2.0 tops Qwen2.5 Omni on video &amp; speech</strong><br><a href="https://x.com/jiqizhixin/status/1996393265743249832">https://x.com/jiqizhixin/status/1996393265743249832</a></p></li><li><p><strong>CORE-Bench solved: Opus 4.5 + Claude Code reaches 95%</strong><br><a href="https://x.com/sayashk/status/1996334941832089732">https://x.com/sayashk/status/1996334941832089732</a></p></li><li><p><strong>INTELLECT-3 106B MoE takes #1 on Arena math/code</strong><br><a href="https://x.com/arena/status/1996324769013391839">https://x.com/arena/status/1996324769013391839</a></p></li><li><p><strong>Nvidia Orchestrator-8B beats GPT-5 on HLE at 2.5&#215; efficiency</strong><br><a 
href="https://x.com/HuggingPapers/status/1996310259695079570">https://x.com/HuggingPapers/status/1996310259695079570</a><br>Model: <a href="https://huggingface.co/nvidia/Orchestrator-8B">https://huggingface.co/nvidia/Orchestrator-8B</a></p></li><li><p><strong>Vending-Bench Arena: Opus 4.5 wins multi-agent competition</strong><br><a href="https://x.com/andonlabs/status/1996268508926386422">https://x.com/andonlabs/status/1996268508926386422</a></p></li><li><p><strong>Rosetta Stone for Benchmarks: Epoch AI unifies 100+ evals</strong><br><a href="https://x.com/EpochAIResearch/status/1996248575400132794">https://x.com/EpochAIResearch/status/1996248575400132794</a><br>Blog: <a href="https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks">https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks</a></p></li><li><p><strong>Thinking Algorithm Leaderboard launches &#8211; GPT-OSS 120B #1</strong><br><a href="https://x.com/cooper_nyc_/status/1995988121385603467">https://x.com/cooper_nyc_/status/1995988121385603467</a><br>Leaderboard: <a href="https://leaderboard.neurometric.ai">https://leaderboard.neurometric.ai</a></p></li><li><p><strong>VisualPuzzles: Gemini 3 Pro only 52.7% on knowledge-free logic</strong><br><a href="https://x.com/yueqi_song/status/1912510869491101732?s=20">https://x.com/yueqi_song/status/1995499844992127276</a><br>Project: <a href="https://visualpuzzles.github.io">https://visualpuzzles.github.io</a></p></li><li><p><strong>RefineBench: LLMs gain just 1.7% self-refining hard essays</strong><br><a href="https://x.com/seungonekim/status/1995383422806630863">https://x.com/seungonekim/status/1995383422806630863</a><br>Site: <a href="https://passing2961.github.io/refinebench-page">https://passing2961.github.io/refinebench-page</a></p></li><li><p><strong>Structured Prompting + HELM flips benchmark rankings by 4%</strong><br>Paper: <a href="https://arxiv.org/abs/2511.20836">https://arxiv.org/abs/2511.20836</a></p></li><li><p><strong>ALIGNEVAL: Judging skill 0.96 
correlation with generator quality</strong><br><a href="https://x.com/carmelkron/status/1994527325023867233">https://x.com/carmelkron/status/1994527325023867233</a><br>Paper: <a href="https://arxiv.org/abs/2511.16043">https://arxiv.org/abs/2511.16043</a></p></li><li><p><strong>Nvidia Orchestrator-8B quietly tops HLE (original post)</strong><br><a href="https://x.com/NielsRogge/status/1994419877927404017">https://x.com/NielsRogge/status/1994419877927404017</a></p></li><li><p><strong>DeepSeekMath-V2 verifier loop beats Gemini DeepThink on IMO</strong><br><a href="https://x.com/mervenoyann/status/1994015342188757333">https://x.com/mervenoyann/status/1994015342188757333</a><br>Model: <a href="https://huggingface.co/deepseek-ai/DeepSeek-Math-V2">https://huggingface.co/deepseek-ai/DeepSeek-Math-V2</a></p></li><li><p><strong>PaperTalker turns papers into full narrated presentation videos</strong><br><a href="https://x.com/ChrisLaubAI/status/1993969771784921384">https://x.com/ChrisLaubAI/status/1993969771784921384</a><br>GitHub: <a href="https://github.com/showlab/Paper2Video">https://github.com/showlab/Paper2Video</a></p></li><li><p><strong>DeepWriter (team of 3) hits 50.91% on Humanity&#8217;s Last Exam</strong><br><a href="https://x.com/DeepwriterAI/status/1993803648900755585">https://x.com/DeepwriterAI/status/1993803648900755585</a></p></li><li><p><strong>AI voting sim: Grok-4 closest to real election results</strong><br><a href="https://x.com/RaphaelDabadie/status/1993748654675345600">https://x.com/RaphaelDabadie/status/1993748654675345600</a></p></li><li><p><strong>Distilling Efficient Reasoning: 12B from GPT-OSS cuts tokens 75%</strong><br><a href="https://x.com/omarsar0/status/1993695515595444366">https://x.com/omarsar0/status/1993695515595444366</a><br>Paper: <a href="https://arxiv.org/abs/2511.19333">https://arxiv.org/abs/2511.19333</a></p></li><li><p><strong>Grok 4.1 Fast takes #1 on Python coding leaderboard</strong><br><a 
href="https://x.com/cb_doge/status/1993697429124956471">https://x.com/cb_doge/status/1993697429124956471</a></p></li><li><p><strong>Opus 4.5 tops Elicit research QA at 96.5%</strong><br><a href="https://x.com/stuhlmueller/status/1993476570754040173">https://x.com/stuhlmueller/status/1993476570754040173</a></p></li><li><p><strong>Opus 4.5 only 21% on FrontierMath &#8211; trails GPT-5.1 &amp; Gemini 3 Pro</strong><br><a href="https://x.com/EpochAIResearch/status/1993431031765250119">https://x.com/EpochAIResearch/status/1993431031765250119</a></p></li><li><p><strong>PeptoneBench &amp; PepTron for disordered proteins</strong><br><a href="https://x.com/NVIDIAHealth/status/1993386709153636643">https://x.com/NVIDIAHealth/status/1993386709153636643</a><br>GitHub: <a href="https://github.com/peptoneltd/peptonebench">https://github.com/peptoneltd/peptonebench</a></p></li><li><p><strong>Gemini 3 Pro 93% GPQA Diamond via organic chemistry surge</strong><br><a href="https://x.com/EpochAIResearch/status/1993363375108333616">https://x.com/EpochAIResearch/status/1993363375108333616</a></p></li><li><p><strong>Agent0: Zero-data self-evolving agents +24% on reasoning</strong><br><a href="https://x.com/rryssf_/status/1992889473911378039">https://x.com/rryssf_/status/1992889473911378039</a><br>Paper: <a href="https://arxiv.org/abs/2511.14460">https://arxiv.org/abs/2511.14460</a></p></li><li><p><strong>Agentic Reviewer matches human ICLR feedback correlation 0.42</strong><br><a href="https://x.com/AndrewYNg/status/1993001922773893273">https://x.com/AndrewYNg/status/1993001922773893273</a></p></li><li><p><strong>Elon: Grok 5 vs top LoL pros in 2026 with human constraints</strong><br><a href="https://x.com/elonmusk/status/1993208505486979327">https://x.com/elonmusk/status/1993208505486979327</a></p></li><li><p><strong>Claude Opus 4.5 hits 80.9% SWE-Bench, beats GPT-5.1 &amp; Gemini 3 Pro</strong><br><a 
href="https://x.com/rohanpaul_ai/status/1993046494904217661">https://x.com/rohanpaul_ai/status/1993046494904217661</a></p></li><li><p><strong>ImagineArt 1.5 (indie) climbs to global #3 image gen</strong><br><a href="https://x.com/techbymarkandey/status/1992951100438368588">https://x.com/techbymarkandey/status/1992951100438368588</a></p></li><li><p><strong>Iterative Refinement: 7M-param RNN beats 671B LLMs on ARC-AGI</strong><br><a href="https://x.com/burkov/status/1992679461485994144">https://x.com/burkov/status/1992679461485994144</a></p></li><li><p><strong>CritPt physics benchmark: Gemini 3 Pro only 9.1% without tools</strong><br><a href="https://x.com/MinyangTian1/status/1991913292004995217">https://x.com/MinyangTian1/status/1991913292004995217</a><br>Paper: <a href="https://arxiv.org/abs/2509.26574">https://arxiv.org/abs/2509.26574</a> </p><p>GitHub: <a href="https://github.com/CritPt-Benchmark/CritPt">https://github.com/CritPt-Benchmark/CritPt</a></p></li></ol><div><hr></div><h2>What We&#8217;re Hacking This Week</h2><p>At Recall, we&#8217;re always pushing AI development forward internally and externally. 
Here are the most interesting things that happened this week.</p><p><strong>We built Atlas, an internal AI Chief of Staff, to summarize everything and keep our team in sync.</strong></p><ul><li><p>Automated daily summaries from GitHub, Linear, Notion, Slack, etc.</p></li><li><p>Extracts structured requirements from notes and meetings</p></li><li><p>Partnership research with automated synergy analysis</p></li><li><p>Real-time web search via Perplexity integration</p></li><li><p>Automated KPI dashboard and metrics tracking</p></li></ul><p><strong>Our engineering team automated code reviews with AI agents</strong></p><ul><li><p>Multiple AI personas cross-checking each other&#8217;s work</p></li><li><p>Automated code generation through a pipeline system</p></li><li><p>Autonomously reviews every PR</p></li><li><p>Successfully identified and fixed the first production bug autonomously</p></li></ul><p><strong>We built a portfolio of trading agents with multiple strategies to simulate trading for our upcoming arena</strong></p><ul><li><p>Crypto trading agents on the Aerodrome DEX</p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.recall.network/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading In The Arena! Keep reading more on AI models and their skills.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item></channel></rss>