<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[In The Arena by Recall]]></title><description><![CDATA[In the Arena is your weekly guide to AI performance — covering benchmarks, evals, and research across every model and skill.]]></description><link>https://newsletter.recall.network</link><image><url>https://substackcdn.com/image/fetch/$s_!B6PI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a7514eb-0050-475c-8319-370c257ec71a_400x400.png</url><title>In The Arena by Recall</title><link>https://newsletter.recall.network</link></image><generator>Substack</generator><lastBuildDate>Wed, 13 May 2026 21:49:29 GMT</lastBuildDate><atom:link href="https://newsletter.recall.network/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Recall Foundation]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[recallnet@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[recallnet@substack.com]]></itunes:email><itunes:name><![CDATA[Recall]]></itunes:name></itunes:owner><itunes:author><![CDATA[Recall]]></itunes:author><googleplay:owner><![CDATA[recallnet@substack.com]]></googleplay:owner><googleplay:email><![CDATA[recallnet@substack.com]]></googleplay:email><googleplay:author><![CDATA[Recall]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[In the Arena: Week 5]]></title><description><![CDATA[DeepSeek matches GPT-5 for pennies. Open-source closes the gap. Benchmarks get exposed. 
The smartest model doesn't win anymore&#8212;the most reliable system does.]]></description><link>https://newsletter.recall.network/p/in-the-arena-week-5</link><guid isPermaLink="false">https://newsletter.recall.network/p/in-the-arena-week-5</guid><dc:creator><![CDATA[Sanket]]></dc:creator><pubDate>Wed, 07 Jan 2026 15:11:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8DC-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8DC-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8DC-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!8DC-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!8DC-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!8DC-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!8DC-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png" width="1456" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1632157,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/183770683?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8DC-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!8DC-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!8DC-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!8DC-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e53db4e-862a-481c-8351-4cc95e6dc502_2000x1053.png 
1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>Less data, smarter trades?</h2><p>We expected more information to mean better LLM trading decisions. We were wrong.</p><p>When we stripped four frontier models down to nothing but a price chart &#8211; no order books, no volume data, no technical indicators &#8211; and compared their $ETH trading performance against counterparts drowning in market intelligence, the chart-readers won. Gemini 3 Pro Chart-Only came out on top, returning +9.09% over a few trading days, nearly double the returns of its data-rich twin. 
Across all models tested, agents trading on pixels alone claimed three of the top four spots. </p><p><strong>This wasn&#8217;t supposed to happen.</strong> Conventional wisdom says more signal means better decisions. But when we dug into their trading behaviours, a pattern emerged. Chart-only vision models weren&#8217;t just profitable&#8212;they were <em>disciplined</em>. They made fewer trades, extracted more alpha per position, and avoided the churn that burned through the returns of their data-rich counterparts. The models with complete market feeds kept second-guessing themselves, reversing positions, and reacting to noise that the vision agents couldn&#8217;t even see.</p><p>However, here&#8217;s where it gets interesting&#8230; the data-rich models outperformed on risk-adjusted returns. Notably, DeepSeek's data-fed variant posted a Sortino ratio of 2.53, nearly twice that of the best vision performer. So what's actually happening? Are chart-only models fundamentally blind to risk? Or are data-rich ones so paralysed by information that they hedge themselves out of their best trades, even as they cut their losers early? One week of data can't answer that. Market conditions shift, luck plays a role, and a single run proves nothing.</p><p>So this week we're running the same arena again. Same models. Same asset. Same setup. 
This week's arena will test whether vision-only trading is a genuine edge or a statistical ghost.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://app.recall.network/competitions/0013118d-d65e-45d2-a8ab-5e6f90910a69&quot;,&quot;text&quot;:&quot;Follow The Results&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://app.recall.network/competitions/0013118d-d65e-45d2-a8ab-5e6f90910a69"><span>Follow The Results</span></a></p><div><hr></div><h2><strong>AI Breakthroughs</strong></h2><p><strong>DeepSeek keeps cooking</strong></p><p>DeepSeek dropped a paper on Manifold-Constrained Hyper-Connections that basically says, &#8220;what if we just... made residual paths not explode at scale?&#8221; </p><p>DeepSeek V3.2 is now performing on par with GPT-5 across most standard tests, trailing only slightly on extremely long context retrieval. The Speciale variant secured gold-medal scores on IMO 2025 and IOI 2025. Now, an open-weights model trained for a fraction of the cost is beating the best closed models on math olympiad problems. </p><p><strong>The open-source situation</strong></p><p>Zuckerberg announced Scout and Maverick in April 2025. The experimental version scored 1417 Elo on LMArena. The vanilla release? Ranked 32nd. Llama 4 Behemoth is still training and supposedly beats GPT-4.5 and Gemini 2.0 Pro on STEM benchmarks. We&#8217;ll see.</p><p>Qwen 3 from Alibaba is now powering 90,000+ enterprises on Alibaba Cloud. GLM-4.7 Thinking from ZAI hit 89% on LiveCodeBench&#8212;matching GPT-5.2. Open models have genuinely closed the gap on coding tasks. The moat around proprietary models is getting thinner every quarter.</p><p><strong>Benchmark theatre continues</strong></p><p>IQuest dropped a 40B model claiming 81.4% on SWE-Bench Verified. SOTA! Revolutionary! Then people actually used it and discovered it loops endlessly and might not even beat a well-tuned 20B. 
Guess it optimized for the benchmark. Smh.</p><p>Meanwhile, a researcher published reproducible evidence that LLM-as-judge systems have systematic bias. If you&#8217;re using LLM-as-judge for evals, your leaderboard might just be measuring which model the judge likes most.</p><p><strong>Anthropic&#8217;s quiet efficiency flex</strong></p><p>Claude&#8217;s new &#8220;effort parameter&#8221; lets developers control how much computational work the model applies to each task. At medium effort, Opus 4.5 matches Sonnet 4.5&#8217;s best SWE-bench score while using 76% fewer tokens. </p><p><strong>Google&#8217;s multimodal dominance</strong></p><p>Gemini 3 Pro remains the ceiling for multimodal understanding. 81% on MMMU-Pro. 87.6% on Video-MMMU. 72.7% on ScreenSpot-Pro, where Claude 3.5 Sonnet got 36.2% and GPT-5.1 just 3.5%. If your workflow involves complex visual inputs, Google is still king. (We also saw Gemini 3 Pro Vision outperform in our trading arena.)</p><p><strong>The real signal this week</strong></p><ul><li><p>Agent Harness design is becoming the actual bottleneck, not model capability: logging decisions, grading outcomes, capturing real-world feedback, managing skill libraries, and maintaining memory modules. The model is the engine. The harness is everything else that determines whether the car actually drives.</p></li><li><p>Prime Intellect&#8217;s Recursive Language Models are pointing in the same direction&#8212;training models to manage their own context by delegating work to sub-models and tools rather than just cramming more tokens into the window. </p></li></ul><p>2026 is shaping up to be the year everyone realizes verification infrastructure matters more than the next 10% benchmark improvement. Exciting times if you&#8217;re building for reliability. 
Exhausting times if you&#8217;re still chasing leaderboard headlines.</p><div><hr></div><h2><strong>AI Research Rabbit Holes</strong></h2><p><strong>GLM-4.7 emerges as frontier coding model, topping unbenchmaxxable benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ctodotnew/status/2008337079994839311&quot;,&quot;full_text&quot;:&quot;glm-4.7 is officially a frontier coding model\n\nthe unbenchmaxxable benchmark does not lie &quot;,&quot;username&quot;:&quot;ctodotnew&quot;,&quot;name&quot;:&quot;cto.new&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1978145129786384384/Dm_-Gngo_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T00:37:27.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G98H96kbcAEgpa4.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/L8tv2ecF06&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:32,&quot;retweet_count&quot;:38,&quot;like_count&quot;:535,&quot;impression_count&quot;:38988,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://cto.new/bench">https://cto.new/bench</a></p><p><strong>NVIDIA DGX Spark updates deliver 2.6x performance boost with NVFP4, enabling local 235B-parameter models</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/NVIDIAAIDev/status/2008335204167901562&quot;,&quot;full_text&quot;:&quot;NVIDIA DGX Spark keeps getting faster. 
&#10024;&#9889; \n\nNew software, models, and open-source optimizations deliver up to 2.6&#215; higher performance with NVFP4, run 235B-parameter models locally, and allow multiple workloads to run simultaneously on the desktop.\n\nRead the technical blog &#10145;&#65039; &quot;,&quot;username&quot;:&quot;NVIDIAAIDev&quot;,&quot;name&quot;:&quot;NVIDIA AI Developer&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1836133629694742531/verSRYr8_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T00:30:00.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G977XmsWAAA9KMb.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/aXpa93x2NE&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:6,&quot;retweet_count&quot;:24,&quot;like_count&quot;:170,&quot;impression_count&quot;:8610,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Miro Thinker 1.5 released: Open-source research agent with interleaved thinking and multi-step analysis</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/xianbao_qian/status/2008133563284406532&quot;,&quot;full_text&quot;:&quot;<span class=\&quot;tweet-fake-link\&quot;>@miromind_ai</span> Miro Thinker 1.5 is out\n\n- Post-trained on top of qwen3\n- Available in both 30A3B and 235A22B\n- Claimed to have great result on BrowserComp\n- Technical report coming soon\n- MiT license\n\nOfficial demo: <a class=\&quot;tweet-url\&quot; href=\&quot;https://dr.miromind.ai/\&quot;>dr.miromind.ai</a>\nHF link: <a class=\&quot;tweet-url\&quot; href=\&quot;https://huggingface.co/collections/miromind-ai/mirothinker-v15\&quot;>huggingface.co/collections/mi&#8230;</a>&quot;,&quot;username&quot;:&quot;Xianbao_QIAN&quot;,&quot;name&quot;:&quot;Tiezhen 
WANG&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1608827777175912449/3vaCvfND_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-05T11:08:45.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G95Rm8vasAAXV-v.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/ZgU6QzNExe&quot;}],&quot;quoted_tweet&quot;:{&quot;full_text&quot;:&quot;@miromind_ai released their open source latest model on research agent.\n\n- interleaved thinking with multi-step analysis, powered by RL\n- 256k context window\n- handles up to 600 tool calls per task\n- comes in 8B, 30B, and 72B for different compute budgets\n- MiT license\n\nModel&quot;,&quot;username&quot;:&quot;Xianbao_QIAN&quot;,&quot;name&quot;:&quot;Tiezhen WANG&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1608827777175912449/3vaCvfND_normal.jpg&quot;},&quot;reply_count&quot;:9,&quot;retweet_count&quot;:35,&quot;like_count&quot;:270,&quot;impression_count&quot;:42269,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://huggingface.co/collections/miromind-ai/mirothinker-v15">https://huggingface.co/collections/miromind-ai/mirothinker-v15</a></p><p><strong>Falcon H1R-7B launched: Mamba-Transformer hybrid outperforming larger models in math and coding</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/mervenoyann/status/2008140906814468442&quot;,&quot;full_text&quot;:&quot;TII just dropped Falcon H1R-7B &#128588;&#127995; \n\na new reasoning model outperforming others in math and coding with only 7B params with 256k context window &#128293;\n\nit's a mamba-transformers hybrid model so more efficient per throughput and memory too &#129321; 
&quot;,&quot;username&quot;:&quot;mervenoyann&quot;,&quot;name&quot;:&quot;merve&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1982142240571920384/hla62KCQ_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-05T11:37:56.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G95V543WoAEzerY.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/wgazrhqibE&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:26,&quot;retweet_count&quot;:62,&quot;like_count&quot;:431,&quot;impression_count&quot;:36837,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://huggingface.co/blog/tiiuae/falcon-h1r-7b">https://huggingface.co/blog/tiiuae/falcon-h1r-7b</a></p><p><strong>Agent Harness introduced: Infrastructure for managing long-running AI agent tasks and evaluations</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/_philschmid/status/2008175408923959574&quot;,&quot;full_text&quot;:&quot;If 2025 was beginning of agents, 2026 will be around Agent Harnesses. An Agent Harness is the infrastructure that wraps around an AI model to manage long-running tasks. It is not the agent itself. \n\nIt operates at a higher level than agent frameworks. 
The harness provides prompt &quot;,&quot;username&quot;:&quot;_philschmid&quot;,&quot;name&quot;:&quot;Philipp Schmid&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1831321531852496896/1yBZG884_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-05T13:55:02.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G95439hXIAAfWa6.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/7PkJf2qfqF&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:79,&quot;retweet_count&quot;:142,&quot;like_count&quot;:958,&quot;impression_count&quot;:129434,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://www.philschmid.de/agent-harness-2026">https://www.philschmid.de/agent-harness-2026</a></p><p><strong>AntAngelMed released: Efficient 6.1B MoE medical LLM with 128K context and strong benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/AdinaYakup/status/2008184087484211374&quot;,&quot;full_text&quot;:&quot;AntAngelMed &#127973; open medical LLM from Ant group <span class=\&quot;tweet-fake-link\&quot;>@ant_oss</span> \n\n<a class=\&quot;tweet-url\&quot; href=\&quot;https://huggingface.co/MedAIBase/AntAngelMed\&quot;>huggingface.co/MedAIBase/AntA&#8230;</a>\n&#10024; Efficient MoE: 6.1B active params &#8776; 40B dense model performance\n&#10024; 128K long context \n&#10024; Chinese/English support\n&#10024; Three-stage medical training for strong reasoning, safety, and clinical &quot;,&quot;username&quot;:&quot;AdinaYakup&quot;,&quot;name&quot;:&quot;Adina 
Yakup&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1974110789146718209/j2fGFMxT_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-05T14:29:31.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G96AVQXWsAAScYh.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/z2UrbmJDle&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:8,&quot;retweet_count&quot;:19,&quot;like_count&quot;:161,&quot;impression_count&quot;:6714,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://huggingface.co/MedAIBase/AntAngelMed">https://huggingface.co/MedAIBase/AntAngelMed</a></p><p><strong>InternVLA-A1 open-sourced: Unified VLA model excelling in dynamic environments and benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ModelScope2022/status/2008137224575992238&quot;,&quot;full_text&quot;:&quot;&#129302; Introducing InternVLA-A1 &#8212; now fully open-sourced!\n\nMany VLA models follow instructions well in static scenes&#8230; but struggle in dynamic environments (conveyor belts, rotating platforms, multi-robot setups). Why? 
They see the present&#8212;but can&#8217;t imagine the future.\n\nInternVLA-A1 &quot;,&quot;username&quot;:&quot;ModelScope2022&quot;,&quot;name&quot;:&quot;ModelScope&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1954838522696982528/_vUSgyLj_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-05T11:23:18.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/egec42elhzo9k6dnl4ow&quot;,&quot;link_url&quot;:&quot;https://t.co/xglQ4BXUTJ&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:13,&quot;retweet_count&quot;:86,&quot;like_count&quot;:539,&quot;impression_count&quot;:34051,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2008136943092346880/vid/avc1/1280x720/Xaaovv39217K6Ygl.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://modelscope.cn/models/internrobotics/InternVLA-A1">https://modelscope.cn/models/internrobotics/InternVLA-A1</a></p><p><strong>LFM2.5 family released: Tiny on-device models with improved quality and multimodal support</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/liquidai/status/2008385292244242942&quot;,&quot;full_text&quot;:&quot;Today, we release LFM2.5, our most capable family of tiny on-device foundation models.\n\nIt&#8217;s built to power reliable on-device agentic applications: higher quality, lower latency, and broader modality support in the ~1B parameter class.\n\n&amp;gt; LFM2.5 builds on our LFM2 &quot;,&quot;username&quot;:&quot;liquidai&quot;,&quot;name&quot;:&quot;Liquid 
AI&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1702611993327251456/PUUuOA4W_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T03:49:02.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G98lWcrWgAAFo5j.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/9BjcZPktS6&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:57,&quot;retweet_count&quot;:199,&quot;like_count&quot;:1259,&quot;impression_count&quot;:130827,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://www.liquid.ai/blog/introducing-lfm2-5-the-next-generation-of-on-device-ai">https://www.liquid.ai/blog/introducing-lfm2-5-the-next-generation-of-on-device-ai</a></p><p><a href="https://huggingface.co/collections/LiquidAI/lfm25">https://huggingface.co/collections/LiquidAI/lfm25</a></p><p><strong>ThinkGen: Game-theoretic controller for diffusion models improving generation benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/gm8xx8/status/2008401322387673591&quot;,&quot;full_text&quot;:&quot;ThinkGen: Generalized Thinking for Visual Generation\n\nThinkGen: &#8220;CoT as a controller&#8221; for diffusion image gen/edit. 
The contribution isn&#8217;t images per se, but separable RL that learns the planner&#8211;renderer interface.\n\nDecoupled stack (system, not a single unified model):\n- MLLM &quot;,&quot;username&quot;:&quot;gm8xx8&quot;,&quot;name&quot;:&quot;&#120464;&#120106;&#120830;&#120481;&#120481;&#120830;&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1723513473294835712/pvPLgqp3_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T04:52:44.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G99GW7mXcAAiq0o.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/WMhDr9aFDX&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:4,&quot;retweet_count&quot;:22,&quot;like_count&quot;:83,&quot;impression_count&quot;:3621,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://github.com/jiaosiyuu/ThinkGen">https://github.com/jiaosiyuu/ThinkGen</a></p><p><a href="https://arxiv.org/abs/2512.23568">https://arxiv.org/abs/2512.23568</a></p><p><strong>Mi:dm K 2.5 Pro launched: Proprietary reasoning model strong in agentic tool-use benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ArtificialAnlys/status/2008415401890271446&quot;,&quot;full_text&quot;:&quot;Korea Telecom has launched Mi:dm K 2.5 Pro, a proprietary reasoning model that scores 48 on the Artificial Analysis Intelligence Index\n\nKey benchmarking takeaways:\n\n&#10148; Strength in Agentic Tool Use: Mi:dm K 2.5 Pro scores 87% on &#964;&#178;-Bench Telecom, demonstrating strong performance &quot;,&quot;username&quot;:&quot;ArtificialAnlys&quot;,&quot;name&quot;:&quot;Artificial 
Analysis&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1810946341511766016/3mg9KIaQ_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T05:48:41.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G99SmhnWcAASU8o.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/jMIa2qlvu1&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:5,&quot;retweet_count&quot;:5,&quot;like_count&quot;:132,&quot;impression_count&quot;:11289,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://artificialanalysis.ai/">https://artificialanalysis.ai/</a></p><p><strong>Opus 4.5 leads re-vamped LiveBench leaderboard reflecting real-world LLM performance</strong> </p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/bindureddy/status/2007938526453928019&quot;,&quot;full_text&quot;:&quot;Opus 4.5 Tops The Re-Vamped LiveBench Leaderboard, Which Reflects Real World LLM Performance\n\nOver the holidays, we re-vamped the LiveBench benchmark to prevent gaming.\n\nOpus 4.5 tops the new benchmark with Codex and Gemini 3 hot on its heels. 
Kimi K2 tops the open-weight models, &quot;,&quot;username&quot;:&quot;bindureddy&quot;,&quot;name&quot;:&quot;Bindu Reddy&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1443737943684763651/32WHA-kg_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-04T22:13:45.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G92hAFXWEAAZLnZ.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/22an4TUHgd&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:35,&quot;retweet_count&quot;:29,&quot;like_count&quot;:336,&quot;impression_count&quot;:41264,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>SWE-EVO benchmark reveals gaps in long-horizon coding agent performance</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/omarsar0/status/2007825862721232956&quot;,&quot;full_text&quot;:&quot;Benchmarking Long-Horizon Coding Agents\n\nAI coding agents look impressive on current coding benchmarks. 
But those benchmarks often optimize and test for the wrong thing.\n\nThis new research introduces SWE-EVO, a benchmark for long-horizon software evolution.\n\nUp to 80% of software &quot;,&quot;username&quot;:&quot;omarsar0&quot;,&quot;name&quot;:&quot;elvis&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/939313677647282181/vZjFWtAn_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-04T14:46:03.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G906-snWkAADIYx.png&quot;,&quot;link_url&quot;:&quot;https://t.co/e5HZQPcLEc&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:19,&quot;retweet_count&quot;:37,&quot;like_count&quot;:196,&quot;impression_count&quot;:38750,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://arxiv.org/abs/2512.16553">https://arxiv.org/abs/2512.16553</a></p><p><strong>TeleChat3-105B-A4.7B-Thinking open-sourced: Sparse MoE leading in coding benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ModelScope2022/status/2008499004158411187&quot;,&quot;full_text&quot;:&quot;&#128640; TeleChat3-105B-A4.7B-Thinking is now open source!\n\nA 105B sparse MoE model with fine-grained routing:\n- 192 experts, only 4 activated per token (4.7B active params)\n- Trained end-to-end on domestic compute\n- Strong across code, math, agents, writing &#8212; check HumanEval-X (92.7%) 
&quot;,&quot;username&quot;:&quot;ModelScope2022&quot;,&quot;name&quot;:&quot;ModelScope&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1954838522696982528/_vUSgyLj_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T11:20:53.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9-exrabcAQpEub.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/npriZvYBWj&quot;},{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/v6z0krj8ahtzt1v8q8wf&quot;,&quot;link_url&quot;:&quot;https://t.co/npriZvYBWj&quot;},{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/vjoscjthamc5axlz2vjx&quot;,&quot;link_url&quot;:&quot;https://t.co/npriZvYBWj&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:10,&quot;retweet_count&quot;:38,&quot;like_count&quot;:234,&quot;impression_count&quot;:12981,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2008498765309919232/vid/avc1/1410x720/WYLt14otfqToRKFt.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://modelscope.cn/models/TeleChat/TeleChat3-105B-A4.7B-Thinking">https://modelscope.cn/models/TeleChat/TeleChat3-105B-A4.7B-Thinking</a></p><p><strong>NousCoder-14b: Competitive olympiad programming model, scores Pass@1 accuracy of 67.87%, +7.08% over Qwen's baseline</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/NousResearch/status/2008624560950657490?s=20&quot;,&quot;full_text&quot;:&quot;&quot;,&quot;username&quot;:&quot;NousResearch&quot;,&quot;name&quot;:&quot;Nous 
Research&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1816254738234761216/TX7TW-Mp_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T19:39:48.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:2,&quot;like_count&quot;:43,&quot;impression_count&quot;:4072,&quot;expanded_url&quot;:{&quot;url&quot;:&quot;http://www.nousresearch.com/nouscoder-14b&quot;,&quot;title&quot;:&quot;Introducing NousCoder-14b&quot;,&quot;description&quot;:&quot;Nous Research introduces NousCoder-14B, a competitive olympiad programming model post-trained on Qwen3-14B via reinforcement learning.The full stack is released publicly: model weights, open RL environment + eval harness, and @wandb logs; we also document the pipelined verification setup and parallelization experiments so others can reproduce the training stack.&quot;,&quot;domain&quot;:&quot;nousresearch.com&quot;,&quot;image&quot;:&quot;https://pbs.substack.com/news_img/2008624396659802112/25M7BVU7?format=jpg&amp;name=orig&quot;},&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><a href="https://nousresearch.com/nouscoder-14b-a-competitive-olympiad-programming-model/">https://nousresearch.com/nouscoder-14b-a-competitive-olympiad-programming-model/</a></p><p><strong>AI Agent Systems: Comprehensive survey on architectures, applications, and evaluation methods</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/dair_ai/status/2008600721554632830&quot;,&quot;full_text&quot;:&quot;AI Agent Systems: Architectures, Applications, and Evaluation. 
&quot;,&quot;username&quot;:&quot;dair_ai&quot;,&quot;name&quot;:&quot;DAIR.AI&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1643277398522187778/31dedbLo_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-06T18:05:04.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9_7thybcAM3pb8.png&quot;,&quot;link_url&quot;:&quot;https://t.co/JEtqSmB3RA&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:11,&quot;retweet_count&quot;:70,&quot;like_count&quot;:332,&quot;impression_count&quot;:20493,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><a href="https://arxiv.org/abs/2601.01743v1">https://arxiv.org/abs/2601.01743v1</a></p><p><strong>Elon Musk claims Grok 5 nears a perfect score on Humanity&#8217;s Last Exam benchmark</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/chatgpt21/status/2008776134591475944&quot;,&quot;full_text&quot;:&quot;Elon says Grok 5 will have a perfect score or a &#8220;near perfect&#8221; score on Humanity&#8217;s Last Exam &quot;,&quot;username&quot;:&quot;chatgpt21&quot;,&quot;name&quot;:&quot;Chris&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2001137569275256832/bO_iEVnF_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-07T05:42:06.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/ba6maiukeatqjlnzizwo&quot;,&quot;link_url&quot;:&quot;https://t.co/qYsW6jjb6E&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:6,&quot;retweet_count&quot;:10,&quot;like_count&quot;:102,&quot;impression_count&quot;:4262,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2008776069734707200/vid/avc1/1072x720/5k8qC0esiMuXg3RX.mp4&quot;,&quot;belowTheFold&quot;:true}" 
data-component-name="Twitter2ToDOM"></div><div><hr></div><h2><strong>What We&#8217;re Hacking This Week</strong></h2><p>You can&#8217;t improve what you can&#8217;t repeatedly measure. Live trading arenas generate insights, but market conditions never repeat, so isolating why an agent won or lost becomes guesswork. We&#8217;ve been building infrastructure these past few weeks to fix that.</p><p><strong>Replay Lab</strong></p><ul><li><p>Microservice for generating labelled training data for models that predict financial market events</p></li><li><p>Hybrid storage: eager storage for raw OHLCV candle data, lazy computation for indicators and annotations</p></li><li><p>Ground truth annotations for agent scoring: dump events, pump events, local extremes, sweep reclaims</p></li><li><p>On-demand chart generation with technical indicators and annotation overlays</p></li><li><p>Agents can self-score against historical events instead of waiting for live validation</p></li></ul><p><strong>Backtest Lab</strong></p><ul><li><p>Download historical market data (via Replay Lab) and backtest strategies with realistic fees and slippage</p></li><li><p>Scan all strategies against a dataset to find top performers, then rank and compare them across multiple backtests</p></li><li><p>Architecture exposes typed inputs/outputs for MCP servers, REST APIs, or direct programmatic use</p></li></ul><p>Together, they form a single pipeline: Replay Lab generates the labelled market data and ground truth, then Backtest Lab runs agent strategies against it systematically.
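</p><p>The backtest step can be sketched as a minimal loop. Everything below is an illustrative assumption (the candle shape, the fee and slippage figures, the function names); it is not the real Backtest Lab API, just the shape of the idea:</p>

```python
# Hypothetical sketch of the backtest step described above.
# Fee and slippage figures are illustrative assumptions, not Backtest Lab's.

FEE_RATE = 0.003   # assumed 0.3% fee per fill
SLIPPAGE = 0.001   # assumed 0.1% adverse price movement per fill

def backtest(candles, strategy, cash=10_000.0):
    """Replay labelled OHLCV candles through a strategy,
    charging fees and slippage on every fill."""
    position = 0.0
    for candle in candles:
        signal = strategy(candle)  # "buy", "sell", or "hold"
        if signal == "buy" and position == 0.0:
            fill = candle["close"] * (1 + SLIPPAGE)  # pay up through the spread
            position = cash * (1 - FEE_RATE) / fill
            cash = 0.0
        elif signal == "sell" and position > 0.0:
            fill = candle["close"] * (1 - SLIPPAGE)
            cash = position * fill * (1 - FEE_RATE)
            position = 0.0
    # mark any open position to the final close
    return cash + position * candles[-1]["close"]

# Toy run: buy the first candle, sell the last one.
candles = [{"close": 100.0}, {"close": 102.0}, {"close": 105.0}]
signals = iter(["buy", "hold", "sell"])
final_equity = backtest(candles, lambda c: next(signals))
```

<p>Charging costs on every fill is the part that matters: strategies that look profitable on raw closes often go flat or negative once realistic execution is applied, which is exactly the failure mode a live arena can&#8217;t isolate.</p><p>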
Stress-test against known edge cases, isolate failure modes, iterate, and validate to train the best trading agents.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.recall.network/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading In The Arena by Recall! </p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[In the Arena: Week 4]]></title><description><![CDATA[Anthropic and OpenAI shipped major benchmarking frameworks, open models are closing the gap on coding evals, and we're testing whether AI can trade profitably by reading a chart alone.]]></description><link>https://newsletter.recall.network/p/in-the-arena-week-4</link><guid isPermaLink="false">https://newsletter.recall.network/p/in-the-arena-week-4</guid><dc:creator><![CDATA[Sanket]]></dc:creator><pubDate>Fri, 02 Jan 2026 15:48:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RCtZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!RCtZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RCtZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!RCtZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!RCtZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!RCtZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RCtZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png" width="1456" height="767" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1631728,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/182945096?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RCtZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!RCtZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!RCtZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!RCtZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd63b74-93c3-40ed-9a14-fae706e12c62_2000x1053.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>After running fourteen LLM instances through a single-asset trading competition with real capital, we reached an uncomfortable conclusion: frontier models aren't ready to trade your money unsupervised. </p><p>But buried in the data was something interesting: some agents showed genuine crash detection skill. The models couldn't capture opportunity, but they could recognise danger. That finding shifted our focus: if LLMs possess narrow skills that surface under specific conditions, what other capabilities might we be missing by testing them as general-purpose traders? </p><p>We decided to explore granular skills, starting with something fundamental&#8212;chart reading. 
Traders have stared at candlesticks for decades, developing intuitions about momentum and reversal patterns, so we stripped four frontier models down to just the ETH chart and matched them against counterparts with full market data. Both groups are trading live on Aerodrome until January 6th. </p><p>Can models trade profitably by just looking at a chart? Track how vision-only agents stack up against counterparts with complete market data.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://app.recall.network/competitions/39ba2008-cc5d-4a93-8f8e-1eb47c835462&quot;,&quot;text&quot;:&quot;Single-Asset Trading On Aerodrome&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://app.recall.network/competitions/39ba2008-cc5d-4a93-8f8e-1eb47c835462"><span>Single-Asset Trading On Aerodrome</span></a></p><h2><strong>AI Breakthroughs</strong></h2><p><strong><a href="https://www.anthropic.com/research/bloom">ANTHROPIC SHIPPED AN OPEN-SOURCE TOOL TO CATCH MISALIGNED MODELS. FINALLY, RECEIPTS.</a></strong></p><p>Anthropic dropped Bloom this week, an open-source framework for generating automated behavioural evaluations: tests for self-preferential bias, sabotage tendencies, and delusional sycophancy. </p><p>The tool efficiently differentiates between aligned and misaligned models and correlates strongly with human judgment. </p><p><strong><a href="https://openai.com/index/frontierscience/">OPENAI&#8217;S FRONTIERSCIENCE: WHEN BENCHMARKS MEET WET LABS</a></strong></p><p>OpenAI released FrontierScience, a PhD-level scientific reasoning benchmark that does something wild: it validates answers in actual laboratories.</p><p>Model evaluations mean nothing without real-world application testing. You can ace every reasoning benchmark and still be useless in a lab. Or an office. 
Or anywhere that isn&#8217;t a standardised test.</p><p><strong><a href="https://x.com/Zai_org/status/2003156119087382683?s=20">OPEN MODELS ARE CLOSING THE GAP (ON THE EVALS THAT MATTER)</a></strong></p><p>GLM-4.7: 73.8% on SWE-bench Verified. DeepSeek-V3: 73.1%. Claude Sonnet 4.5: 77.2%.</p><p>That gap is shrinking fast. On coding tasks, where evaluation is most concrete and gameable tricks are hardest to hide, open models are now within striking distance. MiniMax M2.1 shipped across six developer surfaces in a week. Claims ~1/10 Opus pricing. </p><div><hr></div><h2><strong>AI Research Rabbitholes</strong></h2><p><strong>Liquid AI releases LFM2-2.6B-Exp, a 3B model outperforming much larger models on instruction, knowledge, and math benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/liquidai/status/2004190178068296181&quot;,&quot;full_text&quot;:&quot;Meet the strongest 3B model on the market.\n\nLFM2-2.6B-Exp is an experimental checkpoint built on LFM2-2.6B using pure reinforcement learning.\n\n&amp;gt; Consistent improvements in instruction following, knowledge, and math benchmarks\n&amp;gt; Outperforms other 3B models in these domains\n&amp;gt; Its &quot;,&quot;username&quot;:&quot;liquidai&quot;,&quot;name&quot;:&quot;Liquid AI&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1702611993327251456/PUUuOA4W_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-25T13:59:09.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9BDxHdW4AA9GeF.png&quot;,&quot;link_url&quot;:&quot;https://t.co/GTFTFcY1BD&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:85,&quot;retweet_count&quot;:237,&quot;like_count&quot;:2145,&quot;impression_count&quot;:644702,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Model Download:</strong> <a 
href="https://huggingface.co/liquidai">https://huggingface.co/liquidai</a> </p><p><strong>New benchmark for web search agents: Needle in the Web shows most systems fail on vague queries</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/rohanpaul_ai/status/2006322945279615243&quot;,&quot;full_text&quot;:&quot;New Tsinghua University paper builds a new test for web search agents and shows they still fail on vague queries.\n\nNeedle in the Web is a new benchmark that exposes how badly search agents handle vague web queries.\n\nIt proposes a test where an agent gets 3 fuzzy clues from a real &quot;,&quot;username&quot;:&quot;rohanpaul_ai&quot;,&quot;name&quot;:&quot;Rohan Paul&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1816185267037859840/Fd18CH0v_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-31T11:14:00.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9daf7_agAAMAqB.png&quot;,&quot;link_url&quot;:&quot;https://t.co/U7gbM0mIu9&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:1,&quot;like_count&quot;:11,&quot;impression_count&quot;:1468,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Paper</strong>: <a href="https://arxiv.org/abs/2512.16553">https://arxiv.org/abs/2512.16553</a></p><p><strong>WeDLM-8B diffusion language model launched with parallel decoding, beating Qwen3-8B on multiple benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/victormustar/status/2005657141529764348&quot;,&quot;full_text&quot;:&quot;WeDLM-8B: a diffusion language model with parallel decoding &#128064;\n\n&#128313;Beats Qwen3-8B-Instruct on 5/6 benchmarks\n&#128313;3-6&#215; faster on math reasoning (vs vLLM Qwen3-8B)\n&#128313;Native KV cache &amp;amp; FlashAttention support\n\n<a 
class=\&quot;tweet-url\&quot; href=\&quot;https://huggingface.co/tencent/WeDLM-8B-Instruct\&quot;>huggingface.co/tencent/WeDLM-&#8230;</a>&quot;,&quot;username&quot;:&quot;victormustar&quot;,&quot;name&quot;:&quot;Victor M&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1099983311101984768/p7dZK4S__normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-29T15:08:20.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:11,&quot;retweet_count&quot;:67,&quot;like_count&quot;:429,&quot;impression_count&quot;:47196,&quot;expanded_url&quot;:{&quot;url&quot;:&quot;https://huggingface.co/tencent/WeDLM-8B-Instruct&quot;,&quot;title&quot;:&quot;tencent/WeDLM-8B-Instruct &#183; Hugging Face&quot;,&quot;description&quot;:&quot;We&#8217;re on a journey to advance and democratize artificial intelligence through open source and open science.&quot;,&quot;domain&quot;:&quot;huggingface.co&quot;,&quot;image&quot;:&quot;https://pbs.substack.com/news_img/2005514893370478592/EaimGFk7?format=jpg&amp;name=orig&quot;},&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Model Download:</strong> <a href="https://huggingface.co/tencent/WeDLM-8B-Instruct">https://huggingface.co/tencent/WeDLM-8B-Instruct</a></p><p><strong>DeepMind&#8217;s parallel verification loops outperform chain-of-thought reasoning by up to 52% on complex benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/aiwithjainam/status/2005629156214853893&quot;,&quot;full_text&quot;:&quot;The results are insane:\n\nOn complex reasoning benchmarks:\n\n- 37% improvement over standard chain-of-thought\n- 52% better at catching logical errors\n- 3x faster convergence to correct solutions\n\nThis isn't incremental. It's architectural. 
&quot;,&quot;username&quot;:&quot;aiwithjainam&quot;,&quot;name&quot;:&quot;Jainam Parmar&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1884237135492079616/mGkpS63i_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-29T13:17:08.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9VtFtka4AET_-o.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/yzNqPEQ4Y0&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:1,&quot;retweet_count&quot;:6,&quot;like_count&quot;:71,&quot;impression_count&quot;:10510,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Boogiebench introduced: New leaderboard for evaluating LLM music composition using Strudel JS</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/status_effects/status/2006092588382613759&quot;,&quot;full_text&quot;:&quot;How well can language models like Claude Opus and GPT-5.2 write music? \n\nIntroducing boogiebench: vote in anonymized LLM music composition battles. 
\n\nUnlike Suno, LLMs haven't been trained explicitly on this task, making it a nice generalization test (coding, aesthetics, temporal &quot;,&quot;username&quot;:&quot;status_effects&quot;,&quot;name&quot;:&quot;Nick Levine&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2002736613667778560/TGGukyuy_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-30T19:58:39.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/qkgjegbsrcdz932lptkp&quot;,&quot;link_url&quot;:&quot;https://t.co/um2BmfGTAU&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:5,&quot;retweet_count&quot;:6,&quot;like_count&quot;:26,&quot;impression_count&quot;:3536,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2006092534141915137/vid/avc1/1280x720/N_xhFI0nHZSmgYgR.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Project:</strong> <a href="https://www.boogiebench.com/">https://www.boogiebench.com/</a> </p><p><strong>GPT-5.2 achieves 29.2% on FrontierMath, a highly challenging math benchmark</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/kimmonismus/status/2006155487637909828&quot;,&quot;full_text&quot;:&quot;GPT-5.2 pro with astonishing 29,2% in frontier math.\n\nngl thats impressive. Especially because this is one of the toughest benchmarks of all. 
&quot;,&quot;username&quot;:&quot;kimmonismus&quot;,&quot;name&quot;:&quot;Chubby&#9832;&#65039;&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1728327996375719936/RW7VBJfD_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-31T00:08:35.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9dLyMPaYAAXUvT.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/uhg06uZfhR&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:14,&quot;retweet_count&quot;:33,&quot;like_count&quot;:356,&quot;impression_count&quot;:23215,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Report:</strong> <a href="https://epoch.ai/frontiermath">https://epoch.ai/frontiermath</a></p><p><strong>SpatialBench launched: New benchmark for AI agents in spatial biology data analysis, revealing low base-model accuracy</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/kenbwork/status/2004573968523743319&quot;,&quot;full_text&quot;:&quot;2026 will be the year of agents in biology. But we need better benchmarks.\n\nWe worked with scientists to turn real world analysis into verifiable problems. 
SpatialBench stratifies frontier models, shows harnesses matter, and reveals distinct failure modes between model families: &quot;,&quot;username&quot;:&quot;kenbwork&quot;,&quot;name&quot;:&quot;Kenny Workman&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1802186788146126848/imWpvC9i_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-26T15:24:11.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9GqSvVaQAEHgFw.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/9XbWyLTitl&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:6,&quot;retweet_count&quot;:21,&quot;like_count&quot;:140,&quot;impression_count&quot;:92345,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Paper:</strong> <a href="https://arxiv.org/abs/spatialbench-paper">https://arxiv.org/abs/spatialbench-paper</a> | <strong>Github:</strong> <a href="https://github.com/latchbio/spatialbench">https://github.com/latchbio/spatialbench</a></p><p><strong>QwenLong-L1.5 open-sourced: 30B MoE model leading in long-context reasoning benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ModelScope2022/status/2003370363590226313&quot;,&quot;full_text&quot;:&quot;&#128640; New on ModelScope: QwenLong-L1.5 is now fully open-source!\n\nA 30B model (3B active params) that matches GPT-5 &amp;amp; Gemini-2.5-Pro in long-context reasoning.\n\n&#128293; Key wins:\n&#9989; +31.7 pts on OpenAI&#8217;s MRCR (128K context &#8594; SOTA across all models)\n&#9989; Matches Gemini-2.5-Pro on 6 major 
&quot;,&quot;username&quot;:&quot;ModelScope2022&quot;,&quot;name&quot;:&quot;ModelScope&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1954838522696982528/_vUSgyLj_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-23T07:41:30.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G81mhMCbIAAmM20.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/eR4AtoFdvd&quot;},{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G81mpWmagAADWOP.png&quot;,&quot;link_url&quot;:&quot;https://t.co/eR4AtoFdvd&quot;},{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G81mqpQaAAALkge.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/eR4AtoFdvd&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:14,&quot;retweet_count&quot;:86,&quot;like_count&quot;:610,&quot;impression_count&quot;:42588,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Paper:</strong> <a href="https://arxiv.org/abs/2512.13313">https://arxiv.org/abs/2512.13313</a> | <strong>Model:</strong> <a href="https://modelscope.cn/models/qwen/QwenLong-L1.5">https://modelscope.cn/models/qwen/QwenLong-L1.5</a></p><p><strong>BENTO benchmark released for evaluating classical and AI docking tools in drug design</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/BiologyAIDaily/status/2006286956909809761&quot;,&quot;full_text&quot;:&quot;BENTO: Benchmarking Classical and AI Docking on Drug Design&#8211;Relevant Data\n\n1. A new benchmark study, BENTO, evaluates 11 tools for predicting protein-ligand interactions, including classical docking methods, deep learning models, and co-folding algorithms. 
It highlights the &quot;,&quot;username&quot;:&quot;BiologyAIDaily&quot;,&quot;name&quot;:&quot;Biology+AI Daily&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1826434466648133632/KAWVyOFb_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-31T08:51:00.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9fDWTqakAA-FBd.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/JFlLi85uDY&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:1,&quot;retweet_count&quot;:4,&quot;like_count&quot;:32,&quot;impression_count&quot;:1115,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Paper:</strong> <a href="https://arxiv.org/abs/bento-paper">https://arxiv.org/abs/bento-paper</a></p><p><strong>MiniMax M2.1 matches frontier performance at low cost, highlighted in coding and reliability benchmarks</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/Kelsey_Asami/status/2006323944891224472&quot;,&quot;full_text&quot;:&quot;I just watched Theo&#8217;s live breakdown of MiniMax M2.1, and it&#8217;s genuinely impressive.\n\nThis isn&#8217;t just another model release, <span class=\&quot;tweet-fake-link\&quot;>@MiniMax__AI</span> M2.1 matches Claude Opus 4.1 level performance at just 1/60th of the price, redefining what &#8220;competitive&#8221; even means.\n\nWith strong reliability 
&quot;,&quot;username&quot;:&quot;Kelsey_Asami&quot;,&quot;name&quot;:&quot;Kelsey&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1971240411021651968/QfXBd5Wj_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-31T11:17:58.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/jxo0evf0se7kr5kvygoj&quot;,&quot;link_url&quot;:&quot;https://t.co/Y6D5o86k5o&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:12,&quot;retweet_count&quot;:14,&quot;like_count&quot;:36,&quot;impression_count&quot;:6903,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2006323894249230336/vid/avc1/1470x720/t_VQHZrlrtj3R4Mm.mp4?tag=14&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><h2><strong>What We&#8217;re Hacking This Week</strong></h2><p><strong>REPLAY LAB NOW GENERATES CHART IMAGES. THIS MATTERS MORE THAN IT SOUNDS.</strong></p><p>Replay Lab can now generate candlestick charts for any asset, any timeframe, any time window.</p><p>You can overlay indicators. You can overlay annotations. Request &#8220;dump events&#8221; with the exact window, threshold, candle width, and volume adjustment you want. Define an event, visualise it, score agents against it. Repeat.</p><p><strong>WE&#8217;RE TEACHING AI TO READ CHARTS LIKE TRADERS. IT&#8217;S GOING ABOUT AS WELL AS YOU&#8217;D EXPECT.</strong></p><p>New experiment launched on December 30th: Can LLMs analyse chart images the way human traders do?</p><p>We&#8217;re running a baseline comparison alongside, using instances of the same models with the same capital, but with full market data. By January 6th, we&#8217;ll know if vision-based trading is a real capability or just vibes with extra steps.</p><p><strong>30 MODELS CALLING BITCOIN BOTTOMS. SCORED AUTOMATICALLY. 
RUNNING LIVE.</strong></p><p>We pushed the benchmark harness to run ~30 models on a simple task: on historical BTC windows, call whether a local bottom has been seen recently. Every 15 minutes, it pulls Replay Lab state, runs the bottom callers, and updates current predictions, per-model track records, and performance by horizon.</p><p>Ground truth comes straight from the annotations API. The scoring loop is end-to-end and consistent. </p><p><strong>AGENT TEMPLATES: THE RAMP TO BENCHMARKS</strong></p><p>We kept building out agent templates that show the progression we want for Replay Lab-based agents:</p><ol><li><p>Guess Wheel of Fortune</p></li><li><p>Create Puzzle and Guess</p></li><li><p>Multi-round Guessing with Scoring</p></li><li><p>LLM Predict BTC Dumps on Replay Lab</p></li><li><p>Matrix of LLMs Predict Dumps and Are Scored</p></li><li><p>Matrix does 3-Leg EV Prediction</p></li><li><p>30-Agent BTC Bottom Prediction Benchmark</p></li></ol><p>Early templates are intentionally simple so people can understand the loop. Later templates call Replay Lab and turn into a benchmark harness where you can run multiple models and score them in one run.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.recall.network/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading In The Arena by Recall! 
</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[In the Arena: Week 3]]></title><description><![CDATA[OpenAI's "most capable model ever" costs 40% more and thinks worse than 5.1. Meanwhile, we got tired of waiting for markets so we created a faster training framework for trading agents.]]></description><link>https://newsletter.recall.network/p/in-the-arena-week-3</link><guid isPermaLink="false">https://newsletter.recall.network/p/in-the-arena-week-3</guid><dc:creator><![CDATA[Sanket]]></dc:creator><pubDate>Mon, 22 Dec 2025 18:08:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!w3Yh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w3Yh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w3Yh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 424w, 
https://substackcdn.com/image/fetch/$s_!w3Yh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!w3Yh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!w3Yh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w3Yh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png" width="1456" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1631576,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/181881044?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!w3Yh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!w3Yh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!w3Yh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!w3Yh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704b1e51-eac1-4d44-869b-55d312f3b275_2000x1053.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>What if improving trading agents didn&#8217;t require waiting for markets?</strong></h2><p>Usually if you want your trading agent to learn from a week of live market data, you have to wait a week. <em>That&#8217;s insane.</em></p><p>The entire thesis of AI agents is compounding improvement: run experiments, learn, iterate, re-deploy. But the feedback loop for trading agents is gated by real-time data. You have to run a trading strategy, and wait a week for results. Test a hypothesis, and pray the market gives you the conditions you need.</p><p>We got tired of waiting. So we decided to build Replay: an experimental training tool to accelerate learning and improvement for the trading agents we&#8217;re building internally at Recall.</p><p><strong>The idea for Replay is simple:</strong> turn &#8220;wait for the market&#8221; into &#8220;run the same hour 200 times.&#8221;</p><p>To do this, we captured 1-minute OHLCV candles plus order book snapshots and stored them in a database. Then we made it possible to query any timeframe (1m, 5m, 15m, 1h, 4h). Need an indicator? Our tool computes it, persists it, and adds it to the dataset. Need performance annotations for agent scoring? Simply define an event and it backfills across your entire history. Done.</p><p>Instead of one agent, one week, one datapoint, we can now run 100 agents on the same week of data in a few hours!</p><p>We&#8217;re not finished making improvements, so Replay stays internal&#8230; for now. We&#8217;re still figuring out the right abstraction for the collection of agent &#8220;skills&#8221; when you unbundle trading from execution. 
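</p><p>The core mechanic is easy to picture in code. The sketch below is illustrative only, assuming nothing about Replay&#8217;s actual API (the function name and column schema are made up for this example): stored 1-minute OHLCV candles roll up into any larger timeframe by taking the first open, max high, min low, last close, and summed volume.</p>

```python
# Minimal sketch of rolling stored 1m OHLCV candles up into larger
# timeframes. Illustrative only: the function name and column schema
# are assumptions, not Recall's actual Replay API.
import pandas as pd

def resample_ohlcv(candles_1m: pd.DataFrame, timeframe: str) -> pd.DataFrame:
    """Aggregate 1-minute candles (DatetimeIndex) into e.g. '5min', '1h', '4h'."""
    agg = {"open": "first", "high": "max", "low": "min",
           "close": "last", "volume": "sum"}
    return candles_1m.resample(timeframe).agg(agg).dropna()

# Usage: ten 1m candles become two 5m candles.
idx = pd.date_range("2025-01-01 00:00", periods=10, freq="min")
candles = pd.DataFrame({
    "open": [float(i) for i in range(10)],
    "high": [float(i) + 1 for i in range(10)],
    "low": [float(i) - 1 for i in range(10)],
    "close": [float(i) for i in range(10)],
    "volume": [1.0] * 10,
}, index=idx)
bars_5m = resample_ohlcv(candles, "5min")
```

<p>The same pattern extends to indicators and event annotations: compute once over the stored history, persist the result, and reuse it across every replay run.</p><p>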
In this setup, there&#8217;s a risk of agents overfitting to replayed history in ways that don&#8217;t transfer to live markets. We&#8217;re still determining how best to reliably detect that. Other questions we&#8217;re considering: Can you even score intuition, or only measurable actions? What are the minimum performance indicators that make replay useful versus misleading?</p><p>We don&#8217;t have all the answers yet, but we&#8217;re now able to run experiments and get immediate results without waiting for markets. That alone changes what&#8217;s possible.</p><div><hr></div><h2><strong>AI Breakthroughs</strong></h2><p><strong>THE GPT-5.2 HANGOVER </strong></p><p>Remember last week when GPT-5.2 dropped and everyone lost their minds? 90.5% on ARC-AGI. 100% on AIME. The benchmarks looked like a victory lap. OpenAI called it their most capable model ever.</p><p>Well, the vibe check came in. And it&#8217;s... complicated.</p><p>Users report that GPT-5.2 feels &#8220;heavily constrained,&#8221; engages in blame-shifting when challenged, and has somehow gotten worse at following instructions. It tops LMArena but ranks 17th on SimpleBench, a common-sense reasoning test, scoring lower than GPT-5.1. The model that was supposed to be smarter is now struggling with trick questions that its predecessor handled fine.</p><p>The kicker? It costs 40% more &#8211; $14 per 1M tokens versus $10 for 5.1.</p><p><strong><a href="https://x.com/gdb/status/1999454952801075353?s=20">BENCHMARKS HAVE AN EXPIRATION DATE. IT&#8217;S SHORTER THAN YOU THINK</a>.</strong></p><p>Here&#8217;s something the leaderboards won&#8217;t tell you: the benchmarks themselves are dying. Community consensus is settling around a brutal truth: useful benchmark half-life is measured in months, not years. All of them: AIME, ARC, the coding staples. </p><p>And they&#8217;re saturating. Models are hitting ceilings not because they&#8217;ve mastered reasoning, but because they&#8217;ve memorized the test. 
</p><p><strong>3-6 months.</strong> That&#8217;s how long a benchmark stays meaningful before it becomes a parlor trick. The evaluation community is scrambling toward dynamic environments (arenas), adversarial tasks, and debate-style challenges.</p><p><strong><a href="https://boundaryml.com/blog/structured-outputs-create-false-confidence">THE CONFIDENCE TRAP OF STRUCTURED OUTPUTS</a></strong></p><p>You know that satisfying feeling when your model returns perfectly formatted JSON every time? Yeah, about that&#8230;</p><p>New research from BoundaryML shows that constrained decoding&#8212;the thing that forces models to output valid structures&#8212;actually degrades quality. The model prioritizes conformance over correctness. Worse, it blocks safety refusals and warnings because they don&#8217;t fit the schema.</p><p>Benchmark scores go up. Real-world reliability goes down.</p><p>The model looks more competent while becoming less trustworthy. That&#8217;s not a bug in the evaluation&#8212;it&#8217;s a feature of how we&#8217;ve built the evaluation.</p><p><strong><a href="https://x.com/natolambert/status/1999528636085649532">FINALLY, RECEIPTS WITH OLMo 3</a></strong></p><p>In a world of &#8220;trust us, it&#8217;s good&#8221; releases, AI2 just dropped OLMo 3 with everything: all checkpoints, all training data, full post-training stack (SFT, DPO, RLVR), and the code to reproduce it.</p><p>Why does this matter? Now you can finally verify claims.</p><p>Here&#8217;s a fun one: RL with random rewards&#8212;a technique that worked fine on Qwen&#8212;completely fails on OLMo 3. Same method, different model, different outcome. Without the open release, nobody would know this.</p><p>This is what reproducible evaluation looks like. 
Take notes.</p><p><strong><a href="https://openai.com/index/frontierscience/">BENCHMARKS ROOTED IN REAL-WORLD SCIENTIFIC REALITY</a></strong></p><p>OpenAI released FrontierScience, a PhD-level scientific reasoning benchmark that does something wild&#8212;it validates answers in actual wet labs.</p><p>One result: a 79x efficiency gain in a cloning protocol, discovered by the model.</p><p>The point isn&#8217;t the number. It&#8217;s that model evaluations mean nothing without real-world application testing. You can ace every reasoning benchmark and still be useless in a lab. Or an office. Or anywhere that isn&#8217;t a standardized test.</p><div><hr></div><h2><strong>AI Research Rabbitholes</strong></h2><p><strong>SAGE-Bench: Agents Turn Video Chaos into 66.1% Reasoning Wins, But Long Clips Still Stump VLMs</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/allen_ai/status/2001351082916630586&quot;,&quot;full_text&quot;:&quot;&#127909; Introducing SAGE, an agentic system for long video reasoning on entertainment videos&#8212;sports, vlogs, &amp;amp; more. It learns when to skim, zoom in, &amp;amp; answer questions directly. On our SAGE-Bench eval, SAGE with a Molmo 2 (8B)-based orchestrator lifts accuracy from 61.8% &#8594; 66.1%. 
&#129525; &quot;,&quot;username&quot;:&quot;allen_ai&quot;,&quot;name&quot;:&quot;Ai2&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-17T17:57:36.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8Y6M9JX0AAIfpm.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/JkGlw3Ad1Z&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:18,&quot;like_count&quot;:126,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><br><strong>Project:</strong> <a href="https://praeclarumjj3.github.io/sage/">https://praeclarumjj3.github.io/sage/</a> | <strong>Code:</strong> <a href="https://github.com/allenai/SAGE">https://github.com/allenai/SAGE</a> | <strong>Paper:</strong> <a href="https://arxiv.org/abs/2512.13874">https://arxiv.org/abs/2512.13874</a></p><p><strong>SERA-Crypto Agent Crushes DMind/Web3 Evals at &lt;45s, Outpacing GPT-5 on Live Queries</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/SentientAGI/status/1999173125099913447&quot;,&quot;full_text&quot;:&quot;Announcing SERA-Crypto (Semantic Embedding &amp;amp; Reasoning Agent): our new reasoning architecture built for SOTA crypto research.\n\n#1 open-source agent on DMind\n#1 on our live crypto benchmark\n\nOutperforms GPT-5, Grok 4, Gemini 2.5 Pro, and Perplexity Finance&#8230;all under 45 seconds. 
&quot;,&quot;username&quot;:&quot;SentientAGI&quot;,&quot;name&quot;:&quot;Sentient&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-11T17:43:10.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G756mDrXsAIgO3p.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/zWG2VmKeOw&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:124,&quot;like_count&quot;:803,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Blog:</strong> <a href="https://blog.sentient.xyz/posts/semantic-embeddings-reasoning-agent-crypto">https://blog.sentient.xyz/posts/semantic-embeddings-reasoning-agent-crypto</a> | <strong>Demo:</strong> <a href="https://chat.sentient.xyz/">https://chat.sentient.xyz/</a></p><p><strong>Eval Protocol:</strong> <strong>Turns Your Dusty Evals into Live Training Signals, 43% Text2SQL Jump w/ No Extra RM</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/FireworksAI_HQ/status/2001372538342146529&quot;,&quot;full_text&quot;:&quot;Everyone has evals. No one knows what to do with them. Meanwhile, your agents aren&#8217;t actually getting smarter. 
Most teams end up with a glorified report card.\n\nBut with Eval Protocol, those same evals can finally do real work.\nThe exact evaluation definition you already have &quot;,&quot;username&quot;:&quot;FireworksAI_HQ&quot;,&quot;name&quot;:&quot;Fireworks AI&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-17T19:22:51.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8ZNtqUbMAAcCPB.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/Qgk5VSGZ01&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:4,&quot;like_count&quot;:12,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Blog:</strong> <a href="https://fireworks.ai/blog/self-improving-agent">https://fireworks.ai/blog/self-improving-agent</a></p><p><strong>Delphi Middleweight Evals Round 5: Gensyn&#8217;s Market Bets Nail Reasoning Gaps in Frontier Models</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/gensynai/status/2001378281095049238&quot;,&quot;full_text&quot;:&quot;Eval 5 of 11 of the Gensyn Middleweight General Reasoning Benchmark market on Delphi is live.     
\n\nView the full benchmarking results now:\n<a class=\&quot;tweet-url\&quot; href=\&quot;https://github.com/gensyn-ai/delphi-middleweight-reasoning\&quot;>github.com/gensyn-ai/delp&#8230;</a> &quot;,&quot;username&quot;:&quot;gensynai&quot;,&quot;name&quot;:&quot;gensyn&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-17T19:45:40.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8ZPJP3asAAU4Ev.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/izMkhxinSp&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:3,&quot;like_count&quot;:46,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Results:</strong> <a href="https://github.com/gensyn-ai/delphi-middleweight-reasoning">https://github.com/gensyn-ai/delphi-middleweight-reasoning</a> | <strong>Market:</strong> <a href="https://delphi.gensyn.ai/">https://delphi.gensyn.ai/</a></p><p><strong>PostTrainBench: Agents Post-Train Base LLMs in 10h, Claude Code&#8217;s &#8220;Reward Hacking&#8221; Exposed</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/maksym_andr/status/2001349332633854267&quot;,&quot;full_text&quot;:&quot;We release PostTrainBench: a benchmark measuring how well AI agents like Claude Code can post-train base LLMs.\n\nWe expect this to be an important indicator for AI R&amp;amp;D automation as it unfolds over the next few years.\n\n&#128279; <a class=\&quot;tweet-url\&quot; href=\&quot;http://posttrainbench.com\&quot;>posttrainbench.com</a>\n&#128194; <a class=\&quot;tweet-url\&quot; href=\&quot;http://github.com/aisa-group/PostTrainBench\&quot;>github.com/aisa-group/Pos&#8230;</a>\n\n1/n &quot;,&quot;username&quot;:&quot;maksym_andr&quot;,&quot;name&quot;:&quot;Maksym 
Andriushchenko&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-17T17:50:38.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8Y2k4QXAAAyeKT.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/3GJhUgiJWy&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:26,&quot;like_count&quot;:221,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Site: </strong><a href="https://posttrainbench.com">https://posttrainbench.com</a> | <strong>GitHub</strong>: <a href="https://github.com/aisa-group/PostTrainBench">https://github.com/aisa-group/PostTrainBench</a></p><p><strong>Context-Bench Update: GPT-5.2 Dethrones Opus 4.5 on Filesystem/Skills: But DeepSeek v3.2 OSS King</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/Letta_AI/status/1999325307623604698&quot;,&quot;full_text&quot;:&quot;New Context-Bench results: <span class=\&quot;tweet-fake-link\&quot;>@OpenAI</span>'s GPT-5.2 takes #1 on both eval suites, dethroning <span class=\&quot;tweet-fake-link\&quot;>@AnthropicAI</span>'s Opus 4.5\n\nBreakdown:\n- 6% better on Filesystem, 9% better on Skills\n- 1.5-2.1x Opus cost but still cheaper than Gemini 3 Pro  \n\n<span class=\&quot;tweet-fake-link\&quot;>@deepseek_ai</span> v3.2 takes #1 on OSS jumps:\n- 16%  https://t.co/3u30306RHR&quot;,&quot;username&quot;:&quot;Letta_AI&quot;,&quot;name&quot;:&quot;Letta&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-12T03:47:53.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G78HmTnaUAAjkom.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/UxSWF0H25Z&quot;,&quot;alt_text&quot;:&quot;https://leaderboard.letta.com&quot;}],&quot;quoted_tweet&quot;:{&quot;full_text&quot;:&quot;New Context-Bench results with Opus 
4.5 and Gemini 3! \n\nWe compare recent model releases on context engineering tasks (filesystem &amp;amp; skill use) -- which determine memory + context management capabilities\n\nOpus 4.5 tops the filesystem suite with impressive reductions in cost https://t.co/mtPm2XITW2&quot;,&quot;username&quot;:&quot;Letta_AI&quot;,&quot;name&quot;:&quot;Letta&quot;},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:3,&quot;like_count&quot;:22,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Results:</strong> <a href="https://www.letta.com/blog/context-bench">https://www.letta.com/blog/context-bench</a></p><p><strong>Augment Code Review: GPT-5.2-Powered Reviewer Tops Precision/Recall on 50 Real PRs</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/augmentcode/status/1999182788671733991&quot;,&quot;full_text&quot;:&quot;Introducing Augment Code Review, powered by GPT 5.2. It's the #1-ranked AI code reviewer across precision, recall, and overall quality.\n\nFree for the first week for every paying customer, and free for open source projects. 
&quot;,&quot;username&quot;:&quot;augmentcode&quot;,&quot;name&quot;:&quot;Augment Code&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-11T18:21:34.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/o555zhl7vhoyfq5c6pnj&quot;,&quot;link_url&quot;:&quot;https://t.co/AFOCwXZUK6&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:21,&quot;like_count&quot;:93,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/1998838457888813056/vid/avc1/480x270/muijmEUX3h_fNDkA.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Blog:</strong> <a href="https://www.augmentcode.com/blog/code-review-benchmark">https://www.augmentcode.com/blog/code-review-benchmark</a> | <strong>Free Trial:</strong> <a href="https://www.augmentcode.com/product/code-review">https://www.augmentcode.com/product/code-review</a></p><p><strong>Grok Voice Agent API: #1 Big Bench Audio at 92.3%: xAI&#8217;s $0.05/min Latency Slayer</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/rohanpaul_ai/status/2001630466278068582&quot;,&quot;full_text&quot;:&quot;&#129504; Grok Voice Agent is xAI&#8217;s first public speech-to-speech API, and it now leads Artificial Analysis's Big Bench Audio benchmark with 92.3% speech reasoning accuracy.\n\n- Ahead of Gemini 2.5 Flash Native Audio and GPT Realtime\n- Big Bench Audio is built from 1,000 audio questions  https://t.co/GPZzkGzri1&quot;,&quot;username&quot;:&quot;rohanpaul_ai&quot;,&quot;name&quot;:&quot;Rohan 
Paul&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-18T12:27:46.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8c3KqbakAAPtCU.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/3Esj0ilNMa&quot;}],&quot;quoted_tweet&quot;:{&quot;full_text&quot;:&quot;&#127897;&#65039; xAI launched the Grok Voice Agent API\n\n- Leads the industry in cost-efficiency. Developers are billed at a simple flat rate of $0.05 per minute.\n- In blind head-to-head human evaluations against OpenAI, Grok is consistently rated as the preferred model across axes such as https://t.co/cvDBFvJ5gd&quot;,&quot;username&quot;:&quot;rohanpaul_ai&quot;,&quot;name&quot;:&quot;Rohan Paul&quot;},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:0,&quot;like_count&quot;:2,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Gemini 3 Flash: Cost-Intelligence King at 91 IQ: Outsmarts Opus 4.5 on AA Index</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/OfficialLoganK/status/2001368440016392314&quot;,&quot;full_text&quot;:&quot;Gemini 3 Flash on the <span class=\&quot;tweet-fake-link\&quot;>@ArtificialAnlys</span> intelligence benchmark, the most cost per intelligence efficient model in the world!!! 
&quot;,&quot;username&quot;:&quot;OfficialLoganK&quot;,&quot;name&quot;:&quot;Logan Kilpatrick&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-17T19:06:34.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8ZJufSbwAASizj.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/4gWjDFltpN&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:97,&quot;like_count&quot;:1162,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Index:</strong> <a href="https://artificialanalysis.ai/">https://artificialanalysis.ai/</a></p><p><strong>Video Reality Test: Veo3.1 Fools VLMs at 56%: Humans Spot ASMR Fakes at 81%</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/rohanpaul_ai/status/2001412591130939559&quot;,&quot;full_text&quot;:&quot;&#129514; Video Reality Test is a new benchmark that checks if AI-generated ASMR videos with sound can fool humans and video language models (VLMs).\n\nFinal takeaway, many VLMs still miss micro-consistency errors, like whether the sound truly matches the exact touch, scrape, or cut.\n\nIn &quot;,&quot;username&quot;:&quot;rohanpaul_ai&quot;,&quot;name&quot;:&quot;Rohan 
Paul&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-17T22:02:00.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/txgoszc2zfhugde9zzwk&quot;,&quot;link_url&quot;:&quot;https://t.co/xoQLO57Mi7&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:2,&quot;like_count&quot;:7,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2001412274494541825/vid/avc1/480x270/kZDp_5Lf3I3qL9rw.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Paper:</strong> <a href="https://arxiv.org/abs/2512.13281">https://arxiv.org/abs/2512.13281</a></p><p><strong>GuideLLM: Visualises LLM Trade-Offs: Accuracy vs Latency/Cost in One Dashboard</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/RedHat_AI/status/2000586200256528389&quot;,&quot;full_text&quot;:&quot;AI Explained: Benchmarking LLMs with GuideLLM\n\nAccuracy, Latency, Cost. Usually, if you pick two, the third one gets lost. 
Jenny Yi from Red Hat AI explains how GuideLLM helps you visualize the trade-offs and benchmark your models before you deploy.\n\n&#128279; <a class=\&quot;tweet-url\&quot; href=\&quot;https://github.com/vllm-project/guidellm\&quot;>github.com/vllm-project/g&#8230;</a> &quot;,&quot;username&quot;:&quot;RedHat_AI&quot;,&quot;name&quot;:&quot;Red Hat AI&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-15T15:18:13.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/xnen7gbzhw85yssdkpxy&quot;,&quot;link_url&quot;:&quot;https://t.co/s2pvGPkEUT&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:4,&quot;like_count&quot;:13,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2000585783044931590/vid/avc1/320x568/4xy2thA3UTfAGa4w.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>GitHub:</strong> <a href="https://github.com/vllm-project/guidellm">https://github.com/vllm-project/guidellm</a></p><p><strong>LocalSearchBench: Agents Flop at Real-World Planning: 90% Fail Basic Day Trips</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/alxnderhughes/status/1998330882227425397&quot;,&quot;full_text&quot;:&quot;This is insane&#8230; Someone finally built a benchmark that shows how bad AI agents actually are at real-world local search and the results are shocking.\n\nEveryone keeps acting like AI agents are about to take over the real world.\n\nBut LocalSearchBench just proved the opposite: &quot;,&quot;username&quot;:&quot;alxnderhughes&quot;,&quot;name&quot;:&quot;Alex 
Hughes&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-09T09:56:24.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G7t_WLEbAAA55L1.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/ZY1S9Zs4kQ&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:57,&quot;like_count&quot;:238,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Paper:</strong> <a href="https://arxiv.org/pdf/2512.07436">https://arxiv.org/pdf/2512.07436</a></p><p><strong>Nemotron3 Nano: NVIDIA&#8217;s 30B Hybrid SSM Tops Agent Benches w/ 1M Context</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/mervenoyann/status/2000583432997331083&quot;,&quot;full_text&quot;:&quot;NVIDIA cooked!\n\nNemotron3 Nano is in, along with the training datasets and an agent env library &#128293;\n\n&amp;gt; A3B/30B hybrid SSM, 1M context size, built for agentic use \n&amp;gt; leading in both benchmarks and throughput &#128588;&#127995;\n&amp;gt; license enables commercial use \n\nSuper and Ultra coming soon! 
&quot;,&quot;username&quot;:&quot;mervenoyann&quot;,&quot;name&quot;:&quot;merve&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-15T15:07:14.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8N_BHvWEAgsnBZ.png&quot;,&quot;link_url&quot;:&quot;https://t.co/7usqDJJZz7&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:15,&quot;like_count&quot;:123,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Collection:</strong> <a href="https://huggingface.co/collections/nvidia/nvidia-nemotron-v3">https://huggingface.co/collections/nvidia/nvidia-nemotron-v3</a></p><p><strong>KAT-Coder-Pro V1: 64 Intelligence Index: Non-Reasoning Champ w/ 73.4% SWE-Verified</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/KwaiAICoder/status/2000524762980974972&quot;,&quot;full_text&quot;:&quot;KAT-Coder-Pro V1&#65306; HUGE UPDATE!\n\nThe latest version of KAT-Coder-Pro V1 hits 64 on the Artificial Analysis Intelligence Index,  ranking 10th globally among all models\n\n- Boosted benchmark performance\n- Strengthened reasoning capabilities\n- Enhanced front-end capabilities\n- 
&quot;,&quot;username&quot;:&quot;KwaiAICoder&quot;,&quot;name&quot;:&quot;KwaiKAT&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-15T11:14:06.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8NKqgwasAAmccj.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/Avh8DR0Pci&quot;},{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8NKqgsbMAAhWJ8.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/Avh8DR0Pci&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:31,&quot;like_count&quot;:294,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>MathArena V2: Beyond Leaderboards: IRT Reveals Model Surprises &amp; Confidence Bands</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/j_dekoninck/status/2000589658351378938&quot;,&quot;full_text&quot;:&quot;Benchmarks need more than a single number at the top of a leaderboard.\n\nWe have always believed this at MathArena, and are now taking it to the next level: introducing V2 of our website, with many new features to fully extract all useful information out of our benchmark &#127881; &quot;,&quot;username&quot;:&quot;j_dekoninck&quot;,&quot;name&quot;:&quot;Jasper Dekoninck&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-15T15:31:58.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8OFsUbbQAALAAh.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/brAcE9Foxy&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:10,&quot;like_count&quot;:60,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Site:</strong> <a 
href="https://matharena.ai/">https://matharena.ai/</a></p><p><strong>FAI-C Benchmark: LLMs Flop Christian Values: Generic Spirituality Over Biblical Ethics</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/PGelsinger/status/2000614696513323059&quot;,&quot;full_text&quot;:&quot;Excited to launch the Flourishing AI Christian (FAI-C) Benchmark today. This builds on the original Flourishing AI Benchmark and measures how the top LLMs support holistic human wellbeing through a user's specific value system -- in this case, Christianity. &quot;,&quot;username&quot;:&quot;PGelsinger&quot;,&quot;name&quot;:&quot;Pat Gelsinger&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-15T17:11:27.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G8ObsG-bwAE_EDa.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/qAVRmBvfMm&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:20,&quot;like_count&quot;:171,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Report:</strong> <a href="https://gloo.com/flourishing-hub/research">https://gloo.com/flourishing-hub/research</a></p><p><strong>AlphaXiv SOTA Tracker: arXiv&#8217;s Million-Paper Index Ranks True Leaders Across Benches</strong></p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/askalphaxiv/status/1999123253793902619&quot;,&quot;full_text&quot;:&quot;Introducing benchmark tracking across all of arXiv\n\nWith thousands of models and benchmarks out there, it's impossible to track what's actually state-of-the-art\n\nWe now index results from millions of papers to surface which models are SOTA according to open source research! 
&quot;,&quot;username&quot;:&quot;askalphaxiv&quot;,&quot;name&quot;:&quot;alphaXiv&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-12-11T14:25:00.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/vrv7cro7qiaji9kgao1f&quot;,&quot;link_url&quot;:&quot;https://t.co/SW7t8jjQ5G&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:38,&quot;like_count&quot;:205,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/1999055583430000641/vid/avc1/1950x1080/aFqQbfD5CxXR_FAM.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p><strong>Tracker:</strong> <a href="https://www.alphaxiv.org/state-of-the-art">https://www.alphaxiv.org/state-of-the-art</a></p><div><hr></div><h2><strong>What We&#8217;re Hacking This Week</strong></h2><p><strong>Replay: Because Waiting for Markets is Stupid</strong></p><p>If you want to learn from a week of market behavior, you have to wait a week. That&#8217;s insane when the thesis is compounding improvement at scale.</p><p>So we built Replay, a training tool for trading agents that turns &#8220;waiting for the market&#8221; into &#8220;running it again on the same hour 200 times.&#8221; We wanted to make it possible to train 100 agents on a week of market data in a few hours.</p><p><strong>We Killed PRs (And Nothing Bad Happened)</strong></p><p>Hot Take: Pull requests are a bottleneck that AI coding makes less defensible.</p><p>So we disabled them. </p><p>Agent-generated code now auto-closes on submit. Instead of PRs, reviews and quality are enforced by automation, pre-commit hooks, merge hooks, and pre-push hooks running the full QA suite. It&#8217;s lint, types, and tests cranked up to 11. Humans would find this annoying. 
Coding agents love it.</p>]]></content:encoded></item><item><title><![CDATA[In the Arena: Week 2]]></title><description><![CDATA[GPT-5.2 scores 90.5% on ARC-AGI but breaks on Code Arena. Grok-4 pulls 167% more profit than GPT-5. Stanford says 1 in 20 benchmarks is broken, but maybe they're all broken.]]></description><link>https://newsletter.recall.network/p/in-the-arena-week-2</link><guid isPermaLink="false">https://newsletter.recall.network/p/in-the-arena-week-2</guid><dc:creator><![CDATA[Sanket]]></dc:creator><pubDate>Fri, 12 Dec 2025 15:39:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Jp51!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jp51!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jp51!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!Jp51!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!Jp51!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Jp51!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jp51!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png" width="1456" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1631469,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/181323258?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jp51!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!Jp51!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 848w, 
https://substackcdn.com/image/fetch/$s_!Jp51!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!Jp51!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73289bd8-73bc-4c0e-944d-d5ad4a4dda1d_2000x1053.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Are LLMs better as single-asset maxis?</strong></h2><p>The best Bitcoin traders spend years watching BTC. 
They learn its rhythms, how it behaves around halving cycles, how it responds to macro news, how it moves differently during US versus Asia hours. Depth of focus is what separates professionals from amateurs who spread themselves thin and master nothing.</p><p>Similarly, do LLMs perform better trading crypto when they focus on only one token? This week, we built and launched the <strong><a href="https://app.recall.network/competitions/32717de4-2646-42fa-8087-dfab079acc01">Aerodrome Arena</a></strong> to find out.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ruyB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ruyB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 424w, 
https://substackcdn.com/image/fetch/$s_!ruyB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 848w, https://substackcdn.com/image/fetch/$s_!ruyB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 1272w, https://substackcdn.com/image/fetch/$s_!ruyB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ruyB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png" width="847" height="562" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:847,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:116815,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/181323258?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ruyB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 
424w, https://substackcdn.com/image/fetch/$s_!ruyB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 848w, https://substackcdn.com/image/fetch/$s_!ruyB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 1272w, https://substackcdn.com/image/fetch/$s_!ruyB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4ec26e8-cf64-4ed8-a7c9-15be40469b29_847x562.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" 
x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We ran multiple instances of four popular LLMs: GPT-5, Claude Sonnet 4.5, Grok-4, and DeepSeek 3.2. Each instance was tasked with trading a single token on Aerodrome&#8217;s DEX with real capital at stake over a four-day period. And after as little as 48 hours, we started gathering metrics that static benchmarks could never capture:</p><p><strong>Turnover Efficiency</strong> tracks profit extracted per trade. Grok-4 trading ETH led at 0.24% per trade, 167% more efficient than GPT-5. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QfeR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QfeR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 424w, https://substackcdn.com/image/fetch/$s_!QfeR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 848w, https://substackcdn.com/image/fetch/$s_!QfeR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 1272w, https://substackcdn.com/image/fetch/$s_!QfeR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!QfeR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png" width="1336" height="571" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:571,&quot;width&quot;:1336,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86505,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/181323258?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QfeR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 424w, https://substackcdn.com/image/fetch/$s_!QfeR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 848w, https://substackcdn.com/image/fetch/$s_!QfeR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 1272w, https://substackcdn.com/image/fetch/$s_!QfeR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd46a04-67ba-4847-bc5f-c46295dddc70_1336x571.png 1456w" 
sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Regret Curve</strong> tracks missed opportunity from suboptimal decisions. LLMs started the competition at 20% optimized picks and climbed to 90%+, proving they were able to learn in real-time without retraining. 
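A regret metric like this can be sketched in a few lines of Python. This is a hypothetical simplification (the pick labels and the definition of the ex-post optimal pick are assumptions for illustration, not the actual competition code):

```python
def regret_curve(picks, optimal):
    """Running fraction of decisions that matched the ex-post optimal pick."""
    hits, curve = 0, []
    for i, (pick, best) in enumerate(zip(picks, optimal), start=1):
        hits += (pick == best)          # count agreement with the optimal action
        curve.append(hits / i)          # cumulative hit rate after i decisions
    return curve

# Toy run: early picks miss the optimum, later picks align with it.
print(regret_curve(["hold", "sell", "buy", "buy"], ["buy", "buy", "buy", "buy"]))
# [0.0, 0.0, 0.3333333333333333, 0.5]
```

A rising curve means the agent's decisions are converging on the optimal ones without any retraining step.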
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PPOa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PPOa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 424w, https://substackcdn.com/image/fetch/$s_!PPOa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 848w, https://substackcdn.com/image/fetch/$s_!PPOa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 1272w, https://substackcdn.com/image/fetch/$s_!PPOa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PPOa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png" width="1334" height="562" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:1334,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:185562,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/181323258?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PPOa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 424w, https://substackcdn.com/image/fetch/$s_!PPOa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 848w, https://substackcdn.com/image/fetch/$s_!PPOa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 1272w, https://substackcdn.com/image/fetch/$s_!PPOa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd4408e2-b737-46a2-b1e3-e3dc6bbf433e_1334x562.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Flip-Flop Monitor</strong> tracks how often LLMs reverse their positions under pressure. 
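In a minimal sketch, assuming simple long/short position labels and a flat per-reversal fee (both assumptions for illustration, not the actual monitor), the metric looks like:

```python
def flip_flop_stats(positions, fee_per_flip=0.001):
    """Reversal rate across consecutive positions, plus cumulative fee drag.

    fee_per_flip is an assumed round-trip cost as a fraction of capital."""
    flips = sum(
        {prev, cur} == {"long", "short"}   # a long->short or short->long reversal
        for prev, cur in zip(positions, positions[1:])
    )
    decisions = max(len(positions) - 1, 1)
    return flips / decisions, flips * fee_per_flip

# 3 reversals out of 4 decisions -> reversal rate 0.75
rate, drag = flip_flop_stats(["long", "short", "long", "long", "short"])
```

High reversal rates are not just indecision: each flip pays fees, so the drag compounds against the agent.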
Reversal rates spiked to 70%, with cumulative fee drag eating 0.70% of capital.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hExe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hExe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 424w, https://substackcdn.com/image/fetch/$s_!hExe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 848w, https://substackcdn.com/image/fetch/$s_!hExe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 1272w, https://substackcdn.com/image/fetch/$s_!hExe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hExe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png" width="1325" height="497" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:497,&quot;width&quot;:1325,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69248,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.recall.network/i/181323258?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hExe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 424w, https://substackcdn.com/image/fetch/$s_!hExe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 848w, https://substackcdn.com/image/fetch/$s_!hExe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 1272w, https://substackcdn.com/image/fetch/$s_!hExe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3faa5ae-e3f5-4738-83f8-73d12f6bf398_1325x497.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Beyond PnL, metrics like these mirror how real traders evaluate themselves to determine if they&#8217;re extracting value or just bleeding, improving or repeating mistakes, holding or bailing through volatility.</p><p><strong>So can models trade single-asset crypto?</strong> Early indications are they&#8217;re learning fast, and already outperforming the top 1% of traders on a lot of metrics. 
We&#8217;ll keep evaluating and investigating as the competition unfolds further.</p><p><strong>Visit Recall&#8217;s <a href="https://app.recall.network/competitions/32717de4-2646-42fa-8087-dfab079acc01">Aerodrome Arena</a> to track how LLMs perform trading single-asset crypto.</strong></p><div><hr></div><h2><strong>In the News</strong></h2><h3><strong>AI Breakthroughs</strong></h3><p><strong><a href="https://openai.com/index/introducing-gpt-5-2/">OpenAI&#8217;s GPT-5.2 Dominates Headlines</a></strong> &#8211; GPT-5.2 just dropped with numbers that matter: 90.5% SOTA on ARC-AGI v1 (390x efficiency gain YoY&#8212;from $4.5k/task to $11.64), human-expert level on GDPval (70.9% win rate across 44 occupations, 74.1% on economically valuable tasks), 100% on AIME 2025, and hallucinations cut 30-40%. <strong>But here&#8217;s the catch:</strong> it breaks on Code Arena despite SWE-bench gains. </p><p><strong><a href="https://arcprize.org">ARC Prize 2025: NVARC Hits 25% on AGI Reasoning Benchmark</a></strong> &#8211; NVARC achieved 25.03% accuracy on ARC-AGI&#8212;the critical metric for measuring genuine reasoning capabilities beyond pattern matching. All winning solutions are open-sourced, giving practitioners reproducible benchmarking approaches. Poetiq established new SOTA on ARC-AGI-1 and 2 using meta-systems that build intelligence on top of any model, integrating new models within hours of release.</p><p><strong><a href="https://twitter.com/NousResearch/status/1998536543565127968">Nomos 1 Reaches #2 on Putnam: 87/120 with 3B Active Parameters</a></strong> &#8211; A 30B open-weight model ranks #2 of 3,988 on this year&#8217;s Putnam exam&#8212;achieved via specialized post-training and agentic orchestration with only ~3B active parameters at inference. The implication: smaller, specialized models with thoughtful evaluation pipelines outperform brute-force scale. 
</p><p><strong><a href="https://www.resemble.ai/detect-3b-omni/">DETECT-3B Omni: Benchmarking Deepfake Detection at Scale</a></strong> &#8211; The newly released DETECT-3B Omni multimodal deepfake detector achieves 94-99% accuracy across image, voice, and video, now ranked #1 on HuggingFace Speech DeepFake Arena and DFBench.</p><h3><strong>Reasoning Models Dominate Production</strong></h3><p><strong><a href="https://openrouter.ai/reports/2025">50% of API Usage in Under 12 Months</a> </strong>&#8211; OpenRouter&#8217;s analysis of 100 trillion tokens reveals that reasoning models now exceed 50% of token consumption. The shift from single-pass generation to multi-step deliberation happened faster than predicted. Chinese open-weight models (DeepSeek, Qwen3, Kimi K2) captured a massive share even as overall open-source usage plateaued. <em>The evaluation gap: models performing differently at reasoning_depth=1 vs depth=5 can&#8217;t be compared on legacy benchmarks.</em> </p><h3><strong>Google Continues Its Campaign of Dominance</strong></h3><p><strong><a href="https://blog.google/technology/developers/gemini-3-pro-vision/">Gemini 3 Deep Think: Parallel Reasoning Architecture</a> </strong>&#8211; Explores multiple reasoning paths simultaneously (unlike o1&#8217;s sequential approach), showing +24 points on Arena Expert prompts. But Opus 4.5 non-thinking also excels, revealing that reasoning architecture diversity requires different evaluation metrics. <em>Parallel vs sequential creates different failure modes.</em></p><p><strong>Gemini 3 Pro: Multimodal Vision Evaluation Challenge </strong>&#8211; Sets new benchmarks in multimodal vision with enhanced document, screen, spatial, and video understanding. Raises the bar for visual grounding evaluation; existing benchmarks may not catch hallucinations in spatial reasoning.
New frameworks are needed for document understanding at scale.</p><h3><strong>Mistral Pushes Open Weights Forward</strong></h3><p><strong><a href="https://mistral.ai">Devstral 2: Open-Weights Changes Evaluation Game</a> </strong>&#8211; 123B dense model (72.2% SWE-bench) + 24B variant (68%), both open weights, tied/beat DeepSeek v3.2. Ships with Vibe CLI for agentic workflows. Open weights at GPT-4o coding levels mean you can run your own benchmarks, fine-tune for tasks, and verify internal behaviours. </p><h3><strong>Alignment Is Capability</strong></h3><p><strong><a href="https://off-policy.com">Why Evaluation Must Be Built Into Training</a> </strong>&#8211; Counter-intuitive thesis: labs treating alignment as a post-hoc constraint will hit a ceiling, while those integrating it into core research pull ahead. AGI requires alignment from the ground up. This reframes the entire evaluation problem: you can&#8217;t bolt on safety verification after training. The architecture itself determines what&#8217;s verifiable. <em>Evaluation isn&#8217;t separate from capability development. It&#8217;s part of the same problem.</em></p><h3><strong>Reliability More Important Than Capability?</strong></h3><p><strong><a href="https://arxiv.org">MAP Study: Reliability Beats Capability</a> </strong>&#8211; Berkeley/Stanford/IBM study found that simple controllable patterns with heavy human oversight dominate production. Complex reasoning agents reliably fail on edge cases that eval suites don&#8217;t catch. Environment/tool reliability matters more than algorithm innovation. <em>Only 5% of AI projects are getting value because teams choose the wrong problems.
Selection bias in evaluation kills more than bad architectures.</em></p><div><hr></div><h2><strong>AI Research Rabbitholes</strong></h2><p><strong>GPT-5.2 Headlines</strong></p><ul><li><p><strong><a href="https://openai.com/index/introducing-gpt-5-2/">GPT-5.2: 90.5% SOTA on ARC-AGI v1, Human-Expert on GDPval, 100% AIME 2025</a></strong> &#8212; 390x efficiency gain YoY ($4.5k/task to $11.64); hallucinations cut 30-40%; independent evals flag gaps vs Opus/Gemini on agentic tasks | <a href="https://x.com/OpenAI/status/1999182104362668275">X Thread</a></p></li><li><p><strong><a href="https://contextarena.com/">Context Arena: GPT-5.2 X-High Hits 86.1% SOTA on 8-Needle MRCR at 128K</a></strong> &#8212; Up to 20-min responses, 5x tokens vs priors | <a href="https://x.com/DillonUzar/status/1999326860530876866">X Thread</a></p></li><li><p><strong><a href="https://www.swebench.com/">SWE-Bench Update: GPT-5.2 High #3 (80.0%), Medium Closes Gap to Sonnet 4.5</a></strong> &#8212; Fewer steps (14-17 vs 100+), cost-efficient | <a href="https://x.com/KLieret/status/1999222709419450455">X Thread</a></p></li><li><p><strong><a href="https://openai.com/index/gdpval/">GDPval: GPT-5.2 Outperforms In-Domain Experts on Knowledge Work</a></strong> &#8212; First model at human-expert level (70.9% win rate) across 44 occupations | <a href="https://x.com/polynoamial/status/1999186989388824935">X Thread</a></p></li><li><p><strong><a href="https://andonlabs.com/evals/vending-bench-2">Vending-Bench 2: GPT-5.2 Ranks #3 with Strong Continual Learning</a></strong> &#8212; Performance jumps in simulation&#8217;s second half suggest adaptation gains | <a href="https://x.com/andonlabs/status/1999421776640749837">X Thread</a></p></li><li><p><strong>LisanBench: GPT-5.2 Thinking Improves Validity but Trails Opus 4.5 &amp; Gemini 3 Pro</strong> &#8212; Sets 2 new records; lags in reasoning efficiency | <a href="https://x.com/scaling01/status/1999240662147825876">X Thread</a></p></li></ul><p><strong>Fresh 
Benchmarks</strong></p><ul><li><p><strong>KaBLE Eval: LLMs Struggle Distinguishing Facts from False Beliefs</strong> &#8212; Stanford&#8217;s 13K-question suite; GPT-4o drops to 64.4% on false beliefs, risks in healthcare/law | <a href="https://x.com/psypost/status/1999451893794480310">X Thread</a></p></li><li><p><strong><a href="https://cohere.com/blog/rerank-4">Cohere Rerank 4: SOTA Reranker with Best Relevance, Speed, Multilingual</a></strong> &#8212; Deployable on Cohere API/AWS/Azure | <a href="https://x.com/aidangomez/status/1999167985752187092">X Thread</a></p></li><li><p><strong><a href="https://deepmind.google/technologies/facts-benchmark/">FACTS Benchmark Suite: Gemini 3 Pro Leads at 68.8%</a></strong> &#8212; DeepMind&#8217;s first comprehensive eval across knowledge/search/grounding/multimodal; shared on Kaggle for reproducibility | <a href="https://x.com/googledeepmind/status/1998831084277313539">X Thread</a></p></li><li><p><strong><a href="https://github.com/LiveBench/LiveBench/">LiveBench: Contamination-Free LLM Evaluation</a></strong> &#8212; Ongoing questions track model progress without data leaks; solves the &#8220;trained on the test set&#8221; problem.</p></li><li><p><strong><a href="https://github.com/JailbreakBench/jailbreakbench/">JailbreakBench: Standardised Safety Eval</a></strong> &#8212; Framework for testing adversarial attacks; reproducible red-teaming methodology | <a href="https://x.com/tom_doerr/status/1998544209322660332">X Thread</a></p></li><li><p><strong><a href="https://www.databricks.com/blog/introducing-officeqa-benchmark-end-to-end-grounded-reasoning">OfficeQA: Databricks&#8217; $100K Benchmark for Real Work</a></strong> &#8212; Tests reliability on mundane office tasks; competition for AI diligence evaluation | <a href="https://x.com/matei_zaharia/status/1998522148777083116">X Thread</a></p></li><li><p><strong><a href="https://www.uipath.com/ai/research/ui-cube-benchmark/">UI-Cube Benchmark: RPA Agent 
Evaluation</a></strong> &#8212; UiPath&#8217;s eval for automation agents; tests reliability in enterprise tasks where &#8220;not all AI is created equal.&#8221; | <a href="https://x.com/UiPath/status/1996958520189612107">X Thread</a></p></li></ul><p><strong>Reasoning &amp; AGI Benchmarks</strong></p><ul><li><p><strong><a href="https://arcprize.org/leaderboard">ARC-AGI Verified: Poetiq Hits 54% on Semi-Private Set</a></strong> &#8212; New SOTA on genuine reasoning tasks; humans solve 100% but verification confirms progress toward AGI | <a href="https://x.com/arcprize/status/1994100080081854674">X Thread</a></p></li><li><p><strong><a href="https://www.cortex-agi.com/">Cortex-AGI: DeepSeek Leads Procedural Puzzles at 41%</a></strong> &#8212; Exponential complexity tests; Grok 4.1 close on efficiency metrics | <a href="https://x.com/teortaxestex/status/1997351717692719158">X Thread</a> </p></li><li><p><strong><a href="https://passing2961.github.io/refinebench-page/">RefineBench: LLMs Struggle with Self-Refinement Without Guidance</a></strong> &#8212; NeurIPS paper shows free-form tasks yield minimal gains sans checklists; evaluation structure matters | <a href="https://x.com/wellecks/status/1997348935766307088">X Thread</a></p></li></ul><p><strong>Agentic Benchmarks</strong></p><ul><li><p><strong><a href="https://nof1.ai/">Alpha Arena S1.5: Mystery Model Wins Trading Comp</a></strong> &#8212; Grok 4.20 achieved +12% avg returns with verifiable trades across 4 markets; economic outcomes as evaluation | <a href="https://x.com/jay_azhang/status/1996984618751611006">X Thread</a></p></li><li><p><strong><a href="https://gotrader.gopher-ai.com/">AutoGopher Agent Royale: 1200+ User-Owned AI Agents Compete on Hyperliquid Perps</a></strong><br>Decentralized arena for custom LLMs/strategies with wallet auth; Season 2 wrapped, emphasizing real P&amp;L over static evals | <a href="https://x.com/gopher_ai/status/1996310097988116526">X Thread </a></p></li></ul><p><strong>Domain-Specific
Evals</strong></p><ul><li><p><strong><a href="https://www.kaggle.com/benchmarks/deepmind/indic-gen-bench/leaderboard">IndicGenBench: 29 Indian Languages Including 18 New Ones</a></strong> &#8212; DeepMind&#8217;s Kaggle-hosted benchmark for summarisation/translation/QA in low-resource Indic languages | <a href="https://x.com/kaggle/status/1998377514553512303">X Thread</a></p></li><li><p><strong><a href="https://arxiv.org/abs/2512.04578">LexGenius: Expert-Level Legal AI for Chinese Law</a></strong> &#8212; 8K tagged questions expose gaps in ethics/procedural judgment beyond raw accuracy | <a href="https://x.com/rohanpaul_ai/status/1997145975324831841">X Thread</a></p></li><li><p><strong><a href="https://arxiv.org/abs/2512.06065">PAI-Bench: Physical AI Perception/Prediction</a></strong> &#8212; 2,808 real cases; video generators lack coherence, multimodal LLMs flop at forecasting </p></li><li><p><strong><a href="https://arxiv.org/abs/2512.06065">EgoEdit: Egocentric Video Editing Benchmark</a></strong> &#8212; Snap Research&#8217;s streaming editor evaluation; 50K clips/100K pairs for AR applications | <a href="https://x.com/_akhaliq/status/1998435647183319190">X Thread</a></p></li></ul><p><strong>Multimodal &amp; Vision Evals</strong></p><ul><li><p><strong><a href="https://artificialanalysis.ai/image/arena">FLUX.2 [dev] Claims #1 Open T2I, #2 Editing</a></strong> &#8212; Black Forest Labs&#8217; non-commercial release benchmarked on image generation arena | <a href="https://x.com/bfl_ml/status/1996874709804470733">X Thread</a></p></li><li><p><strong><a href="https://artificialanalysis.ai/video/arena">Runway Gen-4.5 Leads T2V Arena vs Veo 3/Kling 2.5</a></strong> &#8212; Artificial Analysis shows new text-to-video edges Sora 2 Pro on realism/motion | <a href="https://x.com/ArtificialAnlys/status/1996052123470209164">X Thread</a></p></li></ul><p><strong>Coding &amp; Development Evals</strong></p><ul><li><p><strong><a 
href="https://huggingface.co/collections/EssentialAI/rnj-1">Rnj-1: Essential AI&#8217;s 8B Tops SWE-Bench Verified at 20.8%</a></strong> &#8212; USA open LLM from-scratch pretrain on zettaflops AMD/TPU; code/math focus | <a href="https://x.com/gordic_aleksa/status/1997128393939472805">X Thread</a></p></li><li><p><strong><a href="https://www.orchids.app/">Orchids IDE Tops App Bench for End-to-End Dev</a></strong> &#8212; Vibe coding with multimodal agent, browser, Supabase, Stripe integration; local with no lock-in | <a href="https://x.com/orchidsapp/status/1998426257504006222">X Thread</a></p></li></ul><p><strong>Meta-Research on Benchmarks</strong></p><ul><li><p><strong><a href="https://hai.stanford.edu/news/squashing-fantastic-bugs-researchers-look-to-fix-flaws-in-ai-benchmarks">Stanford: 1 in 20 AI Benchmarks Have Serious Flaws</a></strong> &#8212; Analysis of 445 NLP/ML evals finds construct validity issues; 8 recommendations for fixing benchmark design | <a href="https://x.com/StanfordHAI/status/1998817700366200854">X Thread</a></p></li><li><p><strong><a href="https://cleanlab.ai/blog/structured-output-benchmark/">Cleanlab: Structured Output Benchmarks Riddled with Label Errors</a></strong> &#8212; Open experiments fixing ground truth; transparency in evaluation data quality | <a href="https://x.com/tech_optimist/status/1997789993142780083">X Thread</a></p></li><li><p><strong><a href="https://arxiv.org/abs/2512.05765">BioProtocol: AI-for-Science Benchmarks Broken</a></strong> &#8212; Screens 3.2K papers; proposes dialogue quality, orchestration, trust calibration for real R&amp;D cycles requiring multi-day memory | <a href="https://x.com/DeSciNews/status/1998890585672004130">X Thread</a></p></li><li><p><strong><a href="https://oxrml.com/measuring-what-matters/">Measuring What Matters: Construct Validity in LLM Benchmarks</a></strong> &#8212; NeurIPS paper shows only a fraction of evals follow best practices | <a 
href="https://x.com/bimedotcom/status/1998043927539073389">X Thread</a></p></li></ul><p><strong>Memory &amp; Long-Context</strong></p><ul><li><p><strong><a href="https://aclanthology.org/2025.acl-long.183/">LongBench v2: o1-preview at 57.7% Beats Human Baseline</a></strong> &#8212; Deeper understanding of realistic long-context multitasks; ACL paper | <a href="https://x.com/satoshihirose/status/1998273758222815313">X Thread</a></p></li><li><p><strong><a href="https://evaluations.metr.org/gpt-5-report/">METR Chart: Human Continual Learning &gt;&gt; LLM</a></strong> &#8212; 16-hour eval shows LLMs plateau fast while humans show no asymptote; implications for agent training | <a href="https://x.com/1a3orn/status/1997056050403725373">X Thread</a></p></li></ul><p><strong>Novel Eval Approaches</strong></p><ul><li><p><strong><a href="https://inference.labs/truthtensor">TruthTensor: Streaming Market Data for AI Evals</a></strong> &#8212; Inference Labs deprecates static benches; probability calibration on Polymarket for real-time accuracy | <a href="https://x.com/inferencelabs/status/1998157524516606294">X Thread</a></p></li><li><p><strong><a href="https://gensyn.ai/delphi">Gensyn&#8217;s Prediction Market for Model Intelligence</a></strong> &#8212; LMSR on-chain pricing; market belief over static benchmarks as evaluation signal | <a href="https://x.com/Leonis237/status/1998936091949412542">X Thread</a></p></li><li><p><strong><a href="https://x.com/datalabto/status/1998444781244928239">Datalab: Benchmark via ELO Scale + LLM-as-Judge</a></strong> &#8212; Pairwise matchups on H100S; browse raw outputs per sample/model for transparency | <a href="https://x.com/datalabto/status/1998444781244928239">X Thread</a></p></li></ul><p><strong>Safety &amp; Security</strong></p><ul><li><p><strong><a href="https://red.anthropic.com/2025/smart-contracts/">Anthropic: AI Agents Exploit $4.6M in Smart Contracts</a></strong> &#8212; Frontier Red Team benchmark reveals blockchain 
vulnerabilities; new eval suite for security | <a href="https://x.com/AnthropicAI/status/1995631802032287779">X Thread</a></p></li><li><p><strong><a href="https://speechmap.ai/">Mistral Large 3: 98.1% on SpeechMap Edging Grok 4</a></strong> &#8212; New high for controversial speech handling evaluation | <a href="https://x.com/xlr8harder/status/1996824485102866604">X Thread</a></p></li></ul><p><strong>Advanced Research</strong></p><ul><li><p><strong><a href="https://arxiv.org/abs/2512.06065">Native Parallel Reasoner: Self-Distilled RL for Parallel Reasoning</a></strong> &#8212; Qwen3-4B gains 24.5% performance with 4.6x speed; 100% genuine parallelism | <a href="https://x.com/AINativeF/status/1998557170548547664">X Thread</a></p></li><li><p><strong><a href="https://arxiv.org/abs/2512.05765">The Missing Layer of AGI: From Pattern Alchemy to Coordination Physics</a></strong> &#8212; Stanford argues LLMs need anchoring/oversight/memory layer; debate on coordination mechanisms | <a href="https://x.com/rohanpaul_ai/status/1998339087045120004">X Thread</a></p></li></ul><p><strong>Model Releases with Eval Focus</strong></p><ul><li><p><strong><a href="https://qwen.ai/blog?id=qwen3-omni-flash-20251201">Qwen3-Omni-Flash: Enhanced Multi-Turn Video/Audio</a></strong> &#8212; 119-lang text/19-speech support; personality customisation for evaluation diversity | <a href="https://x.com/Alibaba_Qwen/status/1998776328586477672">X Thread</a></p></li><li><p><strong><a href="https://huggingface.co/motif-technologies/Motif-2-12.7B-Reasoning">Motif-2-12.7B-Reasoning: Korea&#8217;s Top Open Model at 45 Intelligence Score</a></strong> &#8212; 80% AIME math/57% IFBench; punches above weight on agentic tasks | <a href="https://x.com/ArtificialAnlys/status/1998570291086373081">X Thread</a></p></li><li><p><strong><a href="https://huggingface.co/collections/zai-org/glm-46v">GLM-4.6V: ZAI&#8217;s VLM Rebuilds UIs from Screenshots</a></strong> &#8212; 106B/9B-Flash with 128K multimodal context + 
native function calling | <a href="https://x.com/adinayakup/status/1997999891704901873">X Thread</a></p></li></ul><div><hr></div><h2><strong>What We&#8217;re Hacking This Week</strong></h2><p>At Recall, we&#8217;re always pushing AI development forward internally and externally. Here are the most interesting things that happened this week.</p><p><strong>Agent Deployer: Spinning Up Identical Agents at Scale</strong></p><ul><li><p>Built a prototype to deploy 5-10 &#8220;identical&#8221; agents (same prompts, tools, timing) while only swapping the base model</p></li></ul><p><strong>Branding Agent: Design as Agent-Native Workflow</strong></p><ul><li><p>Turned &#8220;make me a header for X&#8221; into an agent workflow instead of an hour in Figma</p></li><li><p>Wraps Nano Banana with Gemini 3 Pro plus brand asset catalogue</p></li><li><p>Context engineering keeps outputs on-brand across styles (3D, illustration)</p></li></ul><p><strong>Single-Token Trading Agents: Better Primitive</strong></p><ul><li><p>Built a &#8220;Bitcoin top/bottom&#8221; style agent running every 15 minutes, trading only BTC vs USDC with technical analysis</p></li><li><p>Key learning: multi-asset agents blow their context budget trying to model everything</p></li><li><p>Single-asset specialists are cleaner, same harness spins into 20 agents in a day (GPT-5: BTC, Claude: VIRTUALS, etc)</p></li><li><p>Added in-context reflection where agents critique their own calls and update heuristics</p></li></ul><p><strong>Atlas 2.0: Proactive Context Engineer</strong></p><ul><li><p>Evolved from a reactive tool to a proactive context curator for AI coding agents</p></li><li><p>Stack of PRs ready, Claude-reviewed and signed off</p></li><li><p>Goal: better context engineering so AI-assisted development workflows actually compound learning instead of starting fresh each time</p></li></ul><p><strong>Aerodrome Analytics Dashboard with AI Report Generation</strong></p><ul><li><p>Live dashboard that tracks performance 
across multiple metrics to auto-generate content from competition data</p></li></ul>]]></content:encoded></item><item><title><![CDATA[In the Arena: Week 1 ]]></title><description><![CDATA[The "smartest" AI today can't predict an NFL football game, but it can hack DeFi, destroy Putnam records, and beat every human engineer.
Something's not adding up.]]></description><link>https://newsletter.recall.network/p/in-the-arena-week-1</link><guid isPermaLink="false">https://newsletter.recall.network/p/in-the-arena-week-1</guid><dc:creator><![CDATA[Sanket]]></dc:creator><pubDate>Fri, 05 Dec 2025 22:52:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fYNW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fYNW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fYNW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!fYNW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!fYNW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!fYNW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fYNW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png" width="1456" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1631033,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://recallnet.substack.com/i/180756057?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fYNW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 424w, https://substackcdn.com/image/fetch/$s_!fYNW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 848w, https://substackcdn.com/image/fetch/$s_!fYNW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!fYNW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b03274e-e175-43b8-a504-178d193c060a_2000x1053.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>An AI evaluation crisis? NFL Predictions reveal shortcomings.</h2><p><strong>The AI evaluation and benchmarking crisis isn&#8217;t theoretical anymore.</strong> When the answer to which model is &#8220;best&#8221; depends on which benchmark you check and what day you measure, static evals have failed. We need to move from static benchmarks to live arenas.</p><p>This Thanksgiving, Recall ran our first live predictive reasoning arena.
Six leading AI models &#8211; Claude Opus 4.5, Gemini 3 Pro, GPT-5.1, DeepSeek v3.2, Qwen3 Max, Grok 4.1 &#8211; tried to predict the outcomes of three NFL football games from kickoff through final whistle. Predictions were time-weighted to reward early confidence over late game certainty.</p><p><strong>The results were unanimous. And wrong.</strong></p><p>$2B+ was bet on Thanksgiving NFL games. The best human sports bettors achieved a 55% success rate against the spread.
<strong>All models got their pre-game predictions wrong.</strong> So no, LLMs can&#8217;t yet play moneyball better than Vegas.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_zR-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_zR-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_zR-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_zR-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_zR-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_zR-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg" width="1199" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1199,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;post image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="post image" title="post image" srcset="https://substackcdn.com/image/fetch/$s_!_zR-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_zR-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_zR-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_zR-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4406ccf9-cda7-48d3-95dc-1ba4a2af8c0c_1199x630.jpeg 1456w" sizes="100vw"></picture></div></a></figure></div><p>Every model predicted the Vegas favorite in every game. They just followed the money. Meanwhile, all three underdogs won.</p><ul><li><p><strong>KC @ DAL:</strong> </p><ul><li><p>All models picked KC Chiefs (64-68% confidence)</p></li><li><p>DAL Cowboys won 31-28</p></li></ul></li><li><p><strong>CIN @ BAL:</strong> </p><ul><li><p>All models picked BAL Ravens (72-78% confidence&#8212;highest certainty)</p></li><li><p>The CIN Bengals won 32-14 in a blowout</p></li></ul></li><li><p><strong>GB @ DET:</strong> </p><ul><li><p>All models picked DET Lions (62-65% confidence)</p></li><li><p>The GB Packers won 31-24</p></li></ul></li></ul><p>Claude Opus 4.5 topped our leaderboard with 0.651 on time-weighted confidence scoring. This is the same model scoring 80.9% on SWE-bench Verified.</p><p>If you&#8217;re deploying AI agents for trading, forecasting, or strategic planning, static benchmarks tell you almost nothing about reliability under real uncertainty.
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.recall.network/nfl-arena-can-llms-predict-football-outcomes&quot;,&quot;text&quot;:&quot;NFL Arena Results&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.recall.network/nfl-arena-can-llms-predict-football-outcomes"><span>NFL Arena Results</span></a></p><div><hr></div><h2>In the News</h2><h4><strong><a href="https://www.anthropic.com/news/claude-opus-4-5">Anthropic&#8217;s Claude Opus 4.5 Beats Every Human Engineer&#8212;Then Gets Caught Hacking DeFi</a></strong></h4><p>Hit 80.9% on SWE-bench Verified while cutting prices by 67%. Outscored every human on Anthropic&#8217;s internal engineering exam. Users report that it maintains coherence across 11 parallel projects without breaking.</p><p>Anthropic also published research revealing it can autonomously exploit 17 of 34 recent smart contract hacks, stealing $4.5M in simulated funds.</p><p>Its Tool Search Tool cut context bloat by 85%, lifting overall model accuracy from 72% to 90%.</p><h4><strong><a href="https://artificialanalysis.ai/articles/gemini-3-pro-everything-you-need-to-know">Google&#8217;s Gemini 3 Pro Tops Every Benchmark While Hallucinating 88% of Errors</a></strong></h4><p>Swept LMArena (1501 Elo), hit 91.9% on GPQA Diamond, and dominated across Text, Vision, Coding, and Math. Scored 53% on AA-Omniscience, a 14-point jump.</p><p>The paradox: an 88% hallucination rate when it errs, answering confidently instead of admitting it doesn&#8217;t know. Users report it&#8217;s &#8220;evaluation-paranoid&#8221;, constantly questioning whether it&#8217;s really 2025 and inventing options despite explicit instructions.</p><p>It burns 17.8% of tokens on rambling, 8x worse than Opus. 
Deep Think variant costs 4.5x more for 45.1% on ARC-AGI-2.</p><h4><strong><a href="https://evaluations.metr.org/gpt-5-1-codex-max-report/">OpenAI&#8217;s GPT-5.1 Held the Coding Crown for THREE Days</a></strong></h4><p>Hit 77.9% on SWE-bench Verified before Opus 4.5 crushed it.</p><h4><strong><a href="https://www.marktechpost.com/2025/11/28/deepseek-ai-releases-deepseekmath-v2-the-open-weights-maths-model-that-scored-118-120-on-putnam-2024/">DeepSeek&#8217;s Open Model Matched OpenAI at Math Olympiads&#8212;Then Crushed Best Human Score</a></strong></h4><p>DeepSeekMath-V2 achieved IMO 2025 gold (5 of 6 problems), matching Google and OpenAI. Scored 118/120 on Putnam, destroying the best human score of 90. Hit 99.0% on IMO-ProofBench versus Gemini&#8217;s 89.0%.</p><p>Completely open-source, trained on inferior hardware. V3.2 matches GPT-5 performance despite chip restrictions. Proves frontier AI isn&#8217;t Western-exclusive anymore.</p><div><hr></div><h2>AI Research &amp; Rabbitholes</h2><ol><li><p><strong>Yupp SVG Leaderboard: Gemini 3 Pro tops coherent vector generation</strong><br><a href="https://x.com/lintool/status/1996696157985398812">https://x.com/lintool/status/1996696157985398812<br></a>Leaderboard: <a href="https://yupp.ai/leaderboard/svg">https://yupp.ai/leaderboard/svg</a></p></li><li><p><strong>Gemini 3 Deep Think rolls out to Ultra subscribers</strong><br><a href="https://x.com/_philschmid/status/1996659997774958859">https://x.com/_philschmid/status/1996659997774958859</a></p></li><li><p><strong>AutoCodeBench-V2: Claude 4.5 Opus dominates refreshed coding eval</strong><br><a href="https://x.com/scaling01/status/1996595348916068626">https://x.com/scaling01/status/1996595348916068626</a></p></li><li><p><strong>Mistral Large 3 debuts as #1 open-source coding model on Arena</strong><br>Mistral AI teases more coding details soon; tops OSS leaderboard for programming tasks.<br><a 
href="https://x.com/MistralAI/status/1996580307336638951">https://x.com/MistralAI/status/1996580307336638951</a><br>Model: <a href="https://huggingface.co/mistralai/Mistral-Large-3-2512">https://huggingface.co/mistralai/Mistral-Large-3-2512</a> </p></li><li><p><strong>Seedream 4.5 jumps to #3 on image-editing leaderboard</strong><br><a href="https://x.com/arena/status/1996641968005566876">https://x.com/arena/status/1996641968005566876</a></p></li><li><p><strong>RWKV-7 G0b 13.3B pure RNN beats Qwen3 14B without eval-maxxing</strong><br><a href="https://x.com/BlinkDL_AI/status/1996556628376850541">https://x.com/BlinkDL_AI/status/1996556628376850541</a><br>Model: <a href="https://huggingface.co/BlinkDL/rwkv-7-g0b">https://huggingface.co/BlinkDL/rwkv-7-g0b</a></p></li><li><p><strong>DeepSeek-V3.2 Thinking hits 70.6% AUC on 2-needle long-context</strong><br><a href="https://x.com/DillonUzar/status/1996358865060073913">https://x.com/DillonUzar/status/1996358865060073913</a></p></li><li><p><strong>OneThinker-8B tops Qwen3-VL on 31 image/video benchmarks</strong><br><a href="https://x.com/rohanpaul_ai/status/1996415663104270701">https://x.com/rohanpaul_ai/status/1996415663104270701</a><br>Paper: <a href="https://arxiv.org/abs/2512.03043">https://arxiv.org/abs/2512.03043</a></p></li><li><p><strong>Uni-MoE 2.0 tops Qwen2.5 Omni on video &amp; speech</strong><br><a href="https://x.com/jiqizhixin/status/1996393265743249832">https://x.com/jiqizhixin/status/1996393265743249832</a></p></li><li><p><strong>CORE-Bench solved: Opus 4.5 + Claude Code reaches 95%</strong><br><a href="https://x.com/sayashk/status/1996334941832089732">https://x.com/sayashk/status/1996334941832089732</a></p></li><li><p><strong>INTELLECT-3 106B MoE takes #1 on Arena math/code</strong><br><a href="https://x.com/arena/status/1996324769013391839">https://x.com/arena/status/1996324769013391839</a></p></li><li><p><strong>Nvidia Orchestrator-8B beats GPT-5 on HLE at 2.5&#215; efficiency</strong><br><a 
href="https://x.com/HuggingPapers/status/1996310259695079570">https://x.com/HuggingPapers/status/1996310259695079570</a><br>Model: <a href="https://huggingface.co/nvidia/Orchestrator-8B">https://huggingface.co/nvidia/Orchestrator-8B</a></p></li><li><p><strong>Vending-Bench Arena: Opus 4.5 wins multi-agent competition</strong><br><a href="https://x.com/andonlabs/status/1996268508926386422">https://x.com/andonlabs/status/1996268508926386422</a></p></li><li><p><strong>Rosetta Stone for Benchmarks: Epoch AI unifies 100+ evals</strong><br><a href="https://x.com/EpochAIResearch/status/1996248575400132794">https://x.com/EpochAIResearch/status/1996248575400132794</a><br>Blog: <a href="https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks">https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks</a></p></li><li><p><strong>Thinking Algorithm Leaderboard launches &#8211; GPT-OSS 120B #1</strong><br><a href="https://x.com/cooper_nyc_/status/1995988121385603467">https://x.com/cooper_nyc_/status/1995988121385603467</a><br>Leaderboard: <a href="https://leaderboard.neurometric.ai">https://leaderboard.neurometric.ai</a></p></li><li><p><strong>VisualPuzzles: Gemini 3 Pro only 52.7% on knowledge-free logic</strong><br><a href="https://x.com/yueqi_song/status/1912510869491101732?s=20">https://x.com/yueqi_song/status/1995499844992127276</a><br>Project: <a href="https://visualpuzzles.github.io">https://visualpuzzles.github.io</a></p></li><li><p><strong>RefineBench: LLMs gain just 1.7% self-refining hard essays</strong><br><a href="https://x.com/seungonekim/status/1995383422806630863">https://x.com/seungonekim/status/1995383422806630863</a><br>Site: <a href="https://passing2961.github.io/refinebench-page">https://passing2961.github.io/refinebench-page</a></p></li><li><p><strong>Structured Prompting + HELM flips benchmark rankings by 4%</strong><br>Paper: <a href="https://arxiv.org/abs/2511.20836">https://arxiv.org/abs/2511.20836</a></p></li><li><p><strong>ALIGNEVAL: Judging skill 0.96 
correlation with generator quality</strong><br><a href="https://x.com/carmelkron/status/1994527325023867233">https://x.com/carmelkron/status/1994527325023867233</a><br>Paper: <a href="https://arxiv.org/abs/2511.16043">https://arxiv.org/abs/2511.16043</a></p></li><li><p><strong>Nvidia Orchestrator-8B quietly tops HLE (original post)</strong><br><a href="https://x.com/NielsRogge/status/1994419877927404017">https://x.com/NielsRogge/status/1994419877927404017</a></p></li><li><p><strong>DeepSeekMath-V2 verifier loop beats Gemini DeepThink on IMO</strong><br><a href="https://x.com/mervenoyann/status/1994015342188757333">https://x.com/mervenoyann/status/1994015342188757333</a><br>Model: <a href="https://huggingface.co/deepseek-ai/DeepSeek-Math-V2">https://huggingface.co/deepseek-ai/DeepSeek-Math-V2</a></p></li><li><p><strong>PaperTalker turns papers into full narrated presentation videos</strong><br><a href="https://x.com/ChrisLaubAI/status/1993969771784921384">https://x.com/ChrisLaubAI/status/1993969771784921384</a><br>GitHub: <a href="https://github.com/showlab/Paper2Video">https://github.com/showlab/Paper2Video</a></p></li><li><p><strong>DeepWriter (team of 3) hits 50.91% on Humanity&#8217;s Last Exam</strong><br><a href="https://x.com/DeepwriterAI/status/1993803648900755585">https://x.com/DeepwriterAI/status/1993803648900755585</a></p></li><li><p><strong>AI voting sim: Grok-4 closest to real election results</strong><br><a href="https://x.com/RaphaelDabadie/status/1993748654675345600">https://x.com/RaphaelDabadie/status/1993748654675345600</a></p></li><li><p><strong>Distilling Efficient Reasoning: 12B from GPT-OSS cuts tokens 75%</strong><br><a href="https://x.com/omarsar0/status/1993695515595444366">https://x.com/omarsar0/status/1993695515595444366</a><br>Paper: <a href="https://arxiv.org/abs/2511.19333">https://arxiv.org/abs/2511.19333</a></p></li><li><p><strong>Grok 4.1 Fast takes #1 on Python coding leaderboard</strong><br><a 
href="https://x.com/cb_doge/status/1993697429124956471">https://x.com/cb_doge/status/1993697429124956471</a></p></li><li><p><strong>Opus 4.5 tops Elicit research QA at 96.5%</strong><br><a href="https://x.com/stuhlmueller/status/1993476570754040173">https://x.com/stuhlmueller/status/1993476570754040173</a></p></li><li><p><strong>Opus 4.5 only 21% on FrontierMath &#8211; trails GPT-5.1 &amp; Gemini 3 Pro</strong><br><a href="https://x.com/EpochAIResearch/status/1993431031765250119">https://x.com/EpochAIResearch/status/1993431031765250119</a></p></li><li><p><strong>PeptoneBench &amp; PepTron for disordered proteins</strong><br><a href="https://x.com/NVIDIAHealth/status/1993386709153636643">https://x.com/NVIDIAHealth/status/1993386709153636643</a><br>GitHub: <a href="https://github.com/peptoneltd/peptonebench">https://github.com/peptoneltd/peptonebench</a></p></li><li><p><strong>Gemini 3 Pro 93% GPQA Diamond via organic chemistry surge</strong><br><a href="https://x.com/EpochAIResearch/status/1993363375108333616">https://x.com/EpochAIResearch/status/1993363375108333616</a></p></li><li><p><strong>Agent0: Zero-data self-evolving agents +24% on reasoning</strong><br><a href="https://x.com/rryssf_/status/1992889473911378039">https://x.com/rryssf_/status/1992889473911378039</a><br>Paper: <a href="https://arxiv.org/abs/2511.14460">https://arxiv.org/abs/2511.14460</a></p></li><li><p><strong>Agentic Reviewer matches human ICLR feedback correlation 0.42</strong><br><a href="https://x.com/AndrewYNg/status/1993001922773893273">https://x.com/AndrewYNg/status/1993001922773893273</a></p></li><li><p><strong>Elon: Grok 5 vs top LoL pros in 2026 with human constraints</strong><br><a href="https://x.com/elonmusk/status/1993208505486979327">https://x.com/elonmusk/status/1993208505486979327</a></p></li><li><p><strong>Claude Opus 4.5 hits 80.9% SWE-Bench, beats GPT-5.1 &amp; Gemini 3 Pro</strong><br><a 
href="https://x.com/rohanpaul_ai/status/1993046494904217661">https://x.com/rohanpaul_ai/status/1993046494904217661</a></p></li><li><p><strong>ImagineArt 1.5 (indie) climbs to global #3 image gen</strong><br><a href="https://x.com/techbymarkandey/status/1992951100438368588">https://x.com/techbymarkandey/status/1992951100438368588</a></p></li><li><p><strong>Iterative Refinement: 7M-param RNN beats 671B LLMs on ARC-AGI</strong><br><a href="https://x.com/burkov/status/1992679461485994144">https://x.com/burkov/status/1992679461485994144</a></p></li><li><p><strong>CritPt physics benchmark: Gemini 3 Pro only 9.1% without tools</strong><br><a href="https://x.com/MinyangTian1/status/1991913292004995217">https://x.com/MinyangTian1/status/1991913292004995217</a><br>Paper: <a href="https://arxiv.org/abs/2509.26574">https://arxiv.org/abs/2509.26574</a> </p><p>GitHub: <a href="https://github.com/CritPt-Benchmark/CritPt">https://github.com/CritPt-Benchmark/CritPt</a></p></li></ol><div><hr></div><h2>What We&#8217;re Hacking This Week</h2><p>At Recall, we&#8217;re always pushing AI development forward internally and externally. 
Here are the most interesting things that happened this week.</p><p><strong>We built Atlas, an internal AI Chief of Staff, to summarize everything and keep our team in sync.</strong></p><ul><li><p>Automated daily summaries from GitHub, Linear, Notion, Slack, etc.</p></li><li><p>Extracts structured requirements from notes and meetings</p></li><li><p>Partnership research with automated synergy analysis</p></li><li><p>Real-time web search via Perplexity integration</p></li><li><p>Automated KPI dashboard and metrics tracking</p></li></ul><p><strong>Our engineering team automated code reviews with AI agents</strong></p><ul><li><p>Multiple AI personas cross-checking each other&#8217;s work</p></li><li><p>Automated code generation through a pipeline system</p></li><li><p>Autonomously reviews every PR</p></li><li><p>Successfully identified and fixed the first production bug autonomously</p></li></ul><p><strong>We built a portfolio of trading agents with multiple strategies to simulate trading for our upcoming arena</strong></p><ul><li><p>Crypto trading agents on the Aerodrome DEX</p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.recall.network/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading In The Arena! Keep reading more on AI models and their skills.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item></channel></rss>