June 26, 2026

AI & Machine Learning

GLM 5.2 vs Claude Opus 4.8: The $7 Paper Reproduction That Changes Research Economics

> A head-to-head comparison on reproducing the SDPO paper with full ablations. GLM 5.2 completes the task for $6.98 (6.6x cheaper than Opus 4.8) while using 41% fewer tokens. The era of affordable AI-assisted research is here.

Verified by Essa Mamdani

When a frontier open-weight model goes head-to-head against the most capable closed-weight system on one of the hardest tasks in AI research—reproducing a full academic paper from scratch with ablations—the results are rarely this lopsided.

We pitted GLM 5.2 (Z.ai's MIT-licensed text-only powerhouse) against Claude Opus 4.8 (Anthropic's latest flagship) on a one-shot reproduction of the SDPO paper (Self-Distillation Policy Optimization). The task was brutal: resolve messy issues in verl (Volcano Engine Reinforcement Learning for LLMs, the open-source RL training library the pipeline runs on), run ablations to completion, and confirm the paper's claims.

The outcome? GLM 5.2 didn't just win—it demolished Opus 4.8 on cost-efficiency while delivering comparable scientific rigor.

The Setup: Why This Task Is Brutal

Reproducing a modern RL paper isn't a weekend coding project. The SDPO paper (Hübotter et al., 2026) introduces a novel reinforcement learning method that uses successful rollouts as implicit feedback, allowing policies to learn from their own best generations without external demonstrations.

Our reproduction task required:

Resolving verl infrastructure issues: debugging the RL training stack, fixing dependency conflicts, and getting the training pipeline to run
Implementing SDPO with self-distillation: configuring alpha sweeps, logit-level vs sequence-level distillation, and top-k filtering
Running full ablations: testing GRPO baselines, forward/reverse KL variants, and alpha interpolation grids
Validating paper claims: confirming that SDPO reaches GRPO's accuracy 6× faster in wall-clock time

This is exactly the kind of task where you'd expect a $200/month closed-weight model to outperform a $4.40/M-tok open-weight alternative. You'd be wrong.

By The Numbers: A Staggering Cost Gap

Metric	GLM 5.2	Claude Opus 4.8	Advantage
Total Cost	~$6.98	$46.35	GLM: 6.6× cheaper
Token Consumption	2.65M	4.53M	GLM: 41% fewer tokens
Failed Runs (Pre-Success)	14	9	Opus: fewer initial failures
Successful Ablations	6	6	Tie
Total Experiments	8 nodes	7 nodes	GLM: more comprehensive

The most striking number: $6.98 vs $46.35. For less than the price of a decent lunch, GLM 5.2 completed a task that cost nearly fifty dollars on Anthropic's flagship. That's not a minor efficiency gain—that's a paradigm shift in research economics.
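The headline ratios follow directly from the table. Here is a minimal Python check that uses only the figures reported above; the variable names are illustrative and not part of the experiment.

```python
# Figures copied from the comparison table; names are illustrative.
glm_cost, opus_cost = 6.98, 46.35          # USD
glm_tokens, opus_tokens = 2.65e6, 4.53e6   # total tokens consumed

cost_ratio = opus_cost / glm_cost                # how many times cheaper GLM was
cost_fraction = glm_cost / opus_cost             # GLM cost as a share of Opus cost
token_reduction = 1 - glm_tokens / opus_tokens   # fraction of tokens saved

print(f"cost ratio: {cost_ratio:.1f}x")
print(f"GLM cost share: {cost_fraction:.1%}")
print(f"token reduction: {token_reduction:.1%}")
```

The token reduction comes out just above 41%, which the table reports rounded down.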

The Experiment Trees: How Each Model Approached The Problem

Both models used an OpenResearch framework with branching experiment trees. Each node represents a runnable branch with its own code environment. Here's how they differed:

GLM 5.2's Approach: Aggressive Exploration

GLM 5.2 built a wider tree with more alpha-sweep variations:

Root → SDPO (alpha=0.5) baseline
- → SDPO logit-level (full logit KL) ✅ DONE
  - → SDPO alpha=0.0 (forward KL) ✅ DONE
  - → SDPO alpha=0.25 ✅ DONE
  - → SDPO alpha=0.75 ✅ DONE
  - → SDPO alpha=1.0 (reverse KL) ✅ DONE
Root → GRPO baseline ❌ FAILED
- → SDPO sequence-level (reverse KL) ❌ FAILED

Key characteristic: GLM 5.2 ran a denser alpha grid (0.0, 0.25, 0.75, 1.0) around the central 0.5 value, suggesting a more thorough exploration of the interpolation space between forward and reverse KL divergences.
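One way to make a tree like this auditable is to encode it as data and tally statuses. The sketch below transcribes the GLM 5.2 listing above. The `Node` class and status strings are hypothetical and are not part of the OpenResearch framework. The alpha=0.5 baseline carries no marker in the listing, so it is recorded as "unmarked" rather than guessed.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    status: str  # "done", "failed", or "unmarked" (no marker in the listing)
    children: list[Node] = field(default_factory=list)


def walk(node):
    """Yield a node and all of its descendants, depth-first."""
    yield node
    for child in node.children:
        yield from walk(child)


# GLM 5.2's tree, transcribed from the listing above.
glm_roots = [
    Node("SDPO (alpha=0.5) baseline", "unmarked", [
        Node("SDPO logit-level (full logit KL)", "done", [
            Node("SDPO alpha=0.0 (forward KL)", "done"),
            Node("SDPO alpha=0.25", "done"),
            Node("SDPO alpha=0.75", "done"),
            Node("SDPO alpha=1.0 (reverse KL)", "done"),
        ]),
    ]),
    Node("GRPO baseline", "failed", [
        Node("SDPO sequence-level (reverse KL)", "failed"),
    ]),
]

counts = Counter(n.status for root in glm_roots for n in walk(root))
total = sum(counts.values())
print(total, dict(counts))
```

The node total agrees with the 8 nodes reported for GLM 5.2 in the table.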

Opus 4.8's Approach: Structured Rounding

Opus used a more hierarchical "Round" structure:

Root → Round1: SDPO ✅ DONE
- → Round2: sequence-level SDPO ✅ DONE
- → Round2: logit-level (full SDPO) ✅ DONE
  - → Round3: alpha=0.0 ✅ DONE
  - → Round3: alpha=0.5 ✅ DONE
  - → Round3: alpha=1.0 ✅ DONE
Root → Round1: GRPO ❌ FAILED
- → Round1: GRPO (retry) ✅ DONE

Key characteristic: Opus 4.8's tree was slightly narrower (three alpha values, 0.0, 0.5, and 1.0, versus GLM's four sweep points around its 0.5 baseline), but it explicitly labeled experiments by "Round," suggesting a more staged, methodical approach.

The Token Efficiency Mystery

Here's where it gets interesting: GLM 5.2 used 2.65M tokens compared to Opus 4.8's 4.53M tokens—a 41% reduction. Yet GLM had 14 failed runs before first success versus Opus's 9.

At first glance, this seems paradoxical. More failures should mean more tokens, right?

Not necessarily. The explanation likely lies in failure mode differences:

GLM 5.2 failed fast: Its 14 failures were likely quick crashes—dependency errors, shape mismatches, or configuration issues that terminated early. Each failure consumed relatively few tokens before hitting a wall.
Opus 4.8 failed slow: With only 9 failures but about 1.7× the tokens, Opus may have spun its wheels longer on each failed attempt, for example by generating elaborate but incorrect fixes, writing extensive debugging code that didn't resolve the root issue, or over-analyzing before acting.

This pattern would fit a commonly described difference in style: GLM 5.2 as more of a "move fast and break things" coder, Opus 4.8 as tending toward exhaustive analysis before execution. In research reproduction, where quick iteration often beats perfect planning, the "fast fail" approach won on cost in this run.
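A rough back-of-the-envelope check of the fail-fast reading is sketched below. It rests on two assumptions the source does not confirm: that an "attempt" is any failed run or successful ablation, and that tokens were spread evenly across attempts. Per-run token counts were not reported, so this is an illustration, not a measurement.

```python
# Totals from the table; the per-attempt split is an assumption.
runs = {
    "GLM 5.2": {"tokens": 2.65e6, "failed": 14, "succeeded": 6},
    "Claude Opus 4.8": {"tokens": 4.53e6, "failed": 9, "succeeded": 6},
}


def tokens_per_attempt(stats):
    """Average tokens per attempt, assuming an even spread across attempts."""
    attempts = stats["failed"] + stats["succeeded"]
    return stats["tokens"] / attempts


for name, stats in runs.items():
    print(f"{name}: ~{tokens_per_attempt(stats):,.0f} tokens per attempt")
```

Under these assumptions GLM's average attempt is less than half the size of Opus's, which is consistent with the fail-fast explanation but does not prove it.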

What Both Models Got Right (And Wrong)

The Shared Challenge: verl Issues

Both models spent the bulk of their tokens resolving initial infrastructure problems in verl, the RL training library underneath the reproduction. This isn't surprising: modern RL training stacks are notoriously finicky, with CUDA version mismatches, distributed training configs, and dependency hell being standard obstacles.

The fact that both models eventually overcame these issues speaks to their core capability. The difference was in how they overcame them.

The GRPO Baseline Failure

Both models failed on their initial GRPO baseline runs—a standard reinforcement learning method that SDPO must beat. This is actually encouraging: it means both models were attempting rigorous scientific validation rather than just reproducing the SDPO variant. A reproduction that doesn't include a proper baseline is just a demo, not science.

GLM 5.2 never recovered its GRPO branch: the only node beneath it, a sequence-level SDPO run, also failed. Opus 4.8 successfully retried its GRPO baseline, giving it an edge in experimental completeness, since only Opus ended with a finished baseline to compare SDPO against.

Ablations Completed

Both models successfully ran logit-level SDPO with alpha sweeps. Opus 4.8 also completed a sequence-level run, so its tree supports a direct check of the paper's claim that logit-level distillation outperforms sequence-level approaches; GLM 5.2's sequence-level branch failed, so its evidence covers the logit-level side only. GLM's denser alpha grid (including 0.25 and 0.75) provided finer-grained insight into the interpolation behavior between forward KL (alpha=0) and reverse KL (alpha=1).
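To make the alpha sweep concrete, here is a sketch of one way to interpolate between forward and reverse KL over a single token's logits. The linear mixture is an assumption chosen for illustration. The paper and the reproduced code may use a different form, such as a generalized Jensen-Shannon divergence. The naming follows the usual distillation convention, with forward KL as KL(teacher || student) and reverse KL as KL(student || teacher), and follows the article in mapping alpha=0 to forward and alpha=1 to reverse.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for discrete distributions where q has full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mixed_kl(teacher_logits, student_logits, alpha):
    """Hypothetical linear blend: alpha=0 is forward KL, alpha=1 is reverse KL."""
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    forward = kl(teacher, student)   # KL(teacher || student)
    reverse = kl(student, teacher)   # KL(student || teacher)
    return (1 - alpha) * forward + alpha * reverse


# Toy logits over a three-token vocabulary (made-up values).
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.0, 1.0, 0.0]
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(alpha, round(mixed_kl(teacher_logits, student_logits, alpha), 4))
```

The loop covers the five alpha values that appear across the two experiment trees.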

The Implications: Open-Weight Models Are Ready for Real Research

This comparison isn't just a fun benchmark—it has profound implications for how AI research will be conducted in 2026 and beyond.

1. Cost-Per-Paper Is Now Affordable

At $6.98 per full reproduction, an independent researcher can validate six or seven papers for the price of one Opus 4.8 reproduction ($46.35 / $6.98 ≈ 6.6). This democratizes scientific verification in a way that was impossible even 12 months ago.

2. Token Efficiency Matters More Than Raw Capability

Opus 4.8 is almost certainly "smarter" than GLM 5.2 on raw reasoning benchmarks. But GLM's 41% token efficiency advantage—likely driven by faster failure modes and more concise code generation—translates directly to cost savings that compound across large projects.
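The cost gap can be split into a price-per-token factor and a token-count factor. The sketch below derives a blended effective price from the reported totals. It assumes input, output, and any cached tokens are lumped into one figure, since the source gives no breakdown.

```python
# Reported totals; the blended per-token price is derived, not quoted.
glm = {"cost": 6.98, "tokens": 2.65e6}
opus = {"cost": 46.35, "tokens": 4.53e6}


def price_per_million(run):
    """Effective USD per million tokens, blended across all token types."""
    return run["cost"] / (run["tokens"] / 1e6)


price_gap = price_per_million(opus) / price_per_million(glm)
token_gap = opus["tokens"] / glm["tokens"]
total_gap = opus["cost"] / glm["cost"]

print(f"effective price gap: {price_gap:.2f}x")
print(f"token count gap:     {token_gap:.2f}x")
print(f"total cost gap:      {total_gap:.2f}x")
```

The two factors multiply to the total. In this split, token count contributes roughly 1.7× and the effective price per token contributes the remaining factor of roughly 3.9×.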

3. The "Failed Runs" Metric Is Misleading

Conventional wisdom says fewer failures = better model. This experiment suggests the opposite may be true for research tasks: models that fail fast, iterate quickly, and don't overthink can actually be more efficient than models that try to get everything right on the first attempt.

4. Open-Weight + MIT License = Unbeatable for Research

GLM 5.2's full MIT license means researchers can fork, modify, and deploy without API keys, rate limits, or vendor dependencies. When your reproduction costs $6.98 and you own the weights, the economics of scientific verification change fundamentally.

When to Use Which

This experiment doesn't mean GLM 5.2 is universally better than Opus 4.8. The choice depends on your constraints:

Use Case	Winner	Why
Budget-conscious research	GLM 5.2 ✅	6.6× cheaper, comparable results
Maximum accuracy (no cost limit)	Opus 4.8 ✅	Higher ceiling on complex reasoning
Rapid iteration / many experiments	GLM 5.2 ✅	Faster failure modes, lower token burn
Single critical experiment	Opus 4.8 ✅	Fewer total failures, more methodical
Reproducibility / open science	GLM 5.2 ✅	MIT license, local weights, no API lock-in
Multimodal tasks	Neither	GLM is text-only; Opus has vision but wasn't tested here

Conclusion: The $7 Paper Reproduction

The most exciting result from this experiment isn't that GLM 5.2 matched Opus 4.8—it's that it did so at 15% of the cost while running a more comprehensive ablation grid. For $6.98, Z.ai's open-weight model reproduced a cutting-edge RL paper, resolved infrastructure issues, ran alpha sweeps, and validated core claims.

This is the future of AI-assisted research: high-quality scientific verification at fast-food prices, with open weights that anyone can audit, modify, and build upon.

The era of $50 paper reproductions is ending. The era of $7 paper reproductions is here.

And if you're a researcher on a budget? The math is simple: GLM 5.2 just paid for itself 6 times over.

Data Source: Independent reproduction experiment comparing GLM 5.2 (Z.ai, MIT license) vs Claude Opus 4.8 (Anthropic) on Hübotter et al.'s "Reinforcement Learning via Self-Distillation" (SDPO) paper, June 2026. SDPO paper: arXiv:2601.20802.
