© 2025 ESSA MAMDANI

7 min read
AI & Machine Learning

Ornith-1.0: The Self-Scaffolding LLM That Teaches Itself to Code Better

> DeepReinforce releases Ornith-1.0, a family of open-source coding models (9B to 397B MoE) that learn to generate their own task-specific scaffolds during RL training. It reports state-of-the-art results among open-source models of comparable size on Terminal-Bench, SWE-Bench, and more.

Verified by Essa Mamdani

DeepReinforce has introduced Ornith-1.0, a self-improving family of open-source models designed for agentic coding tasks. The family comes in four variants (9B Dense, 31B Dense, 35B MoE, and 397B MoE), ranging from a compact 9B model suited to edge deployment to a 397B frontier-scale model optimized for maximum performance.

Built on top of pretrained Gemma 4 and Qwen 3.5, it achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks.

The Self-Scaffolding Revolution

The key innovation behind Ornith-1.0 is a self-improving training framework. Instead of relying on human-designed harnesses to drive solution generation in RL, Ornith-1.0 learns to generate both solution rollouts and the task-specific harnesses that guide those rollouts. By jointly optimizing the scaffold and the resulting solution, the model can discover better search trajectories and generate higher-quality solutions.

How It Works

Most coding agents rely on a scaffold (also called a harness) that wraps the model with memory, tools, error handling, and orchestration logic. AI teams usually hand-design one scaffold per task category.

Ornith-1.0 treats the scaffold as a learnable object instead. During reinforcement learning, the scaffold co-evolves with the model's policy. Each RL step runs in two stages:

  1. Scaffold Generation: The model reads the task and its previous scaffold, then proposes a refined scaffold.
  2. Solution Rollout: It uses that scaffold and the task to generate a solution rollout. Reward from the rollout flows back to both stages.

So the model is optimized to author orchestration, not just answers. Over training, higher-reward scaffolds are mutated and selected automatically. Per-task strategies emerge without hand-engineered harness design.
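
The two-stage step described above can be sketched as a toy loop. This is only an illustration of the control flow, not DeepReinforce's training code: the stub functions stand in for model calls, the `HINTS` list and all function names are hypothetical, and a greedy keep-if-not-worse rule stands in for the actual policy update.

```python
from dataclasses import dataclass

# Hypothetical orchestration hints a scaffold might accumulate.
HINTS = ["read the README", "run tests first", "retry on failure"]


@dataclass
class Step:
    scaffold: str
    solution: str
    reward: float


def propose_scaffold(task: str, previous: str, step_index: int) -> str:
    """Stage 1 (stub): refine the previous scaffold for this task."""
    hint = HINTS[step_index % len(HINTS)]
    return f"{previous}; {hint}" if previous else hint


def rollout(task: str, scaffold: str) -> str:
    """Stage 2 (stub): generate a solution under the given scaffold."""
    return f"solution for {task!r} guided by [{scaffold}]"


def verify(solution: str) -> float:
    """Toy verifier: reward 1.0 only if the scaffold ran tests first."""
    return 1.0 if "run tests first" in solution else 0.0


def rl_step(task: str, previous: str, step_index: int) -> Step:
    scaffold = propose_scaffold(task, previous, step_index)
    solution = rollout(task, scaffold)
    reward = verify(solution)
    # One scalar reward is credited to both the scaffold and the rollout.
    return Step(scaffold, solution, reward)


def train(task: str, steps: int) -> Step:
    best = Step(scaffold="", solution="", reward=0.0)
    for i in range(steps):
        candidate = rl_step(task, best.scaffold, i)
        # Greedy selection stands in for the real policy optimization.
        if candidate.reward >= best.reward:
            best = candidate
    return best
```

Calling `train("fix flaky test", steps=3)` walks through three refinements and returns the last scaffold that kept the reward from dropping. In real training, both stages would be sampled from the same model and updated from the shared reward.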

Guarding Against Reward Hacking

Letting a model write its own scaffold naturally invites reward hacking. A scaffold could read visible test files and hardcode expected outputs, or copy an oracle solution sitting in the environment. DeepReinforce describes three defense layers:

  1. Fixed Outer Trust Boundary: The environment, tool surface, and test isolation stay outside the model's reach. The model evolves only its inner policy scaffold.

  2. Deterministic Monitor: Flags banned actions like reading withheld paths or editing verification scripts. Such trajectories earn zero reward and are excluded from advantage computation.

  3. Frozen LLM Judge: Acts as a veto on top of the verifier, catching intent-level gaming that occurs within the allowed tool surface.
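
A minimal sketch of how the second and third layers might gate rewards, under stated assumptions: the banned path prefixes, the `Trajectory` shape, and the mean-baseline advantage are all hypothetical, and the source does not say whether judge-vetoed trajectories stay in the advantage computation (here they do, with zero reward).

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional

# Hypothetical withheld locations; the real monitor's rules are not published here.
BANNED_PREFIXES = ("/tests/hidden", "/oracle")


@dataclass
class Trajectory:
    actions: list[tuple[str, str]]  # (tool, path) pairs taken during the rollout
    verifier_reward: float          # reward assigned by the task verifier


def monitor_flags(traj: Trajectory) -> bool:
    """Deterministic monitor: True if any action touches a banned path."""
    return any(path.startswith(BANNED_PREFIXES) for _, path in traj.actions)


def gate(traj: Trajectory, judge: Callable[[Trajectory], bool]) -> tuple[float, bool]:
    """Return (reward, include_in_advantage) for one trajectory."""
    if monitor_flags(traj):
        return 0.0, False           # zero reward and excluded, as described above
    if judge(traj):                 # judge returns True to veto
        return 0.0, True            # assumption: vetoed rollouts are kept with zero reward
    return traj.verifier_reward, True


def group_advantages(
    trajs: list[Trajectory], judge: Callable[[Trajectory], bool]
) -> list[Optional[float]]:
    """Mean-baseline advantages; None marks trajectories excluded by the monitor."""
    gated = [gate(t, judge) for t in trajs]
    kept = [reward for reward, include in gated if include]
    baseline = mean(kept) if kept else 0.0
    return [(reward - baseline) if include else None for reward, include in gated]
```

The first layer, the fixed outer trust boundary, has no counterpart in this sketch because it lives in the environment rather than in the reward code.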

Asynchronous RL Training

For RL training, Ornith-1.0 adopts a pipeline-RL strategy to address the off-policy problem that arises with long rollouts: by the time a long rollout finishes, the policy that sampled its early tokens has already been updated. A staleness weight downweights older, off-policy tokens according to their age and drops them entirely once a threshold is exceeded. The optimization uses a token-level GRPO objective weighted by this staleness factor.
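
The source does not give the functional form of the staleness weight, so the sketch below assumes exponential decay with a hard cutoff and a standard clipped-ratio term averaged over all tokens. The names `staleness_weight`, `decay`, and `max_age` are illustrative, not taken from DeepReinforce.

```python
def staleness_weight(age: int, decay: float = 0.5, max_age: int = 3) -> float:
    """Weight for a token sampled `age` policy updates ago (assumed form)."""
    if age > max_age:
        return 0.0          # too stale: dropped entirely
    return decay ** age     # older tokens count for less


def clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO/GRPO-style clipped surrogate for a single token."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def token_level_objective(
    tokens: list[tuple[float, float, int]], eps: float = 0.2
) -> float:
    """Average staleness-weighted surrogate over all tokens in the batch.

    Each token is (importance_ratio, advantage, age). Normalizing by the
    total token count, including dropped tokens, is an assumption.
    """
    if not tokens:
        return 0.0
    total = sum(
        staleness_weight(age) * clipped_term(ratio, adv, eps)
        for ratio, adv, age in tokens
    )
    return total / len(tokens)
```

With the defaults, a fresh token has weight 1.0, a token two updates old has weight 0.25, and anything older than three updates contributes nothing to the gradient.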

Benchmark Results: Crushing the Competition

Ornith-1.0 achieves state-of-the-art performance among open-source models of comparable size across a broad range of agentic coding benchmarks.

Flagship: Ornith-1.0-397B

At flagship scale, Ornith-1.0-397B posts results that compete with the best closed-weight models:

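| Benchmark | Ornith-1.0-397B | Claude Opus 4.7 | Claude Opus 4.8 | GLM-5.2-744B | MiniMax-M3-428B | DeepSeek-V4-Pro-1.6T |
| --- | --- | --- | --- | --- | --- | --- |
| Terminal-Bench 2.1 | 77.5 | 70.3 | 85.0 | 81.0 | 64.0 | 64.0 |
| SWE-Bench Verified | 82.4 | 80.8 | 87.6 | n/a | n/a | 80.6 |
| SWE-Bench Pro | 62.2 | 64.3 | 69.2 | 62.1 | 59.0 | 55.4 |

The remaining rows have blank cells in the source table, so their values are listed in left-to-right order without a column assignment:

SWE-Bench Multilingual: 78.9, 76.2
NL2Repo: 48.2, 69.7, 48.9, 42.1
ClawEval Avg: 77.1, 78.2, 75.8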

Ornith-1.0-397B beats Claude Opus 4.7 on both Terminal-Bench 2.1 (77.5 vs. 70.3) and SWE-Bench Verified (82.4 vs. 80.8). On those two benchmarks it trails only Claude Opus 4.8 (85.0 on TB-2.1, 87.6 on SWE-Bench Verified) and, on TB-2.1, the larger GLM-5.2-744B (81.0). On SWE-Bench Pro, Claude Opus 4.7 stays ahead (64.3 vs. 62.2).

Mid-Scale: Ornith-1.0-35B

The 35B model significantly outperforms similarly sized models, including Qwen 3.5-35B, Qwen 3.6-35B, and Gemma 4-31B. Despite having only 35B parameters, it even surpasses Qwen 3.5-397B on Terminal-Bench 2.1 (64.4 vs. 53.5) while matching its performance across several other coding and agentic benchmarks.

Edge-Deployable: Ornith-1.0-9B

The compact 9B model scores 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified. Despite its size, it matches or exceeds much larger models such as Gemma 4-31B on every benchmark listed below, which suggests that strong agentic coding capabilities can be achieved even in resource-efficient deployments.

Full Benchmark Table

Ornith-1.0-9B vs Competitors

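| Benchmark | Ornith-1.0-9B | Qwen3.5-9B | Gemma4-12B | Gemma4-31B |
| --- | --- | --- | --- | --- |
| Terminal-Bench 2.1 | 43.1 | 21.3 | 21.0 | 42.1 |
| SWE-Bench Verified | 69.4 | 53.2 | 44.2 | 52.0 |
| SWE-Bench Pro | 42.9 | 31.3 | 27.6 | 35.7 |
| SWE-Bench Multilingual | 52.0 | 39.7 | 32.5 | 51.7 |
| NL2Repo | 27.2 | 16.2 | 10.3 | 15.5 |
| ClawEval Avg | 63.1 | 53.2 | 32.5 | 48.5 |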
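
The margins in the 9B comparison can be checked with a few lines of arithmetic. The scores below are copied from the table above; the `margins` helper is only an illustrative name.

```python
MODELS = ("Ornith-1.0-9B", "Qwen3.5-9B", "Gemma4-12B", "Gemma4-31B")

# Scores per benchmark, in the same column order as MODELS.
BENCHMARKS = {
    "Terminal-Bench 2.1": (43.1, 21.3, 21.0, 42.1),
    "SWE-Bench Verified": (69.4, 53.2, 44.2, 52.0),
    "SWE-Bench Pro": (42.9, 31.3, 27.6, 35.7),
    "SWE-Bench Multilingual": (52.0, 39.7, 32.5, 51.7),
    "NL2Repo": (27.2, 16.2, 10.3, 15.5),
    "ClawEval Avg": (63.1, 53.2, 32.5, 48.5),
}


def margins(baseline: str) -> dict[str, float]:
    """Points by which Ornith-1.0-9B leads `baseline` on each benchmark."""
    i = MODELS.index(baseline)
    return {
        name: round(scores[0] - scores[i], 1)
        for name, scores in BENCHMARKS.items()
    }


if __name__ == "__main__":
    for name, delta in margins("Gemma4-31B").items():
        print(f"{name}: {delta:+.1f}")
```

Against Gemma4-31B the lead is narrow on Terminal-Bench 2.1 and SWE-Bench Multilingual and much wider on SWE-Bench Verified, so "matches or exceeds" covers a broad range of gaps.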

Deployment: One Command Away

Deployment is straightforward. The 9B model is about 19GB in bf16 and serves comfortably on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Each model exposes an OpenAI-compatible endpoint, so standard agent frameworks work without code changes.

vLLM Quick Start

```bash
vllm serve deepreinforce-ai/Ornith-1.0-9B \
  --served-model-name Ornith-1.0-9B \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --trust-remote-code
```

Python Client

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Ornith-1.0-9B",
    messages=[{"role": "user", "content": "Write a Python is_prime(n)."}],
    temperature=0.6, top_p=0.95,
    extra_body={"top_k": 20},  # top_k is a vLLM extension, not an OpenAI field
)
msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # the reasoning trace
print(msg.content)  # the final answer
```

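The reasoning trace is returned in reasoning_content, with the answer in content. Recommended sampling is temperature=0.6, top_p=0.95, top_k=20. Because top_k is not part of the OpenAI chat-completions schema, pass it to vLLM through the client's extra_body argument.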
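
Since the server command enables `--enable-auto-tool-choice`, requests can also carry tool definitions. The following is a minimal sketch of that round trip, assuming the OpenAI function-calling format. The `run_tests` tool, its schema, and the `ask_with_tools` helper are hypothetical, and `client` is expected to be an OpenAI client like the one created above.

```python
import json

# Hypothetical tool schema in the OpenAI function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return a summary.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]


def run_tests(path: str) -> str:
    """Stand-in tool; a real agent would shell out to a test runner."""
    return f"ran tests under {path}: 0 failures"


REGISTRY = {"run_tests": run_tests}


def dispatch(name: str, arguments_json: str) -> str:
    """Execute one tool call given its name and JSON-encoded arguments."""
    args = json.loads(arguments_json)
    return REGISTRY[name](**args)


def ask_with_tools(client, prompt: str) -> list[str]:
    """Send one request with tools attached and run any tool calls returned."""
    resp = client.chat.completions.create(
        model="Ornith-1.0-9B",
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
        temperature=0.6, top_p=0.95,
    )
    msg = resp.choices[0].message
    return [
        dispatch(call.function.name, call.function.arguments)
        for call in (msg.tool_calls or [])
    ]
```

A full agent loop would append each tool result to the conversation as a `tool` message and call the model again. This sketch stops after the first round to keep the flow visible.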

Model Variants and Specifications

| Model | Parameters | Type | Size (bf16) | Best For |
| --- | --- | --- | --- | --- |
| Ornith-1.0-9B | ~9B | Dense | ~19GB | Edge deployment, single GPU |
| Ornith-1.0-31B | ~31B | Dense | ~62GB | High-performance single-node |
| Ornith-1.0-35B | ~35B (3B active) | MoE | ~70GB | Best efficiency/performance ratio |
| Ornith-1.0-397B | ~397B (37B active) | MoE | ~794GB | Frontier-scale agentic coding |

All models ship with FP8 and GGUF builds for faster local serving. The 35B and 397B MoE models use sparse expert routing, activating only a subset of experts per token for efficient inference.
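
The bf16 sizes listed for the variants follow from two bytes per parameter. The sketch below reproduces that arithmetic and estimates what is left for KV cache under a memory-utilization cap. The helper names are illustrative, and the estimate covers weights only, ignoring activations and runtime overhead.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0}


def weight_size_gb(params_billion: float, dtype: str = "bf16") -> float:
    """Weights only: parameters times bytes per parameter, in decimal GB."""
    return params_billion * BYTES_PER_PARAM[dtype]


def headroom_gb(gpu_gb: float, weights_gb: float, utilization: float = 0.90) -> float:
    """Memory left for KV cache once weights are loaded, under a utilization cap."""
    return gpu_gb * utilization - weights_gb


if __name__ == "__main__":
    for name, params in [("9B", 9), ("31B", 31), ("35B", 35), ("397B", 397)]:
        print(f"Ornith-1.0-{name}: ~{weight_size_gb(params):.0f} GB in bf16")
    # Using the ~19GB figure quoted for the 9B model on one 80GB GPU.
    print(f"KV-cache headroom: ~{headroom_gb(80, 19):.0f} GB")
```

The nominal 9B count gives 18 GB, slightly under the quoted ~19GB, which suggests the actual parameter count is a little above 9B. For the MoE variants, all experts must be resident in memory even though only the active parameters run per token, so the total parameter count determines the footprint.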

Why Ornith-1.0 Matters

1. Self-Improving by Design

Unlike traditional coding agents that rely on static, human-designed scaffolds, Ornith-1.0's scaffolds evolve with the model. This means the orchestration logic improves alongside the model's raw capabilities, creating a compounding effect.

2. Open-Weight and MIT Licensed

All models are released under the MIT license, enabling full commercial and research use. No API keys, no rate limits, no vendor lock-in. This is genuine open-source AI infrastructure.

3. Edge to Frontier Coverage

From a 9B model that runs on a single GPU to a 397B MoE flagship, the Ornith-1.0 family covers the full deployment spectrum. Whether you're building a local coding assistant or an enterprise-grade agentic platform, there's a model for you.

4. Real-World Agentic Performance

The benchmarks aren't just academic exercises. Terminal-Bench 2.1 tests real command-line agentic tasks. SWE-Bench Verified tests multi-file GitHub issue resolution. These are the tasks that matter for production coding agents.

The Competitive Landscape

Ornith-1.0 enters a crowded field of open-weight coding models. Let's see how it stacks up:

  • vs. GLM-5.2-744B: GLM leads on Terminal-Bench 2.1 (81.0 vs 77.5) but is text-only and lacks the self-scaffolding innovation. Ornith-1.0 is more versatile for agentic workflows.
  • vs. MiniMax-M3-428B: MiniMax M3 is natively multimodal (video, image, desktop) but trails Ornith-1.0 on pure coding benchmarks (64.0 vs 77.5 on TB-2.1).
  • vs. DeepSeek-V4-Pro-1.6T: DeepSeek's flagship is massive but underperforms Ornith-1.0 on both Terminal-Bench and SWE-Bench (64.0 vs 77.5, 80.6 vs 82.4).
  • vs. Claude Opus 4.8: The closed-weight champion still leads on most benchmarks, but Ornith-1.0-397B is competitive—and fully open.

Conclusion

Ornith-1.0 represents a paradigm shift in how we train coding agents. By treating the scaffold as a learnable object that co-evolves with the model's policy, DeepReinforce has created a self-improving system that gets better at both solving tasks and orchestrating the solution process.

The results speak for themselves. A 9B model that outperforms Gemma 4-31B. A 35B model that beats a 397B model on Terminal-Bench 2.1. A 397B flagship that challenges Claude Opus 4.8. All under MIT license, all ready to deploy.

For developers, researchers, and AI engineers, Ornith-1.0 is not just another model release—it's a new way of thinking about agentic coding. The future of AI-assisted software engineering just got a lot more interesting.

📖 Tech Blog: deep-reinforce.com/ornith_1_0.html
🤗 Hugging Face: huggingface.co/collections/deepreinforce-ai/ornith-10
📄 License: MIT (Full commercial and research use)

Sources: DeepReinforce Blog, MarkTechPost, Hugging Face Model Hub, X/Twitter (@rohanpaul_ai, @TechByMarkandey), AI Weekly, Reddit r/LocalLLaMA

Tags: AI, Open Source, Coding, LLM, Reinforcement Learning, 2026