June 26, 2026

7 min read

AI & Machine Learning

Ornith-1.0: The Self-Scaffolding LLM That Teaches Itself to Code Better

> DeepReinforce releases Ornith-1.0—a family of open-source coding models (9B to 397B MoE) that learn to generate their own task-specific scaffolds during RL training. State-of-the-art results on Terminal-Bench, SWE-Bench, and more.


Verified by Essa Mamdani

DeepReinforce has introduced Ornith-1.0, a self-improving family of open-source models designed for agentic coding tasks. The family comes in four variants: 9B Dense, 31B Dense, 35B MoE, and 397B MoE. They range from a compact model suited to edge deployment up to a frontier-scale model optimized for maximum performance.

Built on top of pretrained Gemma 4 and Qwen 3.5, it achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks.

The Self-Scaffolding Revolution

The key innovation behind Ornith-1.0 is a self-improving training framework. Instead of relying on human-designed harnesses to drive solution generation in RL, Ornith-1.0 learns to generate both solution rollouts and the task-specific harnesses that guide those rollouts. By jointly optimizing the scaffold and the resulting solution, the model can discover better search trajectories and generate higher-quality solutions.

How It Works

Most coding agents rely on a scaffold (also called a harness) that wraps the model with memory, tools, error handling, and orchestration logic. AI teams usually hand-design one scaffold per task category.

Ornith-1.0 treats the scaffold as a learnable object instead. During reinforcement learning, the scaffold co-evolves with the model's policy. Each RL step runs in two stages:

Scaffold Generation: The model reads the task and its previous scaffold, then proposes a refined scaffold.
Solution Rollout: It uses that scaffold and the task to generate a solution rollout. Reward from the rollout flows back to both stages.

So the model is optimized to author orchestration, not just answers. Over training, higher-reward scaffolds are mutated and selected automatically. Per-task strategies emerge without hand-engineered harness design.
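The two-stage step described above can be sketched in a few lines of Python. This is a minimal illustration, not DeepReinforce's training code: `rl_step`, `StepRecord`, and the three `toy_*` functions are hypothetical stand-ins for the model's scaffold proposal, its rollout, and the environment's verifier.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class StepRecord:
    scaffold: str
    solution: str
    reward: float

    def stage_rewards(self) -> Dict[str, float]:
        # The rollout reward is credited to both stages, so the scaffold
        # author and the solution author are optimized together.
        return {"scaffold_generation": self.reward, "solution_rollout": self.reward}


def rl_step(
    task: str,
    prev_scaffold: str,
    propose_scaffold: Callable[[str, str], str],
    rollout: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> StepRecord:
    # Stage 1: read the task and the previous scaffold, propose a refined one.
    scaffold = propose_scaffold(task, prev_scaffold)
    # Stage 2: generate a solution using the task and the new scaffold.
    solution = rollout(task, scaffold)
    return StepRecord(scaffold, solution, score(task, solution))


# Toy stand-ins (assumptions for illustration only).
def toy_propose(task: str, prev: str) -> str:
    return prev if "run tests" in prev else prev + " -> run tests"


def toy_rollout(task: str, scaffold: str) -> str:
    suffix = " (tested)" if "run tests" in scaffold else ""
    return f"patch for {task}{suffix}"


def toy_score(task: str, solution: str) -> float:
    return 1.0 if "(tested)" in solution else 0.0


record = rl_step("fix-bug-17", "read issue", toy_propose, toy_rollout, toy_score)
print(record.scaffold)
print(record.stage_rewards())
```

In a real trainer the three callables would all be the same policy model prompted differently plus a sandboxed verifier; the point of the sketch is only that one reward signal updates both the scaffold and the solution.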

Guarding Against Reward Hacking

Letting a model write its own scaffold naturally invites reward hacking. A scaffold could read visible test files and hardcode expected outputs, or copy an oracle solution sitting in the environment. DeepReinforce describes three defense layers:

Fixed Outer Trust Boundary: The environment, tool surface, and test isolation stay outside the model's reach. The model evolves only its inner policy scaffold.
Deterministic Monitor: Flags banned actions like reading withheld paths or editing verification scripts. Such trajectories earn zero reward and are excluded from advantage computation.
Frozen LLM Judge: Acts as a veto on top of the verifier, catching intent-level gaming that occurs within the allowed tool surface.
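A rough sketch of how the monitor and judge layers could feed into reward and advantage computation is shown below. The banned path patterns, the `(kind, path)` action format, and the mean/standard-deviation normalization are all assumptions made for illustration; the source only states that flagged trajectories earn zero reward and are excluded from advantage computation, and that the judge acts as a veto.

```python
from fnmatch import fnmatchcase
from statistics import mean, pstdev
from typing import List, Optional, Sequence, Tuple

# Hypothetical patterns; a real monitor would be configured per environment.
BANNED_READ = ["tests/hidden/*", "*/oracle_solution*"]
BANNED_WRITE = ["verify.sh", "tests/*"]

Action = Tuple[str, str]  # (kind, path), e.g. ("read", "src/app.py")


def is_flagged(actions: Sequence[Action]) -> bool:
    """Deterministic check: True if any action touches a banned path."""
    for kind, path in actions:
        patterns = {"read": BANNED_READ, "write": BANNED_WRITE}.get(kind, [])
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            return True
    return False


def final_reward(verifier_passed: bool, flagged: bool, judge_veto: bool) -> float:
    # The monitor flag or the judge veto overrides a passing verifier
    # (mapping the veto to zero reward is an assumption).
    if flagged or judge_veto:
        return 0.0
    return 1.0 if verifier_passed else 0.0


def group_advantages(
    rewards: Sequence[float], flagged: Sequence[bool]
) -> List[Optional[float]]:
    """Group-relative advantages computed over unflagged trajectories only."""
    kept = [r for r, f in zip(rewards, flagged) if not f]
    if not kept:
        return [None] * len(rewards)
    mu = mean(kept)
    sigma = pstdev(kept) or 1.0  # avoid dividing by zero when all rewards match
    return [None if f else (r - mu) / sigma for r, f in zip(rewards, flagged)]


trajectories = [
    [("read", "src/app.py"), ("write", "src/app.py")],
    [("read", "README.md")],
    [("read", "tests/hidden/case1.json")],  # reads a withheld path
]
flags = [is_flagged(t) for t in trajectories]
rewards = [final_reward(v, f, False) for v, f in zip([True, False, True], flags)]
print(flags, rewards, group_advantages(rewards, flags))
```

Returning `None` for flagged trajectories, rather than a zero advantage, keeps them out of the group statistics so a cheating rollout cannot shift the baseline for honest ones.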

Asynchronous RL Training

For RL training, Ornith-1.0 adopts a pipeline-RL strategy to address the off-policy problem that arises with long rollouts: by the time a long rollout finishes, the policy that produced its early tokens may already have been updated. A staleness weight downweights older, off-policy tokens according to their age and drops them entirely once a threshold is exceeded. The optimization uses a token-level GRPO objective weighted by this staleness factor.
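The staleness weighting can be illustrated with a small sketch. The source does not give the functional form, so everything specific here is an assumption: age is measured in policy updates, the weight decays exponentially with a half-life, and the loss is the usual clipped surrogate normalized by the total weight.

```python
import math
from typing import Sequence


def staleness_weight(age: int, half_life: float = 4.0, max_age: int = 16) -> float:
    """Weight for a token generated `age` policy updates ago (assumed form)."""
    if age > max_age:
        return 0.0  # past the threshold: drop the token entirely
    return 0.5 ** (age / half_life)


def weighted_token_loss(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    advantages: Sequence[float],
    ages: Sequence[int],
    clip_eps: float = 0.2,
) -> float:
    """Token-level clipped surrogate loss, weighted by staleness."""
    weighted_sum = 0.0
    weight_total = 0.0
    for ln, lo, adv, age in zip(logp_new, logp_old, advantages, ages):
        w = staleness_weight(age)
        if w == 0.0:
            continue
        ratio = math.exp(ln - lo)  # importance ratio for this token
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        surrogate = min(ratio * adv, clipped * adv)
        weighted_sum += w * surrogate
        weight_total += w
    # Negate because optimizers minimize; return 0.0 if every token was dropped.
    return -weighted_sum / weight_total if weight_total else 0.0


print(staleness_weight(0), staleness_weight(4), staleness_weight(17))
print(weighted_token_loss([-1.0, -1.0], [-1.0, -1.0], [1.0, 1.0], [0, 100]))
```

With this form, fresh tokens keep full weight, tokens one half-life old count half as much, and anything older than `max_age` contributes nothing to the gradient.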

Benchmark Results: Crushing the Competition

Ornith-1.0 achieves state-of-the-art performance among open-source models of comparable size across a broad range of agentic coding benchmarks.

Flagship: Ornith-1.0-397B

At flagship scale, Ornith-1.0-397B posts results that compete with the best closed-weight models:

Benchmark	Ornith-1.0-397B	Claude Opus 4.7	Claude Opus 4.8	GLM-5.2-744B	MiniMax-M3-428B	DeepSeek-V4-Pro-1.6T
Terminal-Bench 2.1	77.5	70.3	85.0	81.0	64.0	64.0
SWE-Bench Verified	82.4	80.8	87.6	—	—	80.6
SWE-Bench Pro	62.2	64.3	69.2	62.1	59.0	55.4
SWE-Bench Multilingual	78.9	—	—	—	—	76.2
NL2Repo	48.2	—	69.7	48.9	42.1	—
ClawEval Avg	77.1	78.2	—	—	—	75.8

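Ornith-1.0-397B beats Claude Opus 4.7 on both Terminal-Bench 2.1 (77.5 vs 70.3) and SWE-Bench Verified (82.4 vs 80.8), though Opus 4.7 stays ahead on SWE-Bench Pro (64.3 vs 62.2) and ClawEval (78.2 vs 77.1). On those first two benchmarks it trails only Claude Opus 4.8 (85.0 on TB-2.1, 87.6 on SWE Verified) and the larger GLM-5.2-744B (81.0 on TB-2.1) among the listed models.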

Mid-Scale: Ornith-1.0-35B

The 35B model significantly outperforms similarly sized models, including Qwen 3.5-35B, Qwen 3.6-35B, and Gemma 4-31B. Despite having only 35B parameters, it surpasses Qwen 3.5-397B on Terminal-Bench 2.1 (64.4 vs. 53.5) while matching its performance across several other coding and agentic benchmarks.

Edge-Deployable: Ornith-1.0-9B

The compact 9B model scores 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified. It matches or exceeds the much larger Gemma 4-31B on every benchmark in the table below, which suggests that strong agentic coding capability is achievable in resource-efficient deployments.

Full Benchmark Table

Ornith-1.0-9B vs Competitors

Benchmark	Ornith-1.0-9B	Qwen3.5-9B	Gemma4-12B	Gemma4-31B
Terminal-Bench 2.1	43.1	21.3	21.0	42.1
SWE-Bench Verified	69.4	53.2	44.2	52.0
SWE-Bench Pro	42.9	31.3	27.6	35.7
SWE-Bench Multilingual	52.0	39.7	32.5	51.7
NL2Repo	27.2	16.2	10.3	15.5
ClawEval Avg	63.1	53.2	32.5	48.5

Deployment: One Command Away

Deployment is straightforward. The 9B model is about 19GB in bf16 and serves comfortably on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Each model exposes an OpenAI-compatible endpoint, so standard agent frameworks work without code changes.

vLLM Quick Start

```bash
vllm serve deepreinforce-ai/Ornith-1.0-9B \
  --served-model-name Ornith-1.0-9B \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --trust-remote-code
```

Python Client

```python
from openai import OpenAI

# Point the client at the local vLLM server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Ornith-1.0-9B",  # must match --served-model-name
    messages=[{"role": "user", "content": "Write a Python is_prime(n)."}],
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20},  # top_k is a vLLM extension, not a standard OpenAI field
)
msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # the reasoning trace
print(msg.content)  # the final answer
```

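The reasoning trace is returned in reasoning_content and the final answer in content. Recommended sampling is temperature=0.6, top_p=0.95, top_k=20. Note that top_k is not part of the standard OpenAI request schema; with vLLM it can be passed through the client's extra_body argument.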
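Because the server above is started with --enable-auto-tool-choice, the same endpoint can also return tool calls. The sketch below uses only the standard library and the OpenAI-style `tools` request format; the `run_shell` tool and the helper names are hypothetical, and the response shape is assumed to follow the OpenAI chat-completions schema.

```python
import json
import urllib.request
from typing import Any, Dict, List, Tuple


def build_tool_request(prompt: str, model: str = "Ornith-1.0-9B") -> Dict[str, Any]:
    """Build a chat-completions payload that offers one tool to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "run_shell",  # hypothetical tool for illustration
                    "description": "Run a shell command and return its output.",
                    "parameters": {
                        "type": "object",
                        "properties": {"command": {"type": "string"}},
                        "required": ["command"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
        "temperature": 0.6,
        "top_p": 0.95,
    }


def post_chat(payload: Dict[str, Any], base_url: str = "http://localhost:8000/v1") -> Dict[str, Any]:
    """POST the payload to the OpenAI-compatible endpoint and decode the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_tool_calls(response: Dict[str, Any]) -> List[Tuple[str, Dict[str, Any]]]:
    """Return (tool name, parsed arguments) pairs from a chat-completions reply."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"])) for c in calls]
```

With the server running, `extract_tool_calls(post_chat(build_tool_request("List the files in the repo.")))` would yield whatever calls the model chose to make; executing them safely is left to the agent framework.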

Model Variants and Specifications

Model	Parameters	Type	Size (bf16)	Best For
Ornith-1.0-9B	~9B	Dense	~19GB	Edge deployment, single GPU
Ornith-1.0-31B	~31B	Dense	~62GB	High-performance single-node
Ornith-1.0-35B	~35B (3B active)	MoE	~70GB	Best efficiency/performance ratio
Ornith-1.0-397B	~397B (37B active)	MoE	~794GB	Frontier-scale agentic coding

All models ship with FP8 and GGUF builds for faster local serving. The 35B and 397B MoE models use sparse expert routing, activating only a subset of experts per token for efficient inference.
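The bf16 sizes in the table follow from simple arithmetic: two bytes per parameter. The sketch below reproduces that estimate and adds a rough GPU count; it covers weights only (no KV cache or activations), and the 0.90 utilization default simply mirrors the vLLM flag shown earlier. The helper names are made up for this example.

```python
import math

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0}


def weight_gb(params_billion: float, dtype: str = "bf16") -> float:
    """Approximate weight size in GB: billions of params times bytes per param."""
    return params_billion * BYTES_PER_PARAM[dtype]


def min_gpus(params_billion: float, gpu_gb: float = 80.0,
             dtype: str = "bf16", utilization: float = 0.90) -> int:
    """Minimum GPUs needed to hold the weights alone at the given utilization."""
    return math.ceil(weight_gb(params_billion, dtype) / (gpu_gb * utilization))


for name, total_params in [("9B", 9), ("31B", 31), ("35B", 35), ("397B", 397)]:
    print(name, weight_gb(total_params), "GB bf16,", min_gpus(total_params), "x 80GB GPU(s)")
```

Two caveats apply. The nominal 9B count gives 18GB, so the listed ~19GB suggests the actual parameter count is slightly above 9B. And for the MoE models the total parameter count is what matters for memory: every expert must be resident, so the smaller active count reduces compute per token, not the weight footprint.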

Why Ornith-1.0 Matters

1. Self-Improving by Design

Unlike traditional coding agents that rely on static, human-designed scaffolds, Ornith-1.0's scaffolds evolve with the model. This means the orchestration logic improves alongside the model's raw capabilities, creating a compounding effect.

2. Open-Weight and MIT Licensed

All models are released under the MIT license, enabling full commercial and research use. No API keys, no rate limits, no vendor lock-in. This is genuine open-source AI infrastructure.

3. Edge to Frontier Coverage

From a 9B model that runs on a single GPU to a 397B MoE flagship, the Ornith-1.0 family covers the full deployment spectrum. Whether you're building a local coding assistant or an enterprise-grade agentic platform, there's a model for you.

4. Real-World Agentic Performance

The benchmarks aren't just academic exercises. Terminal-Bench 2.1 tests real command-line agentic tasks. SWE-Bench Verified tests multi-file GitHub issue resolution. These are the tasks that matter for production coding agents.

The Competitive Landscape

Ornith-1.0 enters a crowded field of open-weight coding models. Let's see how it stacks up:

vs. GLM-5.2-744B: GLM leads on Terminal-Bench 2.1 (81.0 vs 77.5) but is text-only and lacks self-scaffolding. The two are effectively tied on SWE-Bench Pro (62.1 vs 62.2) and NL2Repo (48.9 vs 48.2), with Ornith-1.0 at roughly half the total parameter count.
vs. MiniMax-M3-428B: MiniMax M3 is natively multimodal (video, image, desktop) but trails Ornith-1.0 on pure coding benchmarks (64.0 vs 77.5 on TB-2.1).
vs. DeepSeek-V4-Pro-1.6T: DeepSeek's flagship is massive but underperforms Ornith-1.0 on both Terminal-Bench and SWE-Bench (64.0 vs 77.5, 80.6 vs 82.4).
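vs. Claude Opus 4.8: The closed-weight champion leads on every benchmark where both models report scores, by 5 to 8 points on Terminal-Bench 2.1 and the SWE-Bench variants and by a wider margin on NL2Repo (69.7 vs 48.2). Ornith-1.0-397B's advantage is that it is fully open.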

Conclusion

Ornith-1.0 represents a paradigm shift in how we train coding agents. By treating the scaffold as a learnable object that co-evolves with the model's policy, DeepReinforce has created a self-improving system that gets better at both solving tasks and orchestrating the solution process.

The results speak for themselves. A 9B model that outperforms Gemma 4-31B on every listed benchmark. A 35B model that beats Qwen 3.5-397B on Terminal-Bench 2.1. A 397B flagship that tops Claude Opus 4.7 on Terminal-Bench 2.1 and SWE-Bench Verified. All under MIT license, all ready to deploy.

For developers, researchers, and AI engineers, Ornith-1.0 is not just another model release—it's a new way of thinking about agentic coding. The future of AI-assisted software engineering just got a lot more interesting.

📖 Tech Blog: deep-reinforce.com/ornith_1_0.html
🤗 Hugging Face: huggingface.co/collections/deepreinforce-ai/ornith-10
📄 License: MIT (Full commercial and research use)

Sources: DeepReinforce Blog, MarkTechPost, Hugging Face Model Hub, X/Twitter (@rohanpaul_ai, @TechByMarkandey), AI Weekly, Reddit r/LocalLLaMA
