Ornith-1.0: The Self-Scaffolding LLM That Teaches Itself to Code Better
> DeepReinforce releases Ornith-1.0—a family of open-source coding models (9B to 397B MoE) that learn to generate their own task-specific scaffolds during RL training. State-of-the-art results on Terminal-Bench, SWE-Bench, and more.
Today, we are introducing Ornith-1.0, a self-improving family of open-source models specially designed for agentic coding tasks. Ornith-1.0 spans the full spectrum, from compact 9B Dense models suitable for edge device deployment to 397B MoE frontier-scale models optimized for maximum performance, with variants including 9B Dense, 31B Dense, 35B MoE, and 397B MoE.
Built on top of pretrained Gemma 4 and Qwen 3.5, it achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks.
The Self-Scaffolding Revolution
The key innovation behind Ornith-1.0 is a self-improving training framework. Instead of relying on human-designed harnesses to drive solution generation in RL, Ornith-1.0 learns to generate both solution rollouts and the task-specific harnesses that guide those rollouts. By jointly optimizing the scaffold and the resulting solution, the model can discover better search trajectories and generate higher-quality solutions.
How It Works
Most coding agents rely on a scaffold (also called a harness) that wraps the model with memory, tools, error handling, and orchestration logic. AI teams usually hand-design one scaffold per task category.
Ornith-1.0 treats the scaffold as a learnable object instead. During reinforcement learning, the scaffold co-evolves with the model's policy. Each RL step runs in two stages:
- Scaffold Generation: The model reads the task and its previous scaffold, then proposes a refined scaffold.
- Solution Rollout: It uses that scaffold and the task to generate a solution rollout. Reward from the rollout flows back to both stages.
So the model is optimized to author orchestration, not just answers. Over training, higher-reward scaffolds are mutated and selected automatically. Per-task strategies emerge without hand-engineered harness design.
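The two-stage step described above can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not DeepReinforce's training code: `propose_scaffold`, `rollout`, and `toy_reward` are hypothetical stand-ins for the policy and the verifier, scaffolds are plain strings, and selection is reduced to keeping the highest-reward scaffold as the starting point for the next step. In real RL the reward would drive gradient updates for both stages rather than a greedy pick.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    scaffold: str
    solution: str
    reward: float  # one scalar, credited to both the scaffold and the rollout


def propose_scaffold(task: str, previous: str, rng: random.Random) -> str:
    """Stage 1 (hypothetical stand-in): refine the previous scaffold."""
    step = rng.choice(["plan", "run-tests", "inspect-logs", "retry"])
    return f"{previous} > {step}" if previous else step


def rollout(task: str, scaffold: str) -> str:
    """Stage 2 (hypothetical stand-in): solve the task under the scaffold."""
    return f"solution({task!r} via {scaffold!r})"


def toy_reward(scaffold: str) -> float:
    """Hypothetical verifier: pays out only if the scaffold runs the tests."""
    return 1.0 if "run-tests" in scaffold else 0.0


def rl_step(task: str, previous: str, n: int, rng: random.Random) -> Candidate:
    """Sample n scaffold/rollout pairs and keep the highest-reward one."""
    candidates = []
    for _ in range(n):
        scaffold = propose_scaffold(task, previous, rng)
        solution = rollout(task, scaffold)
        candidates.append(Candidate(scaffold, solution, toy_reward(scaffold)))
    return max(candidates, key=lambda c: c.reward)


def evolve(task: str, steps: int = 5, n: int = 4, seed: int = 0) -> list:
    """Run several steps, feeding each winning scaffold into the next step."""
    rng = random.Random(seed)
    history, previous = [], ""
    for _ in range(steps):
        best = rl_step(task, previous, n, rng)
        history.append(best)
        previous = best.scaffold
    return history


if __name__ == "__main__":
    for i, cand in enumerate(evolve("fix failing unit test")):
        print(i, cand.reward, cand.scaffold)
```

Because each winning scaffold seeds the next proposal, a useful step such as `run-tests` persists once it has been selected, which is the toy analogue of higher-reward scaffolds being mutated and kept.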
Guarding Against Reward Hacking
Letting a model write its own scaffold naturally invites reward hacking. A scaffold could read visible test files and hardcode expected outputs, or copy an oracle solution sitting in the environment. DeepReinforce describes three defense layers:
- Fixed Outer Trust Boundary: The environment, tool surface, and test isolation stay outside the model's reach. The model evolves only its inner policy scaffold.
- Deterministic Monitor: Flags banned actions like reading withheld paths or editing verification scripts. Such trajectories earn zero reward and are excluded from advantage computation.
- Frozen LLM Judge: Acts as a veto on top of the verifier, catching intent-level gaming that occurs within the allowed tool surface.
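The deterministic monitor layer can be illustrated with a small rule check. The sketch below is hedged throughout: the path prefixes, the `Trajectory` shape, and the function names are hypothetical, and the group-normalized advantage is an assumed GRPO-style formulation rather than the published one. What it does take from the description is that flagged trajectories receive zero reward and are left out of the advantage statistics.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional, Tuple

# Hypothetical path prefixes; the real banned list is not given in the source.
WITHHELD_PREFIX = "/withheld/"
VERIFIER_PREFIX = "/verifier/"


@dataclass
class Trajectory:
    actions: List[Tuple[str, str]]  # (verb, path), e.g. ("read", "/repo/a.py")
    verifier_reward: float


def is_flagged(traj: Trajectory) -> bool:
    """Deterministic monitor: a pure rule check with no model in the loop."""
    for verb, path in traj.actions:
        if verb == "read" and path.startswith(WITHHELD_PREFIX):
            return True
        if verb == "write" and path.startswith(VERIFIER_PREFIX):
            return True
    return False


def score_group(group: List[Trajectory]) -> List[Tuple[float, Optional[float]]]:
    """Return (reward, advantage) per trajectory.

    Flagged trajectories get reward 0.0 and advantage None, and they do not
    contribute to the group mean or standard deviation.
    """
    flags = [is_flagged(t) for t in group]
    clean = [t.verifier_reward for t, f in zip(group, flags) if not f]
    mu = mean(clean) if clean else 0.0
    sigma = pstdev(clean) if len(clean) > 1 else 0.0
    sigma = sigma or 1.0  # avoid division by zero when all rewards match
    return [
        (0.0, None) if f else (t.verifier_reward, (t.verifier_reward - mu) / sigma)
        for t, f in zip(group, flags)
    ]
```

Excluding flagged rollouts from the statistics, rather than only zeroing their reward, keeps a cheating trajectory from shifting the baseline that honest trajectories are compared against.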
Asynchronous RL Training
For RL training, Ornith-1.0 adopts a pipeline-RL strategy to address the off-policy problem that arises with long rollouts: by the time a long rollout finishes, the policy that produced its early tokens may already have been updated. A staleness weight downweights older, off-policy tokens according to their age and drops them entirely once a threshold is exceeded. The optimization uses a token-level GRPO objective weighted by this staleness factor.
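A minimal sketch of such a staleness weight follows. The functional form is an assumption: the description only says the weight falls with age and reaches zero past a threshold, so the geometric decay, the default `decay` and `max_age` values, the definition of age as the number of policy versions since sampling, and the normalization by total weight are all illustrative choices. The per-token GRPO terms are taken as given inputs.

```python
from typing import Sequence


def staleness_weight(age: int, decay: float = 0.5, max_age: int = 4) -> float:
    """Weight for a token sampled `age` policy versions ago.

    Assumed form: geometric decay with a hard cutoff past `max_age`.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age > max_age:
        return 0.0
    return decay ** age


def weighted_token_objective(
    token_terms: Sequence[float],
    ages: Sequence[int],
    decay: float = 0.5,
    max_age: int = 4,
) -> float:
    """Staleness-weighted average of per-token surrogate terms.

    `token_terms` stands in for the per-token GRPO terms (for example a
    clipped ratio times the advantage); computing them is out of scope here.
    """
    if len(token_terms) != len(ages):
        raise ValueError("token_terms and ages must have the same length")
    weights = [staleness_weight(a, decay, max_age) for a in ages]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every token was too stale to use
    return sum(w * x for w, x in zip(weights, token_terms)) / total
```

With the defaults, a token sampled two policy versions ago counts a quarter as much as a fresh one, and anything older than four versions is ignored.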
Benchmark Results: Crushing the Competition
Ornith-1.0 achieves state-of-the-art performance among open-source models of comparable size across a broad range of agentic coding benchmarks.
Flagship: Ornith-1.0-397B
At flagship scale, Ornith-1.0-397B posts results that compete with the best closed-weight models:
| Benchmark | Ornith-1.0-397B | Claude Opus 4.7 | Claude Opus 4.8 | GLM-5.2-744B | MiniMax-M3-428B | DeepSeek-V4-Pro-1.6T |
|---|---|---|---|---|---|---|
| Terminal-Bench 2.1 | 77.5 | 70.3 | 85.0 | 81.0 | 64.0 | 64.0 |
| SWE-Bench Verified | 82.4 | 80.8 | 87.6 | — | — | 80.6 |
| SWE-Bench Pro | 62.2 | 64.3 | 69.2 | 62.1 | 59.0 | 55.4 |
| SWE-Bench Multilingual | 78.9 | — | — | — | — | 76.2 |
| NL2Repo | 48.2 | — | 69.7 | 48.9 | 42.1 | — |
| ClawEval Avg | 77.1 | 78.2 | — | — | — | 75.8 |
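Ornith-1.0-397B beats Claude Opus 4.7 on both Terminal-Bench 2.1 (77.5 vs. 70.3) and SWE-Bench Verified (82.4 vs. 80.8), though Opus 4.7 stays ahead on SWE-Bench Pro (64.3 vs. 62.2) and ClawEval Avg (78.2 vs. 77.1). On Terminal-Bench 2.1 and SWE-Bench Verified it trails only Claude Opus 4.8 (85.0 on TB-2.1, 87.6 on SWE Verified) and the larger GLM-5.2-744B (81.0 on TB-2.1) among the listed models.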
Mid-Scale: Ornith-1.0-35B
The 35B model significantly outperforms similarly sized models, including Qwen 3.5-35B, Qwen 3.6-35B, and Gemma 4-31B. Despite having only 35B parameters, it even surpasses Qwen 3.5-397B on Terminal-Bench 2.1 (64.4 vs. 53.5) while matching its performance across several other coding and agentic benchmarks.
Edge-Deployable: Ornith-1.0-9B
The compact 9B model delivers remarkably strong results, achieving 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified. Despite being a compact 9B-parameter model, it matches or exceeds the performance of much larger models such as Gemma 4-31B, demonstrating that strong agentic coding capabilities can be achieved even in resource-efficient deployments.
Full Benchmark Table
Ornith-1.0-9B vs Competitors
| Benchmark | Ornith-1.0-9B | Qwen3.5-9B | Gemma4-12B | Gemma4-31B |
|---|---|---|---|---|
| Terminal-Bench 2.1 | 43.1 | 21.3 | 21.0 | 42.1 |
| SWE-Bench Verified | 69.4 | 53.2 | 44.2 | 52.0 |
| SWE-Bench Pro | 42.9 | 31.3 | 27.6 | 35.7 |
| SWE-Bench Multilingual | 52.0 | 39.7 | 32.5 | 51.7 |
| NL2Repo | 27.2 | 16.2 | 10.3 | 15.5 |
| ClawEval Avg | 63.1 | 53.2 | 32.5 | 48.5 |
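The claim that the 9B model matches or exceeds Gemma4-31B can be checked directly against the table. The short script below copies the two relevant columns and computes the per-benchmark margin; the `SCORES` dictionary and the `margins` helper are illustrative names, and the numbers are only those reported above.

```python
# Scores copied from the table above: (Ornith-1.0-9B, Gemma4-31B).
SCORES = {
    "Terminal-Bench 2.1": (43.1, 42.1),
    "SWE-Bench Verified": (69.4, 52.0),
    "SWE-Bench Pro": (42.9, 35.7),
    "SWE-Bench Multilingual": (52.0, 51.7),
    "NL2Repo": (27.2, 15.5),
    "ClawEval Avg": (63.1, 48.5),
}


def margins(scores):
    """Ornith-1.0-9B minus Gemma4-31B, rounded to one decimal place."""
    return {name: round(a - b, 1) for name, (a, b) in scores.items()}


if __name__ == "__main__":
    # Print benchmarks from the smallest margin to the largest.
    for name, delta in sorted(margins(SCORES).items(), key=lambda kv: kv[1]):
        print(f"{name:24s} {delta:+.1f}")
```

The margins are positive on every row, ranging from 0.3 points on SWE-Bench Multilingual to 17.4 on SWE-Bench Verified, so "matches" describes Terminal-Bench 2.1 and SWE-Bench Multilingual while "exceeds" describes the rest.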
Deployment: One Command Away
Deployment is straightforward. The 9B model is about 19GB in bf16 and serves comfortably on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Each model exposes an OpenAI-compatible endpoint, so standard agent frameworks work without code changes.
vLLM Quick Start
```bash
vllm serve deepreinforce-ai/Ornith-1.0-9B \
  --served-model-name Ornith-1.0-9B \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --trust-remote-code
```
Python Client
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Ornith-1.0-9B",
    messages=[{"role": "user", "content": "Write a Python is_prime(n)."}],
    temperature=0.6, top_p=0.95,
)
msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # the reasoning trace
print(msg.content)                              # the final answer
```
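The reasoning trace is returned in reasoning_content and the final answer in content. Recommended sampling is temperature=0.6, top_p=0.95, top_k=20. Note that top_k is not part of the standard OpenAI request schema; with the openai Python client against vLLM it can be passed via extra_body={"top_k": 20}.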
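Since the quick-start command enables automatic tool choice, a tool-calling request is a natural next step. The sketch below builds such a request in the standard OpenAI tools format and includes the recommended sampling settings. The `run_tests` tool is hypothetical and exists only for illustration, and `build_request` and `send` are helper names introduced here; `send` assumes the vLLM server from the quick start is running locally.

```python
def build_request(prompt: str) -> dict:
    """Keyword arguments for client.chat.completions.create()."""
    run_tests_tool = {  # hypothetical tool, for illustration only
        "type": "function",
        "function": {
            "name": "run_tests",
            "description": "Run the project's test suite and return the output.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "Test file or directory.",
                    }
                },
                "required": ["path"],
            },
        },
    }
    return {
        "model": "Ornith-1.0-9B",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [run_tests_tool],
        "tool_choice": "auto",
        "temperature": 0.6,
        "top_p": 0.95,
        "extra_body": {"top_k": 20},  # non-OpenAI sampling parameter, vLLM-side
    }


def send(prompt: str):
    """Send the request; requires the vLLM server from the quick start."""
    from openai import OpenAI  # imported lazily so build_request has no deps

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(**build_request(prompt))
    msg = resp.choices[0].message
    # tool_calls is None when the model answers directly instead of calling a tool
    for call in msg.tool_calls or []:
        print(call.function.name, call.function.arguments)
    return msg
```

Executing the returned tool call and feeding its result back as a `tool` message is left to the agent framework in use.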
Model Variants and Specifications
| Model | Parameters | Type | Size (bf16) | Best For |
|---|---|---|---|---|
| Ornith-1.0-9B | ~9B | Dense | ~19GB | Edge deployment, single GPU |
| Ornith-1.0-31B | ~31B | Dense | ~62GB | High-performance single-node |
| Ornith-1.0-35B | ~35B (3B active) | MoE | ~70GB | Best efficiency/performance ratio |
| Ornith-1.0-397B | ~397B (37B active) | MoE | ~794GB | Frontier-scale agentic coding |
All models ship with FP8 and GGUF builds for faster local serving. The 35B and 397B MoE models use sparse expert routing, activating only a subset of experts per token for efficient inference.
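The bf16 sizes in the table follow from simple arithmetic: bf16 stores each parameter in 2 bytes, and FP8 in 1 byte. The helper below is a rough estimate under stated assumptions: it uses the nominal parameter counts, treats 1 GB as 1e9 bytes, and covers weights only, not the KV cache or activations. `weight_size_gb` is an illustrative name.

```python
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_size_gb(num_params: float, dtype: str = "bf16") -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Nominal parameter counts from the table above.
    nominal = {"9B": 9e9, "31B": 31e9, "35B": 35e9, "397B": 397e9}
    for name, params in nominal.items():
        print(name, weight_size_gb(params, "bf16"), weight_size_gb(params, "fp8"))
```

The estimate reproduces the 62GB, 70GB, and 794GB figures; the nominal 9B count gives 18GB against the listed ~19GB, presumably because the true parameter count is slightly above the rounded figure. For the MoE variants, all experts still have to be resident in memory even though only the active subset is computed per token.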
Why Ornith-1.0 Matters
1. Self-Improving by Design
Unlike traditional coding agents that rely on static, human-designed scaffolds, Ornith-1.0's scaffolds evolve with the model. This means the orchestration logic improves alongside the model's raw capabilities, creating a compounding effect.
2. Open-Weight and MIT Licensed
All models are released under the MIT license, enabling full commercial and research use. No API keys, no rate limits, no vendor lock-in. This is genuine open-source AI infrastructure.
3. Edge to Frontier Coverage
From a 9B model that runs on a single GPU to a 397B MoE flagship, the Ornith-1.0 family covers the full deployment spectrum. Whether you're building a local coding assistant or an enterprise-grade agentic platform, there's a model for you.
4. Real-World Agentic Performance
The benchmarks aren't just academic exercises. Terminal-Bench 2.1 tests real command-line agentic tasks. SWE-Bench Verified tests multi-file GitHub issue resolution. These are the tasks that matter for production coding agents.
The Competitive Landscape
Ornith-1.0 enters a crowded field of open-weight coding models. Let's see how it stacks up:
- vs. GLM-5.2-744B: GLM leads on Terminal-Bench 2.1 (81.0 vs 77.5) but is text-only and lacks the self-scaffolding innovation. Ornith-1.0 is more versatile for agentic workflows.
- vs. MiniMax-M3-428B: MiniMax M3 is natively multimodal (video, image, desktop) but trails Ornith-1.0 on pure coding benchmarks (64.0 vs 77.5 on TB-2.1).
- vs. DeepSeek-V4-Pro-1.6T: DeepSeek's flagship is massive but underperforms Ornith-1.0 on both Terminal-Bench 2.1 and SWE-Bench Verified (64.0 vs 77.5, 80.6 vs 82.4).
- vs. Claude Opus 4.8: The closed-weight champion still leads on every benchmark where its score is listed, but Ornith-1.0-397B is competitive, and fully open.
Conclusion
Ornith-1.0 represents a paradigm shift in how we train coding agents. By treating the scaffold as a learnable object that co-evolves with the model's policy, DeepReinforce has created a self-improving system that gets better at both solving tasks and orchestrating the solution process.
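The results speak for themselves. A 9B model that edges out Gemma4-31B on every listed benchmark. A 35B model that beats Qwen 3.5-397B on Terminal-Bench 2.1. A 397B flagship that tops Claude Opus 4.7 on Terminal-Bench 2.1 and SWE-Bench Verified and challenges Claude Opus 4.8. All under MIT license, all ready to deploy.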
For developers, researchers, and AI engineers, Ornith-1.0 is not just another model release—it's a new way of thinking about agentic coding. The future of AI-assisted software engineering just got a lot more interesting.
📖 Tech Blog: deep-reinforce.com/ornith_1_0.html
🤗 Hugging Face: huggingface.co/collections/deepreinforce-ai/ornith-10
📄 License: MIT (Full commercial and research use)
Sources: DeepReinforce Blog, MarkTechPost, Hugging Face Model Hub, X/Twitter (@rohanpaul_ai, @TechByMarkandey), AI Weekly, Reddit r/LocalLLaMA