Gemma 4 Fine-Tuning: Production Recipes for 2026
> QLoRA, DeepSpeed full fine-tuning, and function calling fine-tune recipes. Data validation, hyperparameter tuning, and deployment monitoring.
Published: May 2026
Author: Essa Mamdani
Category: ML Engineering / Model Training
Read Time: 16 minutes
The Fine-Tuning Landscape in 2026
Pre-trained models are commodities. Fine-tuned models are competitive advantages.
Gemma 4's Apache 2.0 license means you can modify, redistribute, and commercialize your fine-tuned variants without legal ambiguity. This article covers production-grade fine-tuning—from data preparation to deployment—based on real workloads at scale.
Choosing Your Fine-Tuning Strategy
| Strategy | Parameters Updated | VRAM (9B) | VRAM (27B) | Use Case |
|---|---|---|---|---|
| Full Fine-tuning | 100% | 36 GB | 108 GB | New domain (medical, legal) |
| LoRA | 0.1-1% | 12 GB | 36 GB | Task adaptation (classification, extraction) |
| QLoRA | 0.1-1% | 6 GB | 18 GB | Rapid prototyping, consumer GPUs |
| DoRA | 0.2-2% | 14 GB | 42 GB | Higher accuracy, slightly more compute |
| Adapter Fusion | N/A (inference only) | Base model | Base model | Multi-task without retraining |
Rule of thumb: Start with QLoRA. Move to full fine-tuning only if QLoRA doesn't converge.
Data Preparation: The 80% That Matters
Conversation Format
Gemma 4 expects specific chat templates. Get this wrong and training is wasted:
```python
# Correct format for Gemma 4 instruction tuning
{
    "messages": [
        {
            "role": "system",
            "content": "You are a technical support specialist for a PostgreSQL database product."
        },
        {
            "role": "user",
            "content": "I'm getting 'connection refused' errors. What should I check?"
        },
        {
            "role": "assistant",
            "content": "Let's troubleshoot systematically:\n\n1. **Verify PostgreSQL is running**..."
        }
    ]
}
```
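Rather than hand-building turn markers, you can let the tokenizer render the template for you. A minimal sketch, assuming the `google/gemma-4-9b` checkpoint name used in the recipes below:

```python
# Sketch: render a messages list with the tokenizer's own chat template
# before tokenization. Checkpoint name assumed from Recipe 1 below.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-9b")

messages = [
    {"role": "user", "content": "I'm getting 'connection refused' errors. What should I check?"},
    {"role": "assistant", "content": "Let's troubleshoot systematically: ..."},
]

# Produces a single training-ready string with the model's turn markers
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```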
Data Quality Checks
```python
import pandas as pd
from transformers import AutoTokenizer

class DataValidator:
    def __init__(self, model_name="google/gemma-4-9b"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.max_length = 8192

    def validate_dataset(self, dataset_path: str):
        df = pd.read_json(dataset_path, lines=True)
        issues = []

        for idx, row in df.iterrows():
            messages = row["messages"]

            # Check 1: Valid roles
            valid_roles = {"system", "user", "assistant", "tool"}
            for msg in messages:
                if msg["role"] not in valid_roles:
                    issues.append(f"Row {idx}: Invalid role '{msg['role']}'")

            # Check 2: Conversation alternates user/assistant
            for i in range(1, len(messages)):
                if messages[i]["role"] == messages[i - 1]["role"]:
                    issues.append(f"Row {idx}: Duplicate roles at positions {i - 1},{i}")

            # Check 3: Token count within limits
            full_text = "\n".join(m["content"] for m in messages)
            tokens = self.tokenizer.encode(full_text)
            if len(tokens) > self.max_length:
                issues.append(f"Row {idx}: {len(tokens)} tokens > {self.max_length} limit")

            # Check 4: Assistant responses aren't empty
            assistant_msgs = [m for m in messages if m["role"] == "assistant"]
            if not all(len(m["content"]) > 20 for m in assistant_msgs):
                issues.append(f"Row {idx}: Short assistant response detected")

        return issues
```
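A typical invocation before every training run, assuming the `training_data.jsonl` filename used in the recipes below:

```python
# Run the checks before every training job; abort if anything surfaces.
validator = DataValidator()
issues = validator.validate_dataset("training_data.jsonl")  # filename assumed
if issues:
    raise ValueError(f"{len(issues)} data issues found:\n" + "\n".join(issues[:20]))
```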
Data Augmentation for Small Datasets
If you have fewer than 1,000 examples, augment intelligently:
```python
from copy import deepcopy

class DataAugmenter:
    def __init__(self, model):
        # Any instruction-tuned LLM client with a .generate() method works here
        self.model = model

    def paraphrase_user_queries(self, dataset, num_variations=3):
        """Generate semantic variations of user prompts."""

        paraphrase_prompt = """Generate {n} different ways a user might ask this question.
Keep the intent identical but vary phrasing, formality, and length:

Original: {original}

Variations:"""

        augmented = []
        for example in dataset:
            user_msg = next(m for m in example["messages"] if m["role"] == "user")

            variations = self.model.generate(
                paraphrase_prompt.format(n=num_variations, original=user_msg["content"])
            )

            for variation in variations:
                new_example = deepcopy(example)
                user_msg_idx = next(
                    i for i, m in enumerate(new_example["messages"]) if m["role"] == "user"
                )
                new_example["messages"][user_msg_idx]["content"] = variation
                augmented.append(new_example)

        return dataset + augmented
```
Fine-Tuning Implementations
Recipe 1: QLoRA with Unsloth (Fastest)
```python
# unsloth_trainer.py
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load Gemma 4 with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-4-9b",
    max_seq_length=8192,
    load_in_4bit=True,
    fast_inference=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=64,               # LoRA rank (higher = more capacity)
    lora_alpha=128,     # Scaling factor (typically 2x rank)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Prepare dataset
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Training configuration
training_args = TrainingArguments(
    output_dir="./gemma-4-9b-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    learning_rate=2e-4,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    seed=42,
    report_to="wandb",
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="messages",
    max_seq_length=8192,
    args=training_args,
)

# Train
trainer.train()

# Save adapters
model.save_pretrained("./gemma-4-9b-lora")
tokenizer.save_pretrained("./gemma-4-9b-lora")
```
Performance on RTX 4090 (24GB):
- 9B model: ~45 minutes per epoch (10K samples)
- 27B model: ~2.5 hours per epoch (with gradient checkpointing)
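After training, the saved adapters can be loaded back for a quick smoke test. A minimal sketch using Unsloth's inference mode, assuming the adapter directory saved in Recipe 1:

```python
# Sketch: reload the LoRA adapters from Recipe 1 and generate once.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./gemma-4-9b-lora",   # adapter checkpoint saved above
    max_seq_length=8192,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to generation mode

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "My database is slow"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=256)[0]))
```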
Recipe 2: Multi-GPU Full Fine-tuning (DeepSpeed)
For domains where LoRA isn't enough (medical imaging reports, legal contracts):
`deepspeed_config.json`:

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```
```python
# full_finetune.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-27b",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-27b")

dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# DeepSpeed ZeRO-2 splits optimizer states across GPUs:
# 4x A100 80GB can full fine-tune the 27B model
training_args = TrainingArguments(
    output_dir="./gemma-4-27b-full",
    deepspeed="./deepspeed_config.json",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,  # Lower LR for full fine-tuning
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    save_strategy="epoch",
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
)

trainer.train()
```
Recipe 3: Function Calling Fine-tune
If you're building agents, fine-tune specifically for tool use:
```python
# Prepare tool-use training data (one JSONL row)
{
    "messages": [
        {"role": "user", "content": "Book me a flight to Tokyo next Tuesday"},
        {"role": "assistant", "content": "", "tool_calls": [
            {"name": "search_flights", "arguments": {"destination": "Tokyo", "date": "2026-05-12"}}
        ]},
        {"role": "tool", "name": "search_flights", "content": "[...flight data...]"},
        {"role": "assistant", "content": "I found 3 flights... Which would you prefer?"}
    ]
}

# Training with tool schemas embedded
import json

tool_schema = json.dumps([{
    "type": "function",
    "function": {
        "name": "search_flights",
        "parameters": {...}
    }
}])

# Prepend schema to system prompt
system_msg = f"You have access to these tools: {tool_schema}\nUse them when appropriate."
```
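Tool-call rows fail silently when the recorded arguments drift from the declared schema, so it's worth validating them before training. A sketch using the `jsonschema` package; the spelled-out `search_flights` schema below is hypothetical, since the example above elides its parameters:

```python
# Sketch: validate tool_calls arguments against their JSON Schema before training.
# The search_flights schema here is hypothetical; substitute your real tool schemas.
from jsonschema import validate, ValidationError

TOOL_SCHEMAS = {
    "search_flights": {
        "type": "object",
        "properties": {
            "destination": {"type": "string"},
            "date": {"type": "string"},
        },
        "required": ["destination", "date"],
    }
}

def validate_tool_calls(example: dict) -> list:
    errors = []
    for msg in example["messages"]:
        for call in msg.get("tool_calls", []):
            schema = TOOL_SCHEMAS.get(call["name"])
            if schema is None:
                errors.append(f"Unknown tool: {call['name']}")
                continue
            try:
                validate(instance=call["arguments"], schema=schema)
            except ValidationError as e:
                errors.append(f"{call['name']}: {e.message}")
    return errors
```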
Hyperparameter Tuning
Learning Rate Finder
```python
import matplotlib.pyplot as plt
from torch.optim import AdamW

class LRFinder:
    def find(self, model, dataset, min_lr=1e-6, max_lr=1e-3, num_steps=100):
        optimizer = AdamW(model.parameters(), lr=min_lr)

        # Exponential LR increase
        lr_mult = (max_lr / min_lr) ** (1 / num_steps)
        lr = min_lr

        losses = []
        lrs = []

        for step, batch in enumerate(dataset):
            if step >= num_steps:
                break

            optimizer.param_groups[0]["lr"] = lr

            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            losses.append(loss.item())
            lrs.append(lr)

            lr *= lr_mult

        # Plot loss vs. learning rate on a log axis
        plt.plot(lrs, losses)
        plt.xscale("log")
        plt.xlabel("Learning Rate")
        plt.ylabel("Loss")

        # Optimal LR is typically 10x lower than the minimum-loss point
        optimal_idx = losses.index(min(losses))
        optimal_lr = lrs[optimal_idx] / 10

        return optimal_lr
```
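Usage is a single call over a dataloader of tokenized batches; the names below are illustrative:

```python
# Illustrative usage: train_dataloader must yield dicts of tokenized tensors
# (input_ids, attention_mask, labels) that the model's forward() accepts.
finder = LRFinder()
optimal_lr = finder.find(model, train_dataloader)
print(f"Suggested learning rate: {optimal_lr:.2e}")
```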
Optimal Configurations by Task
| Task | Model | Rank | Alpha | LR | Epochs | Batch Size |
|---|---|---|---|---|---|---|
| Classification | 9B | 8 | 16 | 3e-4 | 3 | 8 |
| Chat/QA | 9B | 64 | 128 | 2e-4 | 3 | 4 |
| Code generation | 27B | 128 | 256 | 1e-4 | 5 | 2 |
| Function calling | 9B | 32 | 64 | 2e-4 | 4 | 4 |
| Summarization | 27B | 16 | 32 | 2e-4 | 2 | 4 |
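Each table row maps directly onto a `peft` LoRA configuration. A minimal sketch for the Chat/QA row (rank 64, alpha 128), assuming the same target modules as Recipe 1:

```python
# Sketch: the Chat/QA row from the table expressed as a peft LoraConfig.
# Target modules are assumed to match Recipe 1.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-4-9b", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=64,             # Rank column
    lora_alpha=128,   # Alpha column (2x rank)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # confirms the ~0.1-1% trainable fraction
```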
Evaluation That Actually Predicts Production Performance
Automated Evaluation Pipeline
```python
from deepeval import evaluate
from deepeval.metrics import GEval, FaithfulnessMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

class ProductionEvaluator:
    def __init__(self, base_model, finetuned_model):
        self.base = base_model
        self.finetuned = finetuned_model
        self.metrics = [
            GEval(
                name="Correctness",
                criteria="Determine if the response is factually correct",
                evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
            ),
            FaithfulnessMetric(threshold=0.7),
        ]

    def evaluate_on_test_set(self, test_cases: list):
        """Compare base vs fine-tuned on identical inputs."""

        results = []
        for case in test_cases:
            base_output = self.base.generate(case["input"])
            finetuned_output = self.finetuned.generate(case["input"])

            test_case = LLMTestCase(
                input=case["input"],
                actual_output=finetuned_output,
                expected_output=case["expected"],
                retrieval_context=case.get("context", []),
            )

            scores = evaluate([test_case], self.metrics)

            # score_output: task-specific scorer for the base model's raw text
            base_score = self.score_output(base_output, case["expected"])

            results.append({
                "input": case["input"][:100],
                "base_score": base_score,
                "finetuned_score": scores[0].score,
                "improvement": scores[0].score - base_score,
                "regression": scores[0].score < 0.5,
            })

        return results
```
Regression Testing
```python
class RegressionTester:
    def test_for_catastrophic_forgetting(self, base_model, finetuned_model):
        """Ensure the model didn't forget general capabilities."""

        general_tasks = [
            "Explain quantum computing in simple terms",
            "Write a Python function to reverse a string",
            "What is the capital of Australia?",
            "Summarize the theory of relativity",
        ]

        regressions = []
        for task in general_tasks:
            base_output = base_model.generate(task)
            finetuned_output = finetuned_model.generate(task)

            similarity = self.semantic_similarity(base_output, finetuned_output)

            if similarity < 0.85:  # Significant divergence
                regressions.append({
                    "task": task,
                    "base": base_output[:200],
                    "finetuned": finetuned_output[:200],
                    "similarity": similarity,
                })

        return regressions
```
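The `semantic_similarity` helper referenced above isn't shown. A minimal sketch using sentence-transformers, which you could attach to `RegressionTester` as a method; the embedding model choice is an assumption:

```python
# Sketch of the semantic_similarity helper: cosine similarity of sentence
# embeddings. The all-MiniLM-L6-v2 model is an assumption; any embedder works.
from sentence_transformers import SentenceTransformer, util

_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(text_a: str, text_b: str) -> float:
    emb = _embedder.encode([text_a, text_b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```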
Deployment of Fine-Tuned Models
Merge Adapters for Production
```python
from peft import AutoPeftModelForCausalLM

# Load fine-tuned model (base + adapters)
model = AutoPeftModelForCausalLM.from_pretrained(
    "./gemma-4-9b-finetuned",
    device_map="auto",
)

# Merge LoRA weights into the base model for single-file deployment
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./gemma-4-9b-merged")

# Result: a single ~18 GB checkpoint, no adapter-loading overhead
# 15-20% faster inference than adapter mode
```
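A quick sanity check that the merged checkpoint loads standalone, without `peft`; the tokenizer is loaded from the base checkpoint here, on the assumption it wasn't saved alongside the merged weights:

```python
# Sketch: confirm the merged checkpoint loads without peft and still responds.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-4-9b")  # assumed base tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "./gemma-4-9b-merged", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "My database is slow"}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=128)[0], skip_special_tokens=True))
```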
vLLM Deployment with Fine-Tuned Model
```bash
# Serve merged model with vLLM
python -m vllm.entrypoints.openai.api_server \
    --model ./gemma-4-9b-merged \
    --served-model-name gemma-4-customer-support \
    --max-model-len 16384 \
    --quantization fp8 \
    --gpu-memory-utilization 0.92

# Test
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gemma-4-customer-support",
        "messages": [{"role": "user", "content": "My database is slow"}]
    }'
```
Monitoring Fine-Tuned Models in Production
```python
import re
import time

from prometheus_client import Histogram, Gauge

class ProductionMonitor:
    def __init__(self, model):
        # Serving client for the fine-tuned model (anything with .generate())
        self.model = model
        self.inference_latency = Histogram(
            "gemma4_inference_latency_seconds",
            "End-to-end inference time",
            ["model_version", "quantization"],
        )
        self.token_throughput = Gauge(
            "gemma4_tokens_per_second",
            "Generation speed",
            ["model_version"],
        )
        self.drift_detector = DriftDetector(window_size=1000)  # see sketch below

    def monitor_inference(self, model_version, quantization, prompt):
        # Track latency
        start = time.time()
        output = self.model.generate(prompt)
        latency = time.time() - start

        self.inference_latency.labels(
            model_version=model_version,
            quantization=quantization,
        ).observe(latency)

        # Detect output drift
        self.drift_detector.add_sample(output)
        if self.drift_detector.detect_drift():
            self.alert_engineer("Output distribution drift detected!")  # alerting hook

        # Track hallucination indicators
        if self.contains_unexpected_patterns(output):
            self.log_suspicious_output(prompt, output)  # logging hook

        return output

    def contains_unexpected_patterns(self, text: str) -> bool:
        """Detect potential fine-tuning degradation."""
        indicators = [
            r"I'm sorry, but I",         # Generic refusal patterns returning
            r"As an AI language model",  # Base model leakage
            r"\[object Object\]",        # JSON formatting errors
            r"undefined|null",           # Code generation regression
        ]
        return any(re.search(pattern, text) for pattern in indicators)
```
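The `DriftDetector` referenced above isn't shown. A minimal sketch that compares recent output-length statistics against a rolling baseline; the 20% divergence threshold and minimum sample counts are assumptions to tune per workload:

```python
# Minimal DriftDetector sketch: flags drift when the mean length of recent
# outputs diverges >20% from a rolling baseline. Thresholds are assumptions.
from collections import deque
import statistics

class DriftDetector:
    def __init__(self, window_size=1000):
        self.baseline = deque(maxlen=window_size)
        self.recent = deque(maxlen=window_size // 10)

    def add_sample(self, output: str):
        length = len(output.split())
        self.recent.append(length)
        self.baseline.append(length)

    def detect_drift(self) -> bool:
        if len(self.baseline) < 100 or len(self.recent) < 20:
            return False  # not enough data yet
        baseline_mean = statistics.mean(self.baseline)
        recent_mean = statistics.mean(self.recent)
        return abs(recent_mean - baseline_mean) / max(baseline_mean, 1e-9) > 0.2
```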
The Bottom Line
Fine-tuning Gemma 4 in 2026 is remarkably accessible:
- QLoRA on consumer GPUs (RTX 4090): 9B models in under an hour per epoch
- Unsloth: 2x faster training, 50% less memory than standard PEFT
- DeepSpeed: Full fine-tuning of 27B models on 4x A100s
- Apache 2.0: Ship your fine-tuned model without legal review
The key differentiator isn't the model—it's your data quality and evaluation rigor. A 9B model fine-tuned on 5K high-quality examples beats a 27B model fine-tuned on generic data.
Start with QLoRA. Validate with regression tests. Deploy merged models. Monitor for drift.
That's the production recipe.
Essa Mamdani is the creator of AutoBlogging.Pro and has fine-tuned more language models than he cares to count.
Follow: essa.mamdani.com | GitHub: @essamamdani