Gemma 4 Fine-Tuning: Production Recipes for 2026

> QLoRA, DeepSpeed full fine-tuning, and function calling fine-tune recipes. Data validation, hyperparameter tuning, and deployment monitoring.


Published: May 2026
Author: Essa Mamdani
Category: ML Engineering / Model Training
Read Time: 16 minutes


The Fine-Tuning Landscape in 2026

Pre-trained models are commodities. Fine-tuned models are competitive advantages.

Gemma 4's Apache 2.0 license means you can modify, redistribute, and commercialize your fine-tuned variants without legal ambiguity. This article covers production-grade fine-tuning—from data preparation to deployment—based on real workloads at scale.


Choosing Your Fine-Tuning Strategy

| Strategy | Parameters Updated | VRAM (9B) | VRAM (27B) | Use Case |
|---|---|---|---|---|
| Full fine-tuning | 100% | 36 GB | 108 GB | New domain (medical, legal) |
| LoRA | 0.1-1% | 12 GB | 36 GB | Task adaptation (classification, extraction) |
| QLoRA | 0.1-1% | 6 GB | 18 GB | Rapid prototyping, consumer GPUs |
| DoRA | 0.2-2% | 14 GB | 42 GB | Higher accuracy, slightly more compute |
| Adapter Fusion | N/A (inference only) | Base model | Base model | Multi-task without retraining |

Rule of thumb: Start with QLoRA. Move to full fine-tuning only if QLoRA doesn't converge.


Data Preparation: The 80% That Matters

Conversation Format

Gemma 4 expects specific chat templates. Get this wrong and training is wasted:

python
# Correct format for Gemma 4 instruction tuning
{
    "messages": [
        {
            "role": "system",
            "content": "You are a technical support specialist for a PostgreSQL database product."
        },
        {
            "role": "user",
            "content": "I'm getting 'connection refused' errors. What should I check?"
        },
        {
            "role": "assistant",
            "content": "Let's troubleshoot systematically:\n\n1. **Verify PostgreSQL is running**..."
        }
    ]
}
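
Before training, it helps to render a record through the tokenizer's own chat template rather than hand-building strings, and to eyeball the result. A minimal sketch, assuming the Gemma 4 tokenizer ships a chat template the way current Gemma tokenizers do (the checkpoint name follows this article's examples):

python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-9b")

record = {
    "messages": [
        {"role": "user", "content": "I'm getting 'connection refused' errors. What should I check?"},
        {"role": "assistant", "content": "Let's troubleshoot systematically: ..."},
    ]
}

# Render with the model's template; add_generation_prompt=False because this is
# training text, not an inference prompt.
text = tokenizer.apply_chat_template(
    record["messages"], tokenize=False, add_generation_prompt=False
)
print(text)  # inspect the turn markers the template inserts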

Data Quality Checks

python
import pandas as pd
from transformers import AutoTokenizer

class DataValidator:
    def __init__(self, model_name="google/gemma-4-9b"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.max_length = 8192

    def validate_dataset(self, dataset_path: str):
        df = pd.read_json(dataset_path, lines=True)
        issues = []

        for idx, row in df.iterrows():
            messages = row["messages"]

            # Check 1: Valid roles
            valid_roles = {"system", "user", "assistant", "tool"}
            for msg in messages:
                if msg["role"] not in valid_roles:
                    issues.append(f"Row {idx}: Invalid role '{msg['role']}'")

            # Check 2: Conversation alternates user/assistant
            for i in range(1, len(messages)):
                if messages[i]["role"] == messages[i-1]["role"]:
                    issues.append(f"Row {idx}: Duplicate roles at positions {i-1},{i}")

            # Check 3: Token count within limits
            full_text = "\n".join(m["content"] for m in messages)
            tokens = self.tokenizer.encode(full_text)
            if len(tokens) > self.max_length:
                issues.append(f"Row {idx}: {len(tokens)} tokens > {self.max_length} limit")

            # Check 4: Assistant responses aren't empty
            assistant_msgs = [m for m in messages if m["role"] == "assistant"]
            if not all(len(m["content"]) > 20 for m in assistant_msgs):
                issues.append(f"Row {idx}: Short assistant response detected")

        return issues
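
A short usage sketch: run the validator before any GPU time is spent, and refuse to train if it reports problems (the dataset path matches the training examples later in this article):

python
# Fail fast on data issues before burning GPU hours
validator = DataValidator()
issues = validator.validate_dataset("training_data.jsonl")

if issues:
    for issue in issues[:20]:  # show a sample of the problems
        print(issue)
    raise SystemExit(f"{len(issues)} data issues found - fix them before training")
print("Dataset passed validation")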

Data Augmentation for Small Datasets

If you have <1000 examples, augment intelligently:

python
from copy import deepcopy

class DataAugmenter:
    def __init__(self, model):
        # Any instruction-following model works here (the fine-tuning target or a larger "teacher")
        self.model = model

    def paraphrase_user_queries(self, dataset, num_variations=3):
        """Generate semantic variations of user prompts."""
        paraphrase_prompt = (
            "Generate {n} different ways a user might ask this question. "
            "Keep the intent identical but vary phrasing, formality, and length:\n\n"
            "Original: {original}\n\nVariations:"
        )

        augmented = []
        for example in dataset:
            user_msg = next(m for m in example["messages"] if m["role"] == "user")

            variations = self.model.generate(
                paraphrase_prompt.format(n=num_variations, original=user_msg["content"])
            )

            for variation in variations:
                new_example = deepcopy(example)
                user_msg_idx = next(
                    i for i, m in enumerate(new_example["messages"]) if m["role"] == "user"
                )
                new_example["messages"][user_msg_idx]["content"] = variation
                augmented.append(new_example)

        return dataset + augmented
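
Paraphrase generation can also produce near-identical variations, so it is worth deduplicating the augmented set before training. The helper below is not part of the article's pipeline, just a minimal sketch that hashes normalized user turns:

python
import hashlib
import re

def dedupe_examples(examples):
    """Keep only the first example for each normalized user prompt."""
    seen = set()
    kept = []
    for ex in examples:
        user_text = " ".join(m["content"] for m in ex["messages"] if m["role"] == "user")
        normalized = re.sub(r"\s+", " ", user_text.strip().lower())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept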

Fine-Tuning Implementations

Recipe 1: QLoRA with Unsloth (Fastest)

python
# unsloth_trainer.py
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load Gemma 4 with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-4-9b",
    max_seq_length=8192,
    load_in_4bit=True,
    fast_inference=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=64,              # LoRA rank (higher = more capacity)
    lora_alpha=128,    # Scaling factor (typically 2x rank)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Prepare dataset
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Training configuration
training_args = TrainingArguments(
    output_dir="./gemma-4-9b-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    learning_rate=2e-4,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    seed=42,
    report_to="wandb",
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="messages",
    max_seq_length=8192,
    args=training_args,
)

# Train
trainer.train()

# Save adapters
model.save_pretrained("./gemma-4-9b-lora")
tokenizer.save_pretrained("./gemma-4-9b-lora")

Performance on RTX 4090 (24GB):

  • 9B model: ~45 minutes per epoch (10K samples)
  • 27B model: ~2.5 hours per epoch (with gradient checkpointing)
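
Before moving on to formal evaluation, a quick smoke test of the saved adapters catches broken templates or silent training failures. A sketch using Unsloth's inference mode (paths and prompt are taken from the examples above):

python
# Reload the adapters saved above and generate one response as a sanity check
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./gemma-4-9b-lora",   # adapter directory from the recipe above
    max_seq_length=8192,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to inference-optimized kernels

messages = [{"role": "user", "content": "I'm getting 'connection refused' errors. What should I check?"}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))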

Recipe 2: Multi-GPU Full Fine-tuning (DeepSpeed)

For domains where LoRA isn't enough (medical imaging reports, legal contracts), save the ZeRO-2 configuration below as deepspeed_config.json and reference it from TrainingArguments:

json
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto"
}
python
# full_finetune.py -- launch with: deepspeed --num_gpus=4 full_finetune.py
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-27b",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-27b")

dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# DeepSpeed ZeRO-2 splits optimizer states across GPUs:
# 4x A100 80GB can full fine-tune the 27B model
training_args = TrainingArguments(
    output_dir="./gemma-4-27b-full",
    deepspeed="./deepspeed_config.json",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,  # Lower LR for full fine-tuning
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    save_strategy="epoch",
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
)

trainer.train()

Recipe 3: Function Calling Fine-tune

If you're building agents, fine-tune specifically for tool use:

python
import json

# Prepare tool-use training data (one example record)
{
    "messages": [
        {"role": "user", "content": "Book me a flight to Tokyo next Tuesday"},
        {"role": "assistant", "content": "", "tool_calls": [
            {"name": "search_flights", "arguments": {"destination": "Tokyo", "date": "2026-05-12"}}
        ]},
        {"role": "tool", "name": "search_flights", "content": "[...flight data...]"},
        {"role": "assistant", "content": "I found 3 flights... Which would you prefer?"}
    ]
}

# Training with tool schemas embedded
tool_schema = json.dumps([{
    "type": "function",
    "function": {
        "name": "search_flights",
        "parameters": {...}
    }
}])

# Prepend schema to system prompt
system_msg = f"You have access to these tools: {tool_schema}\nUse them when appropriate."
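
Malformed tool calls silently poison this kind of dataset, so a small consistency check helps: every assistant tool_call should reference a declared tool and carry serializable arguments. A hedged sketch (validate_tool_calls is a hypothetical helper; tools is the same schema list embedded in the system prompt above):

python
import json

def validate_tool_calls(example, tools):
    """Flag tool calls that don't match the declared schemas."""
    declared = {t["function"]["name"] for t in tools}
    problems = []
    for msg in example["messages"]:
        for call in msg.get("tool_calls", []):
            if call["name"] not in declared:
                problems.append(f"Unknown tool: {call['name']}")
            try:
                json.dumps(call["arguments"])
            except TypeError:
                problems.append(f"Non-serializable arguments for {call['name']}")
    return problems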

Hyperparameter Tuning

Learning Rate Finder

python
from torch.optim import AdamW
import matplotlib.pyplot as plt

class LRFinder:
    def find(self, model, dataset, min_lr=1e-6, max_lr=1e-3, num_steps=100):
        optimizer = AdamW(model.parameters(), lr=min_lr)

        # Exponential LR increase
        lr_mult = (max_lr / min_lr) ** (1 / num_steps)
        lr = min_lr

        losses = []
        lrs = []

        for step, batch in enumerate(dataset):
            if step >= num_steps:
                break

            optimizer.param_groups[0]["lr"] = lr

            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            losses.append(loss.item())
            lrs.append(lr)

            lr *= lr_mult

        # Plot and identify steepest descent
        plt.plot(lrs, losses)
        plt.xscale("log")
        plt.xlabel("Learning Rate")
        plt.ylabel("Loss")
        plt.show()

        # Optimal LR is typically 10x lower than minimum loss point
        optimal_idx = losses.index(min(losses))
        optimal_lr = lrs[optimal_idx] / 10

        return optimal_lr
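
A usage sketch, assuming train_dataloader is a PyTorch DataLoader that yields batches the model accepts directly (input_ids, attention_mask, labels):

python
# Run the sweep on a small slice of data, then feed the result into TrainingArguments
finder = LRFinder()
optimal_lr = finder.find(model, train_dataloader, min_lr=1e-6, max_lr=1e-3)
print(f"Suggested learning rate: {optimal_lr:.2e}")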

Optimal Configurations by Task

| Task | Model | Rank | Alpha | LR | Epochs | Batch Size |
|---|---|---|---|---|---|---|
| Classification | 9B | 8 | 16 | 3e-4 | 3 | 8 |
| Chat/QA | 9B | 64 | 128 | 2e-4 | 3 | 4 |
| Code generation | 27B | 128 | 256 | 1e-4 | 5 | 2 |
| Function calling | 9B | 32 | 64 | 2e-4 | 4 | 4 |
| Summarization | 27B | 16 | 32 | 2e-4 | 2 | 4 |
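
These presets translate directly into a peft LoraConfig. The helper below is a hypothetical convenience, with target_modules mirroring the QLoRA recipe earlier in the article (learning rate and epochs belong in TrainingArguments, so only adapter settings are mapped):

python
from peft import LoraConfig

# Illustrative presets taken from the table above
PRESETS = {
    "classification_9b":   {"r": 8,   "lora_alpha": 16},
    "chat_qa_9b":          {"r": 64,  "lora_alpha": 128},
    "function_calling_9b": {"r": 32,  "lora_alpha": 64},
}

def lora_config_for(task: str) -> LoraConfig:
    preset = PRESETS[task]
    return LoraConfig(
        r=preset["r"],
        lora_alpha=preset["lora_alpha"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
    )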

Evaluation That Actually Predicts Production Performance

Automated Evaluation Pipeline

python
from deepeval import evaluate
from deepeval.metrics import GEval, FaithfulnessMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

class ProductionEvaluator:
    def __init__(self, base_model, finetuned_model):
        self.base = base_model
        self.finetuned = finetuned_model
        self.metrics = [
            GEval(
                name="Correctness",
                criteria="Determine if the response is factually correct",
                evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
            ),
            FaithfulnessMetric(threshold=0.7),
        ]

    def evaluate_on_test_set(self, test_cases: list):
        """Compare base vs fine-tuned on identical inputs."""
        results = []
        for case in test_cases:
            base_output = self.base.generate(case["input"])
            finetuned_output = self.finetuned.generate(case["input"])

            test_case = LLMTestCase(
                input=case["input"],
                actual_output=finetuned_output,
                expected_output=case["expected"],
                retrieval_context=case.get("context", []),
            )

            scores = evaluate([test_case], self.metrics)

            # score_output: project-specific reference scorer (e.g. exact match or
            # embedding similarity against case["expected"]) used to grade the base model
            results.append({
                "input": case["input"][:100],
                "base_score": self.score_output(base_output, case["expected"]),
                "finetuned_score": scores[0].score,
                "improvement": scores[0].score - self.score_output(base_output, case["expected"]),
                "regression": scores[0].score < 0.5,
            })

        return results
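
The per-example results are easiest to act on when rolled up into a single go/no-go signal; the thresholds below are illustrative, not from the article:

python
def summarize(results):
    """Aggregate evaluator output into a release decision."""
    n = max(len(results), 1)
    mean_improvement = sum(r["improvement"] for r in results) / n
    regression_rate = sum(r["regression"] for r in results) / n
    return {
        "examples": len(results),
        "mean_improvement": round(mean_improvement, 3),
        "regression_rate": round(regression_rate, 3),
        "ship": mean_improvement > 0.05 and regression_rate < 0.02,  # illustrative thresholds
    }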

Regression Testing

python
class RegressionTester:
    def test_for_catastrophic_forgetting(self, base_model, finetuned_model):
        """Ensure model didn't forget general capabilities"""

        general_tasks = [
            "Explain quantum computing in simple terms",
            "Write a Python function to reverse a string",
            "What is the capital of Australia?",
            "Summarize the theory of relativity"
        ]

        regressions = []
        for task in general_tasks:
            base_output = base_model.generate(task)
            finetuned_output = finetuned_model.generate(task)

            similarity = self.semantic_similarity(base_output, finetuned_output)

            if similarity < 0.85:  # Significant divergence
                regressions.append({
                    "task": task,
                    "base": base_output[:200],
                    "finetuned": finetuned_output[:200],
                    "similarity": similarity
                })

        return regressions
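
The semantic_similarity helper is left undefined above; one possible implementation uses sentence-transformers embeddings (the embedding model name is illustrative):

python
from sentence_transformers import SentenceTransformer, util

_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(a: str, b: str) -> float:
    """Cosine similarity between two responses."""
    emb = _embedder.encode([a, b], convert_to_tensor=True, normalize_embeddings=True)
    return float(util.cos_sim(emb[0], emb[1]))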

Deployment of Fine-Tuned Models

Merge Adapters for Production

python
from peft import AutoPeftModelForCausalLM

# Load fine-tuned model (base + the adapters saved earlier as ./gemma-4-9b-lora)
model = AutoPeftModelForCausalLM.from_pretrained(
    "./gemma-4-9b-lora",
    device_map="auto"
)

# Merge LoRA weights into base model for single-file deployment
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./gemma-4-9b-merged")

# Result: single 18GB file, no adapter loading overhead
# 15-20% faster inference than adapter mode

vLLM Deployment with Fine-Tuned Model

bash
# Serve merged model with vLLM
python -m vllm.entrypoints.openai.api_server \
    --model ./gemma-4-9b-merged \
    --served-model-name gemma-4-customer-support \
    --max-model-len 16384 \
    --quantization fp8 \
    --gpu-memory-utilization 0.92

# Test
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-customer-support",
    "messages": [{"role": "user", "content": "My database is slow"}]
  }'

Monitoring Fine-Tuned Models in Production

python
import re
import time
from prometheus_client import Histogram, Gauge

class ProductionMonitor:
    def __init__(self):
        self.inference_latency = Histogram(
            "gemma4_inference_latency_seconds",
            "End-to-end inference time",
            ["model_version", "quantization"],
        )
        self.token_throughput = Gauge(
            "gemma4_tokens_per_second",
            "Generation speed",
            ["model_version"],
        )
        self.drift_detector = DriftDetector(window_size=1000)  # sketched below

    def monitor_inference(self, model, model_version, quantization, prompt):
        # Track latency around the actual generation call
        start = time.time()
        output = model.generate(prompt)
        latency = time.time() - start

        self.inference_latency.labels(
            model_version=model_version,
            quantization=quantization,
        ).observe(latency)

        # Detect output drift
        self.drift_detector.add_sample(output)
        if self.drift_detector.detect_drift():
            self.alert_engineer("Output distribution drift detected!")  # your paging/alerting hook

        # Track hallucination indicators
        if self.contains_unexpected_patterns(output):
            self.log_suspicious_output(prompt, output)  # your structured logging hook

        return output

    def contains_unexpected_patterns(self, text: str) -> bool:
        """Detect potential fine-tuning degradation"""
        indicators = [
            r"I'm sorry, but I",  # Generic refusal patterns returning
            r"As an AI language model",  # Base model leakage
            r"\[object Object\]",  # JSON formatting errors
            r"undefined|null",  # Code generation regression
        ]
        return any(re.search(pattern, text) for pattern in indicators)
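
The DriftDetector referenced above is assumed rather than shown; a minimal sketch watches the rolling distribution of output lengths (a production detector would look at richer signals such as embeddings, refusal rate, or format validity):

python
from collections import deque
import statistics

class DriftDetector:
    """Minimal drift check over a rolling window of output lengths."""

    def __init__(self, window_size=1000):
        self.window = deque(maxlen=window_size)
        self.baseline_mean = None

    def add_sample(self, output: str):
        self.window.append(len(output.split()))
        # Freeze the baseline once the window first fills up
        if self.baseline_mean is None and len(self.window) == self.window.maxlen:
            self.baseline_mean = statistics.mean(self.window)

    def detect_drift(self, tolerance=0.3) -> bool:
        if self.baseline_mean is None:
            return False  # not enough data yet
        current = statistics.mean(self.window)
        return abs(current - self.baseline_mean) / self.baseline_mean > tolerance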

The Bottom Line

Fine-tuning Gemma 4 in 2026 is remarkably accessible:

  1. QLoRA on consumer GPUs (RTX 4090): 9B models in under an hour
  2. Unsloth: 2x faster training, 50% less memory than standard PEFT
  3. DeepSpeed: Full fine-tuning of 27B models on 4x A100s
  4. Apache 2.0: Ship your fine-tuned model without legal review

The key differentiator isn't the model—it's your data quality and evaluation rigor. A 9B model fine-tuned on 5K high-quality examples beats a 27B model on generic data.

Start with QLoRA. Validate with regression tests. Deploy merged models. Monitor for drift.

That's the production recipe.


Essa Mamdani is the creator of AutoBlogging.Pro and has fine-tuned more language models than he cares to count.

Follow: essa.mamdani.com | GitHub: @essamamdani

Tags: AI, Gemma 4, Fine-tuning, ML Engineering