February 19, 2026 · 12 min
Fine-tuning Mistral-7B for Domain-Specific Engineering Tasks
A deep dive into how I fine-tuned Mistral-7B using LoRA adapters to power domain-specific code assistance inside JarvisX — without expensive cloud inference.
TL;DR: Off-the-shelf LLMs are generalists. For JarvisX, I needed a model that thought like a software engineer — understanding project context, code conventions, and toolchains without being told every time. Here's how I fine-tuned Mistral-7B using LoRA to get there.
The Problem with General-Purpose LLMs
When I started building JarvisX — my local AI development assistant — I initially used Mistral-7B straight out of the box via Ollama. The results were decent for generic tasks, but the model constantly struggled with:
- Project-specific naming conventions — it couldn't retain or infer my coding style
- Framework-specific patterns — Next.js App Router, Prisma schema conventions, and TypeScript generics were often handled incorrectly
- Engineering vocabulary — when I asked about system architecture decisions, responses felt too generic
The fix? Fine-tune the model on engineering-focused datasets, specifically curated around the kinds of tasks I actually do daily.
Why Mistral-7B?
Before choosing a base model, I evaluated several options:
| Model | Parameters | Context Window | Local Inference Speed |
|-------|------------|----------------|-----------------------|
| LLaMA 2 | 7B | 4K | Fast |
| Mistral-7B | 7B | 8K | Very Fast |
| CodeLlama | 7B–34B | 16K | Medium |
| Phi-2 | 2.7B | 2K | Extremely Fast |
Mistral-7B won for three reasons:
- Sliding Window Attention — efficient for longer code context
- Strong base reasoning — outperforms LLaMA 2 13B on most benchmarks at 7B
- Excellent community support — HuggingFace ecosystem, GGUF quantization support via llama.cpp
The Fine-tuning Approach: LoRA
Fully fine-tuning a 7B-parameter model requires 80GB+ of VRAM. That's not realistic on consumer hardware. Instead, I used LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique that:
- Freezes the original model weights
- Inserts small trainable adapter matrices at each attention layer
- Reduces trainable parameters from ~7B down to ~4–8M
```python
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=16,             # rank of the adapter matrices
    lora_alpha=32,    # scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 3,752,071,168 || trainable%: 0.1118
```
Dataset Curation
This is where most guides skip over the hard part. Dataset quality matters more than any hyperparameter tuning.
Sources I Used
- The Stack (Code subset) — filtered for TypeScript, Python, JavaScript
- Synthetic Q&A pairs — I generated 2,000+ instruction-response pairs from my own projects using GPT-4o as a labeler
- Architecture decisions — manually written JSONL pairs covering system design scenarios I commonly face
Format: Instruction Tuning
```json
{
  "instruction": "Explain the best approach to implement multi-tenant database isolation in a Next.js SaaS app using Prisma and PostgreSQL.",
  "input": "",
  "output": "For multi-tenant SaaS with Prisma and PostgreSQL, you have three main strategies:\n\n1. **Row-Level Security (RLS)** — Add a `tenantId` column to every table and use PostgreSQL RLS policies...\n2. **Schema-per-tenant** — Each tenant gets their own PostgreSQL schema...\n3. **Database-per-tenant** — The most isolated but most expensive approach...\n\nFor most SaaS apps at early scale, RLS with Prisma middleware is the sweet spot..."
}
```
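Before training, each record has to be flattened into the single string the trainer reads from `dataset_text_field="text"`. A minimal formatting sketch, using a generic Alpaca-style template (the exact template is my assumption; any consistent template works, as long as inference uses the same one):

```python
def format_example(record: dict) -> str:
    """Flatten an instruction record into one training string.

    Alpaca-style sections are an assumption, not confirmed by the post.
    """
    if record.get("input"):
        return (
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n"
            f"### Response:\n{record['output']}"
        )
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['output']}"
    )

record = {
    "instruction": "Explain Prisma middleware.",
    "input": "",
    "output": "Prisma middleware intercepts queries before they reach the database...",
}
text = format_example(record)
```

Mapping this over the raw JSONL (e.g. with `datasets.Dataset.map`) produces the `text` column the trainer below expects.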
Training Setup
I trained on a cloud GPU instance (A100 40GB) for cost efficiency, keeping inference local after training.
```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./mistral-7b-engineering-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16
    warmup_steps=100,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    optim="paged_adamw_32bit",
    lr_scheduler_type="cosine",
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=lora_config,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=training_args,
)

trainer.train()
```
Training stats:
- Duration: ~4.5 hours on A100 40GB
- Final training loss: 0.87
- Dataset size: ~18,000 samples after deduplication
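The post mentions deduplication but not the method. A minimal sketch of exact deduplication by hashing normalized text, which is the usual first pass (near-duplicate detection, e.g. MinHash, would be a further step beyond what's shown here):

```python
import hashlib

def dedupe(samples: list[dict]) -> list[dict]:
    """Drop exact duplicates, keyed on a hash of whitespace/case-normalized text."""
    seen: set[str] = set()
    unique = []
    for s in samples:
        # Collapse whitespace and lowercase so trivial variants hash identically.
        key_src = " ".join(s["text"].lower().split())
        key = hashlib.sha256(key_src.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

samples = [
    {"text": "Use RLS for multi-tenant isolation."},
    {"text": "use  RLS for multi-tenant   isolation."},  # trivial variant
    {"text": "Prefer schema-per-tenant at larger scale."},
]
deduped = dedupe(samples)  # keeps 2 of the 3 samples
```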
Converting for Local Inference
After training, I merged the LoRA adapters into the base model and converted to GGUF for llama.cpp / Ollama:
```bash
# Merge adapters
python merge_peft_model.py \
  --base_model mistralai/Mistral-7B-v0.1 \
  --peft_model ./mistral-7b-engineering-lora \
  --output_dir ./merged_model

# Convert to GGUF (convert.py emits f16/f32; K-quants need the quantize tool)
python llama.cpp/convert.py ./merged_model \
  --outtype f16 \
  --outfile mistral-engineering-f16.gguf

# Quantize to Q4_K_M
./llama.cpp/quantize mistral-engineering-f16.gguf mistral-engineering-q4.gguf Q4_K_M
```
The resulting Q4_K_M quantized model runs at ~35 tokens/second on Apple M2 Pro — fast enough for real-time code assistance.
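To serve the quantized GGUF through Ollama, a minimal Modelfile along these lines works. The model name, sampling parameters, and context size here are illustrative assumptions, not values from the post:

```shell
# Write a minimal Modelfile pointing at the local GGUF
cat > Modelfile <<'EOF'
FROM ./mistral-engineering-q4.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
EOF

# Register the model with Ollama and run it locally
ollama create mistral-engineering -f Modelfile
ollama run mistral-engineering "Generate a Prisma schema for a multi-tenant blog."
```

If a prompt template was used during fine-tuning (e.g. Alpaca-style sections), adding a matching `TEMPLATE` directive to the Modelfile keeps inference consistent with training.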
Results: Before vs After
| Task | Base Mistral-7B | Fine-tuned |
|------|-----------------|------------|
| Prisma schema generation | Generic, often wrong | Accurate, with relations |
| Next.js App Router patterns | Outdated (Pages Router style) | Correct App Router syntax |
| Architecture recommendations | Too broad | Project-specific, opinionated |
| TypeScript generic inference | Moderate | Strong |
Lessons Learned
- Garbage in, garbage out — I spent 60% of total time on dataset curation. Worth every minute.
- Start with rank 8, not 16 — Lower rank = less overfitting on small datasets. I had to retrain once.
- Perplexity isn't everything — Always test on real prompts from your actual use case.
- Quantization trade-offs — Q4_K_M gives the best quality/speed balance for 7B models. Q5 adds ~15% quality improvement at ~20% speed cost.
What's Next
The fine-tuned model now powers the code assistance engine inside JarvisX. Next, I'm experimenting with:
- DPO (Direct Preference Optimization) to make the model prefer concise, well-structured code over verbose responses
- RAG integration to give the model access to project-specific documentation at inference time
If you're building local AI tools, fine-tuning is genuinely accessible now. You don't need a research lab — just good data and a clear use case.
Found this useful? Check out JarvisX on GitHub or my portfolio for more projects.