February 19, 2026 · 12 min

Fine-tuning Mistral-7B for Domain-Specific Engineering Tasks

A deep dive into how I fine-tuned Mistral-7B using LoRA adapters to power domain-specific code assistance inside JarvisX — without expensive cloud inference.

TL;DR: Off-the-shelf LLMs are generalists. For JarvisX, I needed a model that thought like a software engineer — understanding project context, code conventions, and toolchains without being told every time. Here's how I fine-tuned Mistral-7B using LoRA to get there.


The Problem with General-Purpose LLMs

When I started building JarvisX — my local AI development assistant — I initially used Mistral-7B straight out of the box via Ollama. The results were decent for generic tasks, but the model constantly struggled with:

  • Project-specific naming conventions — it couldn't retain or infer my coding style
  • Framework-specific patterns — Next.js App Router, Prisma schema conventions, and TypeScript generics were often handled incorrectly
  • Engineering vocabulary — when I asked about system architecture decisions, responses felt too generic

The fix? Fine-tune the model on engineering-focused datasets, specifically curated around the kinds of tasks I actually do daily.


Why Mistral-7B?

Before choosing a base model, I evaluated several options:

| Model | Parameters | Context Window | Local Inference Speed |
|-------|-----------|----------------|-----------------------|
| LLaMA 2 | 7B | 4K | Fast |
| Mistral-7B | 7B | 8K | Very Fast |
| CodeLlama | 7B–34B | 16K | Medium |
| Phi-2 | 2.7B | 2K | Extremely Fast |

Mistral-7B won for three reasons:

  1. Sliding Window Attention — efficient for longer code context
  2. Strong base reasoning — outperforms LLaMA 2 13B on most benchmarks at 7B
  3. Excellent community support — HuggingFace ecosystem, GGUF quantization support via llama.cpp

The Fine-tuning Approach: LoRA

Fully fine-tuning a 7B-parameter model requires 80GB+ of VRAM, since the weights, gradients, and optimizer states all have to fit in memory at once. That's not realistic on consumer hardware. Instead, I used LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique that:

  • Freezes the original model weights
  • Inserts small trainable adapter matrices at each attention layer
  • Cuts trainable parameters from ~7B down to a few million (well under 1% of the model)

Here's the adapter configuration using Hugging Face's PEFT library:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=16,                        # Rank of the adapter matrices
    lora_alpha=32,               # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 3,752,071,168 || trainable%: 0.1118
# (the "all params" count reflects 4-bit packed weights; the full model is ~7B)
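To see where the savings come from, here's the arithmetic for a single 4096×4096 projection matrix (illustrative dimensions, not a full accounting of every Mistral layer):

```python
# LoRA parameter arithmetic for one weight matrix.
# Full fine-tuning updates every entry of W (d_out x d_in);
# LoRA trains only the two factors B (d_out x r) and A (r x d_in).
d_in, d_out, r = 4096, 4096, 16

full_params = d_out * d_in            # updated by full fine-tuning
lora_params = r * (d_in + d_out)      # parameters in the adapter pair
ratio = lora_params / full_params

print(full_params, lora_params, f"{ratio:.2%}")  # 16777216 131072 0.78%
```

Multiply that across every targeted projection in every layer and you still end up with a fraction of a percent of the model being trainable.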

Dataset Curation

This is where most guides skip over the hard part. Dataset quality matters more than any hyperparameter tuning.

Sources I Used

  1. The Stack (Code subset) — filtered for TypeScript, Python, JavaScript
  2. Synthetic Q&A pairs — I generated 2,000+ instruction-response pairs from my own projects using GPT-4o as a labeler
  3. Architecture decisions — manually written JSONL pairs covering system design scenarios I commonly face

Format: Instruction Tuning

{
  "instruction": "Explain the best approach to implement multi-tenant database isolation in a Next.js SaaS app using Prisma and PostgreSQL.",
  "input": "",
  "output": "For multi-tenant SaaS with Prisma and PostgreSQL, you have three main strategies:\n\n1. **Row-Level Security (RLS)** — Add a `tenantId` column to every table and use PostgreSQL RLS policies...\n2. **Schema-per-tenant** — Each tenant gets their own PostgreSQL schema...\n3. **Database-per-tenant** — The most isolated but most expensive approach...\n\nFor most SaaS apps at early scale, RLS with Prisma middleware is the sweet spot..."
}
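These records then get flattened into the single text field that SFTTrainer reads (`dataset_text_field="text"`). A minimal sketch, using Mistral's `[INST]` prompt format as an assumed template — adjust if yours differs:

```python
# Flatten an instruction record into the "text" field SFTTrainer consumes.
# The [INST] wrapping assumes Mistral's chat template; other formats work too.
def to_text(rec: dict) -> str:
    prompt = rec["instruction"]
    if rec.get("input"):                    # optional extra context
        prompt += "\n\n" + rec["input"]
    return f"<s>[INST] {prompt} [/INST] {rec['output']}</s>"

sample = {
    "instruction": "Explain LoRA in one sentence.",
    "input": "",
    "output": "LoRA freezes the base weights and trains small low-rank adapters.",
}
print(to_text(sample))
```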

Training Setup

I trained on a rented cloud GPU (an A100 40GB) to keep costs down, then brought the finished model back to local hardware for inference.

from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./mistral-7b-engineering-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    optim="paged_adamw_32bit",
    lr_scheduler_type="cosine",
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=lora_config,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=training_args,
)

trainer.train()
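One detail worth calling out: with gradient accumulation, the effective batch size is bigger than the per-device number suggests.

```python
# Effective batch size implied by the TrainingArguments above (single GPU)
per_device_train_batch_size = 4
gradient_accumulation_steps = 4

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16 sequences per optimizer step
```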

Training stats:

  • Duration: ~4.5 hours on A100 40GB
  • Final training loss: 0.87
  • Dataset size: ~18,000 samples after deduplication
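For the deduplication step, even a simple exact-match pass over normalized text catches a surprising amount. A minimal sketch (a simplified stand-in, not the exact script I ran — it won't catch near-duplicates):

```python
# Drop exact duplicates after normalizing whitespace and case.
# Simplified stand-in for a real dedup pipeline; near-duplicates slip through.
import hashlib

def dedupe(samples: list[str]) -> list[str]:
    seen, kept = set(), []
    for text in samples:
        key = hashlib.sha256(" ".join(text.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept

data = ["def foo(): pass", "def  foo():  pass", "def bar(): pass"]
print(len(dedupe(data)))  # 2 -- the reformatted copy of foo is dropped
```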

Converting for Local Inference

After training, I merged the LoRA adapters into the base model and converted to GGUF for llama.cpp / Ollama:

# Merge adapters
python merge_peft_model.py \
  --base_model mistralai/Mistral-7B-v0.1 \
  --peft_model ./mistral-7b-engineering-lora \
  --output_dir ./merged_model

# Convert to GGUF (convert.py emits f16/f32; K-quants need a second step)
python llama.cpp/convert.py ./merged_model \
  --outtype f16 \
  --outfile mistral-engineering-f16.gguf

# Quantize to Q4_K_M
./llama.cpp/quantize mistral-engineering-f16.gguf \
  mistral-engineering-q4.gguf Q4_K_M

The resulting Q4_K_M quantized model runs at ~35 tokens/second on Apple M2 Pro — fast enough for real-time code assistance.
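To serve it through Ollama, a short Modelfile pointing at the GGUF is all it takes (model name and parameters here are illustrative):

```shell
# Write a Modelfile pointing at the quantized GGUF (illustrative names)
cat > Modelfile <<'EOF'
FROM ./mistral-engineering-q4.gguf
PARAMETER temperature 0.2
EOF

# Then register and run it:
#   ollama create mistral-engineering -f Modelfile
#   ollama run mistral-engineering "Generate a Prisma schema for a blog"
```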


Results: Before vs After

| Task | Base Mistral-7B | Fine-tuned |
|------|----------------|------------|
| Prisma schema generation | Generic, often wrong | Accurate, with relations |
| Next.js App Router patterns | Outdated (Pages Router style) | Correct App Router syntax |
| Architecture recommendations | Too broad | Project-specific, opinionated |
| TypeScript generic inference | Moderate | Strong |


Lessons Learned

  1. Garbage in, garbage out — I spent 60% of total time on dataset curation. Worth every minute.
  2. Start with rank 8, not 16 — Lower rank = less overfitting on small datasets. I had to retrain once.
  3. Perplexity isn't everything — Always test on real prompts from your actual use case.
  4. Quantization trade-offs — Q4_K_M gives the best quality/speed balance for 7B models. Q5 adds ~15% quality improvement at ~20% speed cost.
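For point 3, a lightweight keyword smoke test over real prompts goes a long way. A sketch, with `generate()` standing in for whatever inference call you wire up (Ollama, llama.cpp, etc.):

```python
# Keyword smoke test over real prompts; generate() is a stand-in
# for your actual inference call.
CASES = [
    ("Create a Prisma model User with an email field",
     ["model User", "email"]),
]

def run_smoke_tests(generate) -> list:
    failures = []
    for prompt, must_contain in CASES:
        output = generate(prompt)
        missing = [kw for kw in must_contain if kw not in output]
        if missing:
            failures.append((prompt, missing))
    return failures                    # empty list means every case passed

# Example with a canned response:
fake = lambda p: "model User {\n  id    Int    @id\n  email String\n}"
print(run_smoke_tests(fake))  # []
```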

What's Next

The fine-tuned model now powers the code assistance engine inside JarvisX. Next, I'm experimenting with:

  • DPO (Direct Preference Optimization) to make the model prefer concise, well-structured code over verbose responses
  • RAG integration to give the model access to project-specific documentation at inference time

If you're building local AI tools, fine-tuning is genuinely accessible now. You don't need a research lab — just good data and a clear use case.


Found this useful? Check out JarvisX on GitHub or my portfolio for more projects.