February 19, 2026 · 15 min
Step-by-Step Guide to LoRA Fine-Tuning on Mistral Models
A complete, practical guide to fine-tuning Mistral-7B using LoRA adapters — from environment setup through dataset preparation, training, and running your fine-tuned model locally.
TL;DR: This is the guide I wish I had when I started fine-tuning LLMs. No theory fluff — just the exact steps, code, and commands to fine-tune Mistral-7B using LoRA and run it locally with Ollama.
Prerequisites
Before starting, you need:
- Python 3.10+ (3.11 recommended)
- GPU access — at minimum an A10G (24GB VRAM) on a cloud instance, ideally an A100 (40GB)
- HuggingFace account — for model downloads and dataset uploads
- ~50GB disk space — for model weights and training artifacts
For cloud GPU access, I use Lambda Labs (~$1.10/hr for A10G) or Vast.ai for cost efficiency.
Step 1: Environment Setup
# Create virtual environment
python -m venv lora-env
source lora-env/bin/activate
# Install dependencies
pip install -q \
transformers==4.38.2 \
peft==0.10.0 \
trl==0.8.1 \
datasets==2.18.0 \
bitsandbytes==0.43.0 \
accelerate==0.28.0 \
einops \
scipy
# Login to HuggingFace (needed for gated models)
huggingface-cli login
Verify GPU access:
import torch
print(torch.cuda.is_available()) # True
print(torch.cuda.get_device_name(0)) # e.g. "NVIDIA A100-SXM4-40GB"
print(torch.cuda.get_device_properties(0).total_memory / 1e9) # ~40.0
Step 2: Prepare Your Dataset
Your dataset should be a JSONL file with instruction-response pairs. Each line is one training sample.
Format
{"instruction": "Explain async/await in TypeScript.", "input": "", "output": "Async/await in TypeScript is syntactic sugar over Promises..."}
{"instruction": "Write a Prisma schema for a User with posts.", "input": "", "output": "model User {\n id String @id @default(cuid())\n posts Post[]\n}\n\nmodel Post {\n id String @id @default(cuid())\n userId String\n user User @relation(fields: [userId], references: [id])\n}"}
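A quick schema check catches malformed lines before they waste a training run. The following is a minimal sketch under the format above; the `validate_line` and `validate_file` helper names are mine, not from any library:

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_line(line: str) -> dict:
    """Parse one JSONL line and check it has the expected schema."""
    sample = json.loads(line)
    missing = REQUIRED_KEYS - sample.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if not sample["instruction"].strip() or not sample["output"].strip():
        raise ValueError("instruction and output must be non-empty")
    return sample

def validate_file(path: str) -> int:
    """Return the number of valid samples; raise on the first bad line."""
    with open(path) as f:
        return sum(1 for line in f if line.strip() and validate_line(line))
```

Running `validate_file('training_data.jsonl')` before training gives you a sample count and fails fast on any broken line.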
Creating the Dataset
If you're generating synthetic data with GPT-4o (as a labeler):
from openai import OpenAI
import json
client = OpenAI()
def generate_training_sample(topic: str) -> dict:
    """Generate a single instruction-response training sample."""
    # First generate an instruction
    instruction_response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{
            'role': 'user',
            'content': f'Generate a programming instruction/question specifically about {topic}. Return just the question.'
        }]
    )
    instruction = instruction_response.choices[0].message.content
    # Then generate a high-quality answer
    answer_response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{
            'role': 'user',
            'content': f'{instruction}\n\nProvide a detailed, expert-level answer with code examples where appropriate.'
        }]
    )
    answer = answer_response.choices[0].message.content
    return {'instruction': instruction, 'input': '', 'output': answer}
# Generate 2000 samples
topics = ['TypeScript generics', 'Next.js App Router', 'Prisma ORM', 'React hooks', ...]
samples = [generate_training_sample(topic) for topic in topics * 40]
with open('training_data.jsonl', 'w') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')
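Synthetic generation tends to produce near-duplicate instructions, so a cheap dedupe pass before training is worth it. A minimal sketch, keyed on whitespace-normalized instruction text (the `dedupe_samples` helper is hypothetical, not from any library):

```python
def dedupe_samples(samples: list[dict]) -> list[dict]:
    """Keep the first occurrence of each normalized instruction."""
    seen, unique = set(), []
    for s in samples:
        # Lowercase and collapse whitespace so trivial variants collide
        key = " ".join(s["instruction"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique
```

This only catches exact normalized duplicates; fuzzier near-duplicate detection (e.g. embedding similarity) is a separate exercise.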
Formatting for Training
Mistral uses the [INST] prompt format:
def format_instruction(sample: dict) -> str:
    """Convert to Mistral instruction format."""
    if sample['input']:
        return f"<s>[INST] {sample['instruction']}\n\nContext: {sample['input']} [/INST] {sample['output']}</s>"
    return f"<s>[INST] {sample['instruction']} [/INST] {sample['output']}</s>"
# Apply formatting
from datasets import load_dataset
dataset = load_dataset('json', data_files='training_data.jsonl', split='train')
dataset = dataset.map(lambda x: {'text': format_instruction(x)})
# Split into train/test
dataset = dataset.train_test_split(test_size=0.05, seed=42)
print(f"Train: {len(dataset['train'])} samples")
print(f"Test: {len(dataset['test'])} samples")
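Before choosing `max_seq_length` in Step 5, it helps to know how long your formatted samples actually are. A small sketch, where `tokenize` stands in for something like `tokenizer.encode` and the `length_percentile` helper name is mine:

```python
def length_percentile(texts, tokenize, pct=0.95):
    """Return the pct-percentile token count across texts."""
    lengths = sorted(len(tokenize(t)) for t in texts)
    idx = min(len(lengths) - 1, int(pct * len(lengths)))
    return lengths[idx]

# e.g. length_percentile(dataset["train"]["text"], tokenizer.encode)
# tells you whether max_seq_length=2048 would truncate many samples
```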
Step 3: Load the Base Model with Quantization
4-bit quantization (QLoRA) cuts the VRAM needed just to hold the model weights from ~28GB (full precision) to ~6GB:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
MODEL_NAME = "mistralai/Mistral-7B-v0.1"
# QLoRA config: 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",  # Automatically distribute across GPUs
    trust_remote_code=True,
)
model.config.use_cache = False # Disable KV cache (not needed for training)
model.config.pretraining_tp = 1
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # Mistral doesn't have a pad token
tokenizer.padding_side = "right"
print(f"Model loaded. Parameters: {model.num_parameters() / 1e9:.1f}B")
# Model loaded. Parameters: 7.2B
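The quantization savings can be sanity-checked with back-of-the-envelope arithmetic on the weights alone (activations, gradients, and optimizer state come on top; the `weight_gib` helper is hypothetical):

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory taken by the model weights alone."""
    return n_params * bits_per_param / 8 / 2**30

# 7.2B params: roughly 13.4 GiB at 16-bit vs roughly 3.4 GiB at 4-bit,
# which is where most of the QLoRA VRAM savings comes from
```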
Step 4: Configure LoRA
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
# Prepare model for LoRA (handles quantization layers)
model = prepare_model_for_kbit_training(model)
# LoRA configuration
peft_config = LoraConfig(
    r=16,               # Rank: higher = more capacity but more memory
    lora_alpha=32,      # Scaling factor (usually 2x rank)
    lora_dropout=0.05,  # Dropout for regularization
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=[    # Which attention/MLP projections to adapt
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# trainable params: 20,971,520 || all params: 3,773,554,688 || trainable%: 0.5556
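The trainable-parameter count follows directly from the LoRA construction: each adapted weight matrix gains two low-rank factors. A worked sketch (exact totals depend on which modules are targeted and on how peft counts the quantized base weights; the `lora_params` helper is mine):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter adds two factors per matrix: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

# Mistral-7B's q_proj maps 4096 -> 4096, so with r=16 each q_proj
# gains 16 * (4096 + 4096) = 131,072 trainable weights
```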
Step 5: Training
from transformers import TrainingArguments
from trl import SFTTrainer
training_args = TrainingArguments(
    output_dir="./mistral-lora-finetuned",
    # Training dynamics
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
    gradient_checkpointing=True,    # Save VRAM at cost of speed
    # Optimizer
    optim="paged_adamw_32bit",  # More VRAM-efficient than AdamW
    learning_rate=2e-4,
    weight_decay=0.001,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    # Mixed precision
    fp16=False,
    bf16=True,  # Use bfloat16 on A100 for stability
    # Logging & saving
    logging_steps=25,
    save_steps=500,
    save_total_limit=3,
    evaluation_strategy="steps",
    eval_steps=500,
    # Misc
    max_grad_norm=0.3,
    group_by_length=True,  # Group similar-length samples (faster training)
    report_to="tensorboard",
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=training_args,
    packing=False,  # Set True for faster training if samples are short
)
# Start training!
trainer.train()
# Save the fine-tuned adapter
trainer.save_model("./mistral-lora-adapter")
tokenizer.save_pretrained("./mistral-lora-adapter")
Monitoring Training
# In a separate terminal
tensorboard --logdir ./mistral-lora-finetuned/runs
# Open http://localhost:6006
Watch for:
- Training loss decreasing — should go from ~2.5 to ~0.8 over 3 epochs
- Eval loss not diverging — if eval loss goes up while train loss goes down, you're overfitting
- Learning rate curve — should follow cosine decay
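The overfitting check above can be automated against the logged losses. A minimal sketch (the `is_overfitting` heuristic is mine, not part of transformers; it just compares trends over a short window):

```python
def is_overfitting(train_losses, eval_losses, window=3):
    """Heuristic: flag when train loss keeps falling while eval loss rises."""
    if min(len(train_losses), len(eval_losses)) <= window:
        return False  # Not enough history to judge a trend
    train_delta = train_losses[-1] - train_losses[-1 - window]
    eval_delta = eval_losses[-1] - eval_losses[-1 - window]
    return train_delta < 0 and eval_delta > 0
```

If it fires, the usual remedies are fewer epochs, more data, or a higher `lora_dropout`.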
Step 6: Test Your Fine-tuned Model
from peft import PeftModel
from transformers import GenerationConfig
# Load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base_model, "./mistral-lora-adapter")
def generate_response(instruction: str, max_new_tokens=512) -> str:
    prompt = f"<s>[INST] {instruction} [/INST]"
    # The prompt already carries <s>, so don't let the tokenizer add a second BOS
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens; slicing the decoded string by
    # len(prompt) is fragile because skip_special_tokens drops the <s> token
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
# Test it!
response = generate_response("Write a Prisma schema for an e-commerce product with variants.")
print(response)
Step 7: Export for Local Use with Ollama
# Step 1: Merge LoRA weights into base model
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
print('Loading base model...')
base_model = AutoModelForCausalLM.from_pretrained('mistralai/Mistral-7B-v0.1', torch_dtype=torch.float16)
print('Loading adapter...')
model = PeftModel.from_pretrained(base_model, './mistral-lora-adapter')
print('Merging weights...')
model = model.merge_and_unload()
print('Saving merged model...')
model.save_pretrained('./mistral-merged')
tokenizer = AutoTokenizer.from_pretrained('./mistral-lora-adapter')
tokenizer.save_pretrained('./mistral-merged')
print('Done!')
"
# Step 2: Clone llama.cpp for GGUF conversion
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j
# Step 3: Convert to GGUF format
python convert.py ../mistral-merged \
--outtype f16 \
--outfile ../mistral-finetuned-f16.gguf
# Step 4: Quantize to Q4_K_M (best quality/size tradeoff)
./quantize ../mistral-finetuned-f16.gguf ../mistral-finetuned-q4km.gguf Q4_K_M
# Step 5: Create Ollama Modelfile
cat > Modelfile << 'EOF'
FROM ./mistral-finetuned-q4km.gguf
TEMPLATE """<s>[INST] {{ .Prompt }} [/INST]"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
EOF
# Step 6: Register with Ollama
ollama create mistral-custom -f Modelfile
# Step 7: Test!
ollama run mistral-custom "Write a Next.js API route for user authentication"
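Besides the CLI, Ollama serves an HTTP API on port 11434, which is handy for scripting evaluations against the fine-tuned model. A sketch of building the request body for `POST /api/generate` (the `ollama_payload` helper is hypothetical):

```python
import json

def ollama_payload(model: str, prompt: str, stream: bool = False) -> str:
    """Build the JSON body for Ollama's POST /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream})

# With the Ollama server running, POST this to
# http://localhost:11434/api/generate (e.g. via curl or urllib);
# with stream=False the response arrives as a single JSON object
```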
Common Issues and Fixes
| Issue | Fix |
|-------|-----|
| CUDA out of memory | Reduce per_device_train_batch_size to 2, increase gradient_accumulation_steps to 8 |
| Training loss not decreasing | Check dataset formatting — most common cause |
| Model gives generic responses | Dataset quality issue — add more domain-specific samples |
| Quantization errors | Use torch_dtype=torch.float16 consistently |
| Slow training | Enable packing=True if your samples are short; use group_by_length=True |
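The OOM fix in the table works because gradient accumulation preserves the effective batch size, so training dynamics stay comparable. A quick sketch of the arithmetic (the helper name is mine):

```python
def effective_batch_size(per_device: int, grad_accum: int, n_gpus: int = 1) -> int:
    """Samples whose gradients are averaged per optimizer step."""
    return per_device * grad_accum * n_gpus

# Both configs from the table yield the same effective batch size of 16:
# effective_batch_size(4, 4) == effective_batch_size(2, 8) == 16
```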
Expected Timeline
| Phase | Time |
|-------|------|
| Environment setup | 30 min |
| Dataset prep (2000 samples) | 3–6 hours |
| Training (3 epochs, A100) | 3–5 hours |
| Evaluation & testing | 1–2 hours |
| GGUF conversion + Ollama setup | 30 min |
| Total | ~12 hours |
Related post: Fine-tuning Mistral-7B for Domain-Specific Engineering Tasks
All code tested on: Python 3.11, transformers 4.38, peft 0.10, CUDA 12.1