February 19, 2026 · 15 min
Step-by-Step Guide to LoRA Fine-Tuning on Mistral Models
A complete, practical guide to fine-tuning Mistral-7B using LoRA adapters — from environment setup through dataset preparation, training, and running your fine-tuned model locally.
TL;DR: This is the guide I wish I had when I started fine-tuning LLMs. No theory fluff — just the exact steps, code, and commands to fine-tune Mistral-7B using LoRA and run it locally with Ollama.
Prerequisites
Before starting, you need:
- Python 3.10+ (3.11 recommended)
- GPU access — at minimum an A10G (24GB VRAM) on a cloud instance, ideally an A100 (40GB)
- HuggingFace account — for model downloads and dataset uploads
- ~50GB disk space — for model weights and training artifacts
For cloud GPU access, I use Lambda Labs (~$1.10/hr for A10G) or Vast.ai for cost efficiency.
Step 1: Environment Setup
# Create virtual environment
python -m venv lora-env
source lora-env/bin/activate
# Install dependencies
pip install -q \
transformers==4.38.2 \
peft==0.10.0 \
trl==0.8.1 \
datasets==2.18.0 \
bitsandbytes==0.43.0 \
accelerate==0.28.0 \
einops \
scipy
# Login to HuggingFace (needed for gated models)
huggingface-cli login
Verify GPU access:
import torch
print(torch.cuda.is_available()) # True
print(torch.cuda.get_device_name(0)) # e.g. "NVIDIA A100-SXM4-40GB"
print(torch.cuda.get_device_properties(0).total_memory / 1e9) # ~40.0
Step 2: Prepare Your Dataset
Your dataset should be a JSONL file with instruction-response pairs. Each line is one training sample.
Format
{"instruction": "Explain async/await in TypeScript.", "input": "", "output": "Async/await in TypeScript is syntactic sugar over Promises..."}
{"instruction": "Write a Prisma schema for a User with posts.", "input": "", "output": "model User {\n id String @id @default(cuid())\n posts Post[]\n}\n\nmodel Post {\n id String @id @default(cuid())\n userId String\n user User @relation(fields: [userId], references: [id])\n}"}
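A quick schema check catches malformed lines before they waste a training run. The following is a minimal sketch under the format above; the `validate_line` and `validate_file` helper names are mine, not from any library:

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_line(line: str) -> dict:
    """Parse one JSONL line and check it has the expected schema."""
    sample = json.loads(line)
    missing = REQUIRED_KEYS - sample.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if not sample["instruction"].strip() or not sample["output"].strip():
        raise ValueError("instruction and output must be non-empty")
    return sample

def validate_file(path: str) -> int:
    """Return the number of valid samples; raise on the first bad line."""
    with open(path) as f:
        return sum(1 for line in f if line.strip() and validate_line(line))
```

Running `validate_file('training_data.jsonl')` before training gives you a sample count and fails fast on any broken line.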
Creating the Dataset
If you're generating synthetic data with GPT-4o (as a labeler):
from openai import OpenAI
import json
client = OpenAI()
def generate_training_sample(topic: str) -> dict:
    """Generate a single instruction-response training sample."""
    # First generate an instruction
    instruction_response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{
            'role': 'user',
            'content': f'Generate a programming instruction/question specifically about {topic}. Return just the question.'
        }]
    )
    instruction = instruction_response.choices[0].message.content
    # Then generate a high-quality answer
    answer_response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{
            'role': 'user',
            'content': f'{instruction}\n\nProvide a detailed, expert-level answer with code examples where appropriate.'
        }]
    )
    answer = answer_response.choices[0].message.content
    return {'instruction': instruction, 'input': '', 'output': answer}
# Generate 2000 samples
topics = ['TypeScript generics', 'Next.js App Router', 'Prisma ORM', 'React hooks', ...]
samples = [generate_training_sample(topic) for topic in topics * 40]
with open('training_data.jsonl', 'w') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')
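Synthetic generation tends to produce near-duplicate instructions, so a cheap dedupe pass before training is worth it. A minimal sketch, keyed on whitespace-normalized instruction text (the `dedupe_samples` helper is hypothetical, not from any library):

```python
def dedupe_samples(samples: list[dict]) -> list[dict]:
    """Keep the first occurrence of each normalized instruction."""
    seen, unique = set(), []
    for s in samples:
        # Lowercase and collapse whitespace so trivial variants collide
        key = " ".join(s["instruction"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique
```

This only catches exact normalized duplicates; fuzzier near-duplicate detection (e.g. embedding similarity) is a separate exercise.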
Formatting for Training
Mistral uses the [INST] prompt format:
def format_instruction(sample: dict) -> str:
    """Convert to Mistral instruction format."""
    if sample['input']:
        return f"<s>[INST] {sample['instruction']}\n\nContext: {sample['input']} [/INST] {sample['output']}</s>"
    return f"<s>[INST] {sample['instruction']} [/INST] {sample['output']}</s>"
# Apply formatting
from datasets import load_dataset
dataset = load_dataset('json', data_files='training_data.jsonl', split='train')
dataset = dataset.map(lambda x: {'text': format_instruction(x)})
# Split into train/test
dataset = dataset.train_test_split(test_size=0.05, seed=42)
print(f"Train: {len(dataset['train'])} samples")
print(f"Test: {len(dataset['test'])} samples")
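Before choosing `max_seq_length` in Step 5, it helps to know how long your formatted samples actually are. A small sketch, where `tokenize` stands in for something like `tokenizer.encode` and the `length_percentile` helper name is mine:

```python
def length_percentile(texts, tokenize, pct=0.95):
    """Return the pct-percentile token count across texts."""
    lengths = sorted(len(tokenize(t)) for t in texts)
    idx = min(len(lengths) - 1, int(pct * len(lengths)))
    return lengths[idx]

# e.g. length_percentile(dataset["train"]["text"], tokenizer.encode)
# tells you whether max_seq_length=2048 would truncate many samples
```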
Step 3: Load the Base Model with Quantization
4-bit quantization (QLoRA) cuts the VRAM needed just to hold the model weights from ~28GB (full precision) to ~6GB:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
MODEL_NAME = "mistralai/Mistral-7B-v0.1"
# QLoRA config: 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",  # Automatically distribute across GPUs
    trust_remote_code=True,
)
model.config.use_cache = False # Disable KV cache (not needed for training)
model.config.pretraining_tp = 1
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # Mistral doesn't have a pad token
tokenizer.padding_side = "right"
print(f"Model loaded. Parameters: {model.num_parameters() / 1e9:.1f}B")
# Model loaded. Parameters: 7.2B
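The quantization savings can be sanity-checked with back-of-the-envelope arithmetic on the weights alone (activations, gradients, and optimizer state come on top; the `weight_gib` helper is hypothetical):

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory taken by the model weights alone."""
    return n_params * bits_per_param / 8 / 2**30

# 7.2B params: roughly 13.4 GiB at 16-bit vs roughly 3.4 GiB at 4-bit,
# which is where most of the QLoRA VRAM savings comes from
```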
Step 4: Configure LoRA
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
# Prepare model for LoRA (handles quantization layers)
model = prepare_model_for_kbit_training(model)
# LoRA configuration
peft_config = LoraConfig(
    r=16,               # Rank: higher = more capacity but more memory
    lora_alpha=32,      # Scaling factor (usually 2x rank)
    lora_dropout=0.05,  # Dropout for regularization
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=[    # Which attention/MLP projections to adapt
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# trainable params: 20,971,520 || all params: 3,773,554,688 || trainable%: 0.5556
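The trainable-parameter count follows directly from the LoRA construction: each adapted weight matrix gains two low-rank factors. A worked sketch (exact totals depend on which modules are targeted and on how peft counts the quantized base weights; the `lora_params` helper is mine):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter adds two factors per matrix: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

# Mistral-7B's q_proj maps 4096 -> 4096, so with r=16 each q_proj
# gains 16 * (4096 + 4096) = 131,072 trainable weights
```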
Step 5: Training
from transformers import TrainingArguments
from trl import SFTTrainer
training_args = TrainingArguments(
    output_dir="./mistral-lora-finetuned",
    # Training dynamics
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
    gradient_checkpointing=True,    # Save VRAM at cost of speed
    # Optimizer
    optim="paged_adamw_32bit",  # More VRAM-efficient than AdamW
    learning_rate=2e-4,
    weight_decay=0.001,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    # Mixed precision
    fp16=False,
    bf16=True,  # Use bfloat16 on A100 for stability
    # Logging & saving
    logging_steps=25,
    save_steps=500,
    save_total_limit=3,
    evaluation_strategy="steps",
    eval_steps=500,
    # Misc
    max_grad_norm=0.3,
    group_by_length=True,  # Group similar-length samples (faster training)
    report_to="tensorboard",
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=training_args,
    packing=False,  # Set True for faster training if samples are short
)
# Start training!
trainer.train()
# Save the fine-tuned adapter
trainer.save_model("./mistral-lora-adapter")
tokenizer.save_pretrained("./mistral-lora-adapter")
Monitoring Training
# In a separate terminal
tensorboard --logdir ./mistral-lora-finetuned/runs
# Open http://localhost:6006
Watch for:
- Training loss decreasing — should go from ~2.5 to ~0.8 over 3 epochs
- Eval loss not diverging — if eval loss goes up while train loss goes down, you're overfitting
- Learning rate curve — should follow cosine decay
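The overfitting check above can be automated against the logged losses. A minimal sketch (the `is_overfitting` heuristic is mine, not part of transformers; it just compares trends over a short window):

```python
def is_overfitting(train_losses, eval_losses, window=3):
    """Heuristic: flag when train loss keeps falling while eval loss rises."""
    if min(len(train_losses), len(eval_losses)) <= window:
        return False  # Not enough history to judge a trend
    train_delta = train_losses[-1] - train_losses[-1 - window]
    eval_delta = eval_losses[-1] - eval_losses[-1 - window]
    return train_delta < 0 and eval_delta > 0
```

If it fires, the usual remedies are fewer epochs, more data, or a higher `lora_dropout`.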
Step 6: Test Your Fine-tuned Model
from peft import PeftModel
from transformers import GenerationConfig
# Load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base_model, "./mistral-lora-adapter")
def generate_response(instruction: str, max_new_tokens=512) -> str:
    prompt = f"<s>[INST] {instruction} [/INST]"
    # The prompt already carries <s>, so don't let the tokenizer add a second BOS
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens; slicing the decoded string by
    # len(prompt) is fragile because skip_special_tokens drops the <s> token
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
# Test it!
response = generate_response("Write a Prisma schema for an e-commerce product with variants.")
print(response)
Step 7: Export for Local Use with Ollama
# Step 1: Merge LoRA weights into base model
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
print('Loading base model...')
base_model = AutoModelForCausalLM.from_pretrained('mistralai/Mistral-7B-v0.1', torch_dtype=torch.float16)
print('Loading adapter...')
model = PeftModel.from_pretrained(base_model, './mistral-lora-adapter')
print('Merging weights...')
model = model.merge_and_unload()
print('Saving merged model...')
model.save_pretrained('./mistral-merged')
tokenizer = AutoTokenizer.from_pretrained('./mistral-lora-adapter')
tokenizer.save_pretrained('./mistral-merged')
print('Done!')
"
# Step 2: Clone llama.cpp for GGUF conversion
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j
# Step 3: Convert to GGUF format
python convert.py ../mistral-merged \
--outtype f16 \
--outfile ../mistral-finetuned-f16.gguf
# Step 4: Quantize to Q4_K_M (best quality/size tradeoff)
./quantize ../mistral-finetuned-f16.gguf ../mistral-finetuned-q4km.gguf Q4_K_M
# Step 5: Create Ollama Modelfile
cat > Modelfile << 'EOF'
FROM ./mistral-finetuned-q4km.gguf
TEMPLATE """<s>[INST] {{ .Prompt }} [/INST]"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
EOF
# Step 6: Register with Ollama
ollama create mistral-custom -f Modelfile
# Step 7: Test!
ollama run mistral-custom "Write a Next.js API route for user authentication"
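Besides the CLI, Ollama serves an HTTP API on port 11434, which is handy for scripting evaluations against the fine-tuned model. A sketch of building the request body for `POST /api/generate` (the `ollama_payload` helper is hypothetical):

```python
import json

def ollama_payload(model: str, prompt: str, stream: bool = False) -> str:
    """Build the JSON body for Ollama's POST /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream})

# With the Ollama server running, POST this to
# http://localhost:11434/api/generate (e.g. via curl or urllib);
# with stream=False the response arrives as a single JSON object
```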
Common Issues and Fixes
| Issue | Fix |
|-------|-----|
| CUDA out of memory | Reduce per_device_train_batch_size to 2, increase gradient_accumulation_steps to 8 |
| Training loss not decreasing | Check dataset formatting — most common cause |
| Model gives generic responses | Dataset quality issue — add more domain-specific samples |
| Quantization errors | Use torch_dtype=torch.float16 consistently |
| Slow training | Enable packing=True if your samples are short; use group_by_length=True |
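The OOM fix in the table works because gradient accumulation preserves the effective batch size, so training dynamics stay comparable. A quick sketch of the arithmetic (the helper name is mine):

```python
def effective_batch_size(per_device: int, grad_accum: int, n_gpus: int = 1) -> int:
    """Samples whose gradients are averaged per optimizer step."""
    return per_device * grad_accum * n_gpus

# Both configs from the table yield the same effective batch size of 16:
# effective_batch_size(4, 4) == effective_batch_size(2, 8) == 16
```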
Expected Timeline
| Phase | Time |
|-------|------|
| Environment setup | 30 min |
| Dataset prep (2000 samples) | 3–6 hours |
| Training (3 epochs, A100) | 3–5 hours |
| Evaluation & testing | 1–2 hours |
| GGUF conversion + Ollama setup | 30 min |
| Total | ~12 hours |
Related post: Fine-tuning Mistral-7B for Domain-Specific Engineering Tasks
All code tested on: Python 3.11, transformers 4.38, peft 0.10, CUDA 12.1