February 19, 2026 · 16 min
JarvisX V2: From Fine-Tuning to Cloud + Local Deployment — A Full Case Study
A complete engineering case study: the problem, the architecture, the fine-tuning process, the VS Code extension, and hard lessons learned building JarvisX V2 — my personal AI development assistant.
TL;DR: JarvisX is my personal AI-powered development assistant. This is the complete story — from the problem that motivated it, to the architecture I built, the fine-tuning I did, and the real lessons across 8 months of building and using it daily.
The Problem I Was Solving
By mid-2024, I was using several AI tools simultaneously:
- ChatGPT for general questions
- GitHub Copilot for inline completions
- Claude for architecture discussions
Every tool lacked the same thing: context about my work. Each conversation started from zero. I was constantly explaining my tech stack, my project architecture, my coding conventions.
Worse, sensitive project code was going to cloud servers I didn't control.
I wanted one tool that:
- Knew my projects, tech stack, and conventions without being told
- Worked offline on restricted networks
- Integrated directly into VS Code without context-switching
- Could be fine-tuned to my specific domain
Phase 1: Defining the Architecture (Month 1–2)
Before writing code, I spent two weeks designing the system carefully. The decisions at this stage shaped everything that followed.
Core Architectural Decisions
Decision 1: Hybrid inference (local + cloud)
Local models for 80% of tasks, cloud models for complex architecture questions. This kept privacy high and cost low.
Decision 2: Persistent memory with vector embeddings
Every conversation, code snippet, and project decision gets embedded and stored in a local vector database. JarvisX retrieves relevant memories for every new query.
Decision 3: VS Code as the primary interface
Not a browser tab, not a separate app — embedded directly in the editor where the work happens.
Decision 4: Node.js server as the orchestrator
A local Node.js server mediates between the VS Code extension, the local models (Ollama), and optional cloud APIs. This keeps the extension simple and the intelligence server-side.
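To make Decision 4 concrete, here is a minimal sketch of what the extension-to-server contract could look like (the types, field names, and validator here are illustrative, not the actual JarvisX API):

```typescript
// Hypothetical sketch of the orchestrator's request/response contract.
// Field names are illustrative, not the real JarvisX API.
interface AskRequest {
  query: string;
  context: {
    currentFile?: string;
    selectedCode?: string;
    isOnline: boolean;
  };
}

interface AskResponse {
  answer: string;
  backend: "local" | "cloud"; // which model served the request
}

// The extension only validates and forwards; all intelligence is server-side.
function validateAskRequest(body: unknown): body is AskRequest {
  if (typeof body !== "object" || body === null) return false;
  const b = body as Record<string, unknown>;
  return typeof b.query === "string" && typeof b.context === "object" && b.context !== null;
}

console.log(validateAskRequest({ query: "explain this", context: { isOnline: true } })); // true
console.log(validateAskRequest({ context: {} })); // false
```

Keeping the extension to "validate and forward" is what made it possible to swap models and routing logic later without touching the VS Code side.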
Phase 2: The Fine-tuning Journey (Month 2–4)
Off-the-shelf Mistral-7B was good but not trained on the specific patterns I use daily. Fine-tuning fixed this.
Dataset Collection
I collected 18,000 training samples across:
- TypeScript/Next.js code patterns from my active projects
- Prisma schema examples with explanations
- System design Q&A pairs (generated via GPT-4o as a labeler)
- Architecture decision records (ADRs) I'd written
- Error analysis pairs (error → root cause → fix)
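For fine-tuning, samples like these end up serialized as instruction-style JSONL. A sketch of one plausible record shape (field names are my illustration; the actual dataset schema isn't shown in this post):

```typescript
// Hypothetical instruction-tuning record shape; the actual JarvisX
// dataset schema may differ.
interface TrainingSample {
  instruction: string; // e.g. the error message or design question
  input: string;       // surrounding code or context, may be empty
  output: string;      // the desired answer (fix, schema, explanation)
}

function toJsonlLine(sample: TrainingSample): string {
  // One JSON object per line: the standard JSONL format most
  // fine-tuning scripts consume.
  return JSON.stringify(sample);
}

const sample: TrainingSample = {
  instruction: "Explain this error and propose a fix",
  input: "TypeError: Cannot read properties of undefined (reading 'map')",
  output: "The array is undefined at render time; guard with items?.map(...) or default to [].",
};
console.log(toJsonlLine(sample));
```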
Training Configuration
# LoRA fine-tuning config (PEFT + Hugging Face Transformers)
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)
training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    optim="paged_adamw_32bit",
)
Training time: 4.5 hours on an A100 40GB (cloud instance). Cost: ~$12.
What Improved, What Didn't
✅ Improved: TypeScript/Next.js code generation accuracy
✅ Improved: Prisma schema patterns (correct relations, index conventions)
✅ Improved: Architecture recommendations aligned to my style
❌ Didn't improve: Mathematical reasoning (expected — LoRA doesn't fix fundamentals)
❌ Didn't improve: Very long context tasks (model context window limitation)
Phase 3: Building the Core System (Month 3–5)
The Memory Engine
This is what makes JarvisX meaningfully different from other tools:
// Every interaction gets stored
async function storeInteraction(query: string, response: string, metadata: InteractionMeta) {
  const embedding = await embed(`${query}\n${response}`); // local embedding model
  await db.run(
    `INSERT INTO memories (content, embedding, project, file, timestamp, tags)
     VALUES (?, ?, ?, ?, ?, ?)`,
    [
      `Q: ${query}\nA: ${response}`,
      JSON.stringify(embedding),
      metadata.currentProject,
      metadata.currentFile,
      Date.now(),
      JSON.stringify(metadata.extractedTags),
    ]
  );
}

// Relevant memories retrieved for every new query
async function getRelevantContext(query: string): Promise<Memory[]> {
  // Serialize the same way the embeddings were stored
  const queryEmbedding = JSON.stringify(await embed(query));
  return db.all(
    `SELECT content, timestamp, project,
            vss_distance_l2(embedding, ?) as distance
     FROM memories
     WHERE vss_distance_l2(embedding, ?) < 0.8
     ORDER BY distance ASC
     LIMIT 5`,
    [queryEmbedding, queryEmbedding]
  );
}
After 2 months of daily use, JarvisX has ~3,400 stored memories. The quality of responses improved dramatically as the memory filled up.
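The `vss_distance_l2` threshold of 0.8 in the query above is just an L2-distance cutoff between embedding vectors. A plain-TypeScript illustration of that metric, independent of the SQLite extension:

```typescript
// Plain-TypeScript illustration of the L2 distance used for retrieval.
// (The real query computes this inside SQLite via a vector extension.)
function l2Distance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Memories closer than the threshold are considered relevant.
function isRelevant(queryVec: number[], memoryVec: number[], threshold = 0.8): boolean {
  return l2Distance(queryVec, memoryVec) < threshold;
}

console.log(l2Distance([0, 0], [3, 4])); // 5
console.log(isRelevant([1, 0, 0], [0.9, 0.1, 0])); // true
```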
The Model Router
async function selectModel(query: string, context: QueryContext): Promise<ModelConfig> {
  if (!context.isOnline) return LOCAL_MODELS.primary;
  if (context.hasPrivateCode) return LOCAL_MODELS.primary;   // Private code: never leaves the machine
  const complexity = await assessComplexity(query);
  // Routing logic based on real usage patterns
  if (context.currentFile && complexity < 0.5) return LOCAL_MODELS.primary; // Code tasks: local
  if (query.length < 100) return LOCAL_MODELS.primary;       // Short queries: local
  if (complexity > 0.75) return CLOUD_MODELS.best;           // Complex: cloud
  return LOCAL_MODELS.primary;                               // Default: local
}
In practice, ~82% of queries go to the local model. Cloud is triggered primarily for complex architecture questions and cross-project reasoning.
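`assessComplexity` isn't shown above. A plausible cheap stand-in (my sketch, not the actual implementation) scores a query on length and signal words, so routing doesn't itself require a model call:

```typescript
// Hypothetical stand-in for assessComplexity(): a cheap lexical heuristic
// returning a score in [0, 1]. The real implementation may use a model.
const COMPLEX_SIGNALS = ["architecture", "trade-off", "design", "scale", "migrate", "compare"];

function assessComplexityHeuristic(query: string): number {
  const lower = query.toLowerCase();
  let score = Math.min(query.length / 1000, 0.4); // long queries trend complex
  for (const word of COMPLEX_SIGNALS) {
    if (lower.includes(word)) score += 0.2;
  }
  return Math.min(score, 1);
}

console.log(assessComplexityHeuristic("rename this variable") < 0.5); // true
console.log(assessComplexityHeuristic("compare two architecture designs and their trade-offs") > 0.5); // true
```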
Phase 4: The VS Code Extension (Month 4–6)
The extension itself is fairly lightweight — it's primarily a UI layer:
Features Built
- Sidebar chat panel — persistent conversation with full history
- Inline ask (Ctrl+Shift+J) — asks JarvisX about selected code
- Error explainer — right-click on a red squiggle → JarvisX explains it
- Test generator — right-click on function → generate tests
- Refactor assistant — suggests refactoring opportunities in current file
- Terminal error capture — auto-adds terminal errors to context
Context Collection
async function buildQueryContext(): Promise<QueryContext> {
  const editor = vscode.window.activeTextEditor;
  return {
    // Direct context
    selectedCode: getSelectedCode(editor),
    currentFile: editor?.document.fileName,
    language: editor?.document.languageId,
    // Extended context
    openFiles: vscode.workspace.textDocuments.map(d => d.fileName),
    recentEdits: await getRecentEdits(5), // Last 5 file changes
    diagnostics: getActiveDiagnostics(),  // Current lint/type errors
    gitStatus: await getGitStatus(),      // Changed files, branch name
    // Project context
    packageJson: await readProjectPackageJson(),
    tsConfig: await readTsConfig(),
    // Connectivity
    isOnline: await checkConnectivity(),
  };
}
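All of that context can easily exceed the model's prompt budget, so it has to be trimmed before it reaches the model. A hypothetical sketch of priority-based trimming (the budget, labels, and priorities are my illustration, not JarvisX's actual logic):

```typescript
// Hypothetical context-budget trimmer: pieces are added in priority order
// until a character budget is reached. Budget and priorities are illustrative.
interface ContextPiece { label: string; text: string; priority: number; }

function fitToBudget(pieces: ContextPiece[], budgetChars: number): string {
  const sorted = [...pieces].sort((a, b) => a.priority - b.priority); // lower = more important
  let out = "";
  for (const p of sorted) {
    const chunk = `## ${p.label}\n${p.text}\n`;
    if (out.length + chunk.length > budgetChars) break; // drop lower-priority context first
    out += chunk;
  }
  return out;
}

const prompt = fitToBudget(
  [
    { label: "Selected code", text: "const x = 1;", priority: 0 },
    { label: "Open files", text: "a.ts, b.ts", priority: 2 },
    { label: "Diagnostics", text: "TS2304: Cannot find name 'y'", priority: 1 },
  ],
  120
);
console.log(prompt.startsWith("## Selected code")); // true
```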
Phase 5: Deployment & Daily Use (Month 6–8)
Deployment Architecture
Developer Machine:
├── VS Code Extension (installed globally)
├── JarvisX Server (Node.js daemon, auto-starts)
│ ├── localhost:3721 (HTTP API)
│ └── SQLite DB (~/.jarvisx/memory.db)
└── Ollama (daemon, auto-starts)
├── localhost:11434
└── Models (~/.ollama/models/)
Optional Cloud:
└── OpenAI API (gpt-4o) for complex queries
Usage Stats (First 3 Months Daily Use)
| Metric | Value |
|--------|-------|
| Total queries | 2,847 |
| Local model queries | 2,334 (82%) |
| Cloud model queries | 513 (18%) |
| Avg response time (local) | 1.8s |
| Avg response time (cloud) | 2.4s |
| Total cloud cost | $8.23 |
| Memories stored | 3,412 |
| Projects in context | 7 |
Total cloud cost of $8.23 over 3 months for 2,847 AI-assisted interactions is significantly cheaper than any subscription tool.
Hard Lessons Learned
1. The dataset was everything
I had to retrain twice because of data quality issues. The second run had stricter deduplication and better formatting — results improved dramatically. Spend 60% of your time on data.
2. Memory retrieval needs aggressive curation
After 6 months, some memories became "stale" (referred to old patterns I'd moved away from). I added a recency penalty and a manual "forget" command to keep memory quality high.
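The recency penalty can be as simple as scaling a memory's retrieval distance by its age, so fresh patterns win ties against stale ones. A sketch with an illustrative 30-day half-life (not necessarily the constant JarvisX uses):

```typescript
// Hypothetical recency penalty: older memories get a worse (larger)
// effective distance. The 30-day half-life is illustrative.
const HALF_LIFE_DAYS = 30;

function penalizedDistance(distance: number, timestampMs: number, nowMs: number): number {
  const ageDays = (nowMs - timestampMs) / (1000 * 60 * 60 * 24);
  const penalty = Math.pow(2, ageDays / HALF_LIFE_DAYS); // doubles every half-life
  return distance * penalty;
}

const now = Date.now();
const fresh = penalizedDistance(0.5, now, now);                   // 0.5
const stale = penalizedDistance(0.5, now - 60 * 86_400_000, now); // 0.5 * 4 = 2
console.log(fresh < stale); // true
```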
3. Local models cold-start kills UX
Ollama takes 2–4 seconds to load a model on first use. I added a background warm-up on VS Code startup. Now the first real request feels instant.
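The warm-up itself is cheap: Ollama's documented `/api/generate` endpoint loads a model into memory when given an empty prompt, and `keep_alive` controls how long it stays resident. A sketch (model name and keep-alive duration are illustrative):

```typescript
// Warm-up sketch: an empty /api/generate request makes Ollama load the
// model without generating anything. keep_alive keeps it resident.
function buildWarmupPayload(model: string, keepAlive = "30m") {
  return { model, prompt: "", keep_alive: keepAlive };
}

async function warmUpOllama(model: string): Promise<void> {
  // Fire-and-forget on extension activation; errors are non-fatal.
  await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildWarmupPayload(model)),
  }).catch(() => { /* Ollama not running yet: ignore */ });
}

console.log(JSON.stringify(buildWarmupPayload("mistral")));
```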
4. Don't underestimate the extension UX
I built the functionality in 3 weeks. Getting the UX right took 6 more weeks. Streaming responses, scroll behavior, code block formatting, and keyboard shortcuts all matter enormously.
5. The memory engine became the product
Initially I thought the fine-tuned model was the key differentiator. After using JarvisX daily, the memory engine is what I'd miss most if it were taken away. Persistent context is the real moat.
What's Next
- Agent mode — let JarvisX autonomously execute multi-step tasks (read file, make change, run tests, commit)
- Team mode — shared memory across a team, with attribution
- Plugin system — let others build nodes for JarvisX's tool engine
Code and Resources
This is part 1 of the JarvisX series. Next: Deep-diving into the memory engine design.