February 19, 2026 · 16 min
JarvisX V2: From Fine-Tuning to Cloud + Local Deployment — A Full Case Study
A complete engineering case study: the problem, the architecture, the fine-tuning process, the VS Code extension, and hard lessons learned building JarvisX V2 — my personal AI development assistant.
TL;DR: JarvisX is my personal AI-powered development assistant. This is the complete story — from the problem that motivated it, to the architecture I built, the fine-tuning I did, and the real lessons across 8 months of building and using it daily.
The Problem I Was Solving
By mid-2024, I was using several AI tools simultaneously:
- ChatGPT for general questions
- GitHub Copilot for inline completions
- Claude for architecture discussions
Every tool lacked the same thing: context about my work. Each conversation started from zero. I was constantly explaining my tech stack, my project architecture, my coding conventions.
Worse, sensitive project code was going to cloud servers I didn't control.
I wanted one tool that:
- Knew my projects, tech stack, and conventions without being told
- Worked offline on restricted networks
- Integrated directly into VS Code without context-switching
- Could be fine-tuned to my specific domain
Phase 1: Defining the Architecture (Month 1–2)
Before writing code, I spent two weeks designing the system carefully. The decisions at this stage shaped everything that followed.
Core Architectural Decisions
Decision 1: Hybrid inference (local + cloud)
Local models for 80% of tasks, cloud models for complex architecture questions. This kept privacy high and cost low.
Decision 2: Persistent memory with vector embeddings
Every conversation, code snippet, and project decision gets embedded and stored in a local vector database. JarvisX retrieves relevant memories for every new query.
Decision 3: VS Code as the primary interface
Not a browser tab, not a separate app — embedded directly in the editor where the work happens.
Decision 4: Node.js server as the orchestrator
A local Node.js server mediates between the VS Code extension, the local models (Ollama), and optional cloud APIs. This keeps the extension simple and the intelligence server-side.
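To make Decision 4 concrete, here is a minimal sketch of what the extension-to-server contract could look like (the types, field names, and validator here are illustrative, not the actual JarvisX API):

```typescript
// Hypothetical sketch of the orchestrator's request/response contract.
// Field names are illustrative, not the real JarvisX API.
interface AskRequest {
  query: string;
  context: {
    currentFile?: string;
    selectedCode?: string;
    isOnline: boolean;
  };
}

interface AskResponse {
  answer: string;
  backend: "local" | "cloud"; // which model served the request
}

// The extension only validates and forwards; all intelligence is server-side.
function validateAskRequest(body: unknown): body is AskRequest {
  if (typeof body !== "object" || body === null) return false;
  const b = body as Record<string, unknown>;
  return typeof b.query === "string" && typeof b.context === "object" && b.context !== null;
}

console.log(validateAskRequest({ query: "explain this", context: { isOnline: true } })); // true
console.log(validateAskRequest({ context: {} })); // false
```

Keeping the extension to "validate and forward" is what made it possible to swap models and routing logic later without touching the VS Code side.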
Phase 2: The Fine-tuning Journey (Month 2–4)
Off-the-shelf Mistral-7B was good but not trained on the specific patterns I use daily. Fine-tuning fixed this.
Dataset Collection
I collected 18,000 training samples across:
- TypeScript/Next.js code patterns from my active projects
- Prisma schema examples with explanations
- System design Q&A pairs (generated via GPT-4o as a labeler)
- Architecture decision records (ADRs) I'd written
- Error analysis pairs (error → root cause → fix)
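For fine-tuning, samples like these end up serialized as instruction-style JSONL. A sketch of one plausible record shape (field names are my illustration; the actual dataset schema isn't shown in this post):

```typescript
// Hypothetical instruction-tuning record shape; the actual JarvisX
// dataset schema may differ.
interface TrainingSample {
  instruction: string; // e.g. the error message or design question
  input: string;       // surrounding code or context, may be empty
  output: string;      // the desired answer (fix, schema, explanation)
}

function toJsonlLine(sample: TrainingSample): string {
  // One JSON object per line: the standard JSONL format most
  // fine-tuning scripts consume.
  return JSON.stringify(sample);
}

const sample: TrainingSample = {
  instruction: "Explain this error and propose a fix",
  input: "TypeError: Cannot read properties of undefined (reading 'map')",
  output: "The array is undefined at render time; guard with items?.map(...) or default to [].",
};
console.log(toJsonlLine(sample));
```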
Training Configuration
# LoRA fine-tuning config (PEFT + Hugging Face Transformers)
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)
training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    optim="paged_adamw_32bit",
)
Training time: 4.5 hours on an A100 40GB (cloud instance). Cost: ~$12.
What Improved, What Didn't
✅ Improved: TypeScript/Next.js code generation accuracy
✅ Improved: Prisma schema patterns (correct relations, index conventions)
✅ Improved: Architecture recommendations aligned to my style
❌ Didn't improve: Mathematical reasoning (expected — LoRA doesn't fix fundamentals)
❌ Didn't improve: Very long context tasks (model context window limitation)
Phase 3: Building the Core System (Month 3–5)
The Memory Engine
This is what makes JarvisX meaningfully different from other tools:
// Every interaction gets stored
async function storeInteraction(query: string, response: string, metadata: InteractionMeta) {
  const embedding = await embed(`${query}\n${response}`); // local embedding model
  await db.run(
    `INSERT INTO memories (content, embedding, project, file, timestamp, tags)
     VALUES (?, ?, ?, ?, ?, ?)`,
    [
      `Q: ${query}\nA: ${response}`,
      JSON.stringify(embedding),
      metadata.currentProject,
      metadata.currentFile,
      Date.now(),
      JSON.stringify(metadata.extractedTags),
    ]
  );
}

// Relevant memories retrieved for every new query
async function getRelevantContext(query: string): Promise<Memory[]> {
  // Serialize the same way the embeddings were stored
  const queryEmbedding = JSON.stringify(await embed(query));
  return db.all(
    `SELECT content, timestamp, project,
            vss_distance_l2(embedding, ?) as distance
     FROM memories
     WHERE vss_distance_l2(embedding, ?) < 0.8
     ORDER BY distance ASC
     LIMIT 5`,
    [queryEmbedding, queryEmbedding]
  );
}
After 2 months of daily use, JarvisX has ~3,400 stored memories. The quality of responses improved dramatically as the memory filled up.
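The `vss_distance_l2` threshold of 0.8 in the query above is just an L2-distance cutoff between embedding vectors. A plain-TypeScript illustration of that metric, independent of the SQLite extension:

```typescript
// Plain-TypeScript illustration of the L2 distance used for retrieval.
// (The real query computes this inside SQLite via a vector extension.)
function l2Distance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Memories closer than the threshold are considered relevant.
function isRelevant(queryVec: number[], memoryVec: number[], threshold = 0.8): boolean {
  return l2Distance(queryVec, memoryVec) < threshold;
}

console.log(l2Distance([0, 0], [3, 4])); // 5
console.log(isRelevant([1, 0, 0], [0.9, 0.1, 0])); // true
```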
The Model Router
async function selectModel(query: string, context: QueryContext): Promise<ModelConfig> {
  if (!context.isOnline) return LOCAL_MODELS.primary;
  if (context.hasPrivateCode) return LOCAL_MODELS.primary;   // Private code: never leaves the machine
  const complexity = await assessComplexity(query);
  // Routing logic based on real usage patterns
  if (context.currentFile && complexity < 0.5) return LOCAL_MODELS.primary; // Code tasks: local
  if (query.length < 100) return LOCAL_MODELS.primary;       // Short queries: local
  if (complexity > 0.75) return CLOUD_MODELS.best;           // Complex: cloud
  return LOCAL_MODELS.primary;                               // Default: local
}
In practice, ~82% of queries go to the local model. Cloud is triggered primarily for complex architecture questions and cross-project reasoning.
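`assessComplexity` isn't shown above. A plausible cheap stand-in (my sketch, not the actual implementation) scores a query on length and signal words, so routing doesn't itself require a model call:

```typescript
// Hypothetical stand-in for assessComplexity(): a cheap lexical heuristic
// returning a score in [0, 1]. The real implementation may use a model.
const COMPLEX_SIGNALS = ["architecture", "trade-off", "design", "scale", "migrate", "compare"];

function assessComplexityHeuristic(query: string): number {
  const lower = query.toLowerCase();
  let score = Math.min(query.length / 1000, 0.4); // long queries trend complex
  for (const word of COMPLEX_SIGNALS) {
    if (lower.includes(word)) score += 0.2;
  }
  return Math.min(score, 1);
}

console.log(assessComplexityHeuristic("rename this variable") < 0.5); // true
console.log(assessComplexityHeuristic("compare two architecture designs and their trade-offs") > 0.5); // true
```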
Phase 4: The VS Code Extension (Month 4–6)
The extension itself is fairly lightweight — it's primarily a UI layer:
Features Built
- Sidebar chat panel — persistent conversation with full history
- Inline ask (Ctrl+Shift+J) — asks JarvisX about selected code
- Error explainer — right-click on a red squiggle → JarvisX explains it
- Test generator — right-click on function → generate tests
- Refactor assistant — suggests refactoring opportunities in current file
- Terminal error capture — auto-adds terminal errors to context
Context Collection
async function buildQueryContext(): Promise<QueryContext> {
  const editor = vscode.window.activeTextEditor;
  return {
    // Direct context
    selectedCode: getSelectedCode(editor),
    currentFile: editor?.document.fileName,
    language: editor?.document.languageId,
    // Extended context
    openFiles: vscode.workspace.textDocuments.map(d => d.fileName),
    recentEdits: await getRecentEdits(5), // Last 5 file changes
    diagnostics: getActiveDiagnostics(),  // Current lint/type errors
    gitStatus: await getGitStatus(),      // Changed files, branch name
    // Project context
    packageJson: await readProjectPackageJson(),
    tsConfig: await readTsConfig(),
    // Connectivity
    isOnline: await checkConnectivity(),
  };
}
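All of that context can easily exceed the model's prompt budget, so it has to be trimmed before it reaches the model. A hypothetical sketch of priority-based trimming (the budget, labels, and priorities are my illustration, not JarvisX's actual logic):

```typescript
// Hypothetical context-budget trimmer: pieces are added in priority order
// until a character budget is reached. Budget and priorities are illustrative.
interface ContextPiece { label: string; text: string; priority: number; }

function fitToBudget(pieces: ContextPiece[], budgetChars: number): string {
  const sorted = [...pieces].sort((a, b) => a.priority - b.priority); // lower = more important
  let out = "";
  for (const p of sorted) {
    const chunk = `## ${p.label}\n${p.text}\n`;
    if (out.length + chunk.length > budgetChars) break; // drop lower-priority context first
    out += chunk;
  }
  return out;
}

const prompt = fitToBudget(
  [
    { label: "Selected code", text: "const x = 1;", priority: 0 },
    { label: "Open files", text: "a.ts, b.ts", priority: 2 },
    { label: "Diagnostics", text: "TS2304: Cannot find name 'y'", priority: 1 },
  ],
  120
);
console.log(prompt.startsWith("## Selected code")); // true
```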
Phase 5: Deployment & Daily Use (Month 6–8)
Deployment Architecture
Developer Machine:
├── VS Code Extension (installed globally)
├── JarvisX Server (Node.js daemon, auto-starts)
│ ├── localhost:3721 (HTTP API)
│ └── SQLite DB (~/.jarvisx/memory.db)
└── Ollama (daemon, auto-starts)
├── localhost:11434
└── Models (~/.ollama/models/)
Optional Cloud:
└── OpenAI API (gpt-4o) for complex queries
Usage Stats (First 3 Months Daily Use)
| Metric | Value |
|--------|-------|
| Total queries | 2,847 |
| Local model queries | 2,334 (82%) |
| Cloud model queries | 513 (18%) |
| Avg response time (local) | 1.8s |
| Avg response time (cloud) | 2.4s |
| Total cloud cost | $8.23 |
| Memories stored | 3,412 |
| Projects in context | 7 |
Total cloud cost of $8.23 over 3 months for 2,847 AI-assisted interactions is significantly cheaper than any subscription tool.
Hard Lessons Learned
1. The dataset was everything
I had to retrain twice because of data quality issues. The second run had stricter deduplication and better formatting — results improved dramatically. Spend 60% of your time on data.
2. Memory retrieval needs aggressive curation
After 6 months, some memories became "stale" (referred to old patterns I'd moved away from). I added a recency penalty and a manual "forget" command to keep memory quality high.
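The recency penalty can be as simple as scaling a memory's retrieval distance by its age, so fresh patterns win ties against stale ones. A sketch with an illustrative 30-day half-life (not necessarily the constant JarvisX uses):

```typescript
// Hypothetical recency penalty: older memories get a worse (larger)
// effective distance. The 30-day half-life is illustrative.
const HALF_LIFE_DAYS = 30;

function penalizedDistance(distance: number, timestampMs: number, nowMs: number): number {
  const ageDays = (nowMs - timestampMs) / (1000 * 60 * 60 * 24);
  const penalty = Math.pow(2, ageDays / HALF_LIFE_DAYS); // doubles every half-life
  return distance * penalty;
}

const now = Date.now();
const fresh = penalizedDistance(0.5, now, now);                   // 0.5
const stale = penalizedDistance(0.5, now - 60 * 86_400_000, now); // 0.5 * 4 = 2
console.log(fresh < stale); // true
```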
3. Local models cold-start kills UX
Ollama takes 2–4 seconds to load a model on first use. I added a background warm-up on VS Code startup. Now the first real request feels instant.
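The warm-up itself is cheap: Ollama's documented `/api/generate` endpoint loads a model into memory when given an empty prompt, and `keep_alive` controls how long it stays resident. A sketch (model name and keep-alive duration are illustrative):

```typescript
// Warm-up sketch: an empty /api/generate request makes Ollama load the
// model without generating anything. keep_alive keeps it resident.
function buildWarmupPayload(model: string, keepAlive = "30m") {
  return { model, prompt: "", keep_alive: keepAlive };
}

async function warmUpOllama(model: string): Promise<void> {
  // Fire-and-forget on extension activation; errors are non-fatal.
  await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildWarmupPayload(model)),
  }).catch(() => { /* Ollama not running yet: ignore */ });
}

console.log(JSON.stringify(buildWarmupPayload("mistral")));
```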
4. Don't underestimate the extension UX
I built the functionality in 3 weeks. Getting the UX right took 6 more weeks. Streaming responses, scroll behavior, code block formatting, and keyboard shortcuts all matter enormously.
5. The memory engine became the product
Initially I thought the fine-tuned model was the key differentiator. After using JarvisX daily, the memory engine is what I'd miss most if it were taken away. Persistent context is the real moat.
What's Next
- Agent mode — let JarvisX autonomously execute multi-step tasks (read file, make change, run tests, commit)
- Team mode — shared memory across a team, with attribution
- Plugin system — let others build nodes for JarvisX's tool engine
Code and Resources
This is part 1 of the JarvisX series. Next: Deep-diving into the memory engine design.