February 19, 2026 · 14 min

Building a Multi-Mode AI Assistant: Architecture and Lessons Learned

How I architected JarvisX V2 — a multi-mode AI assistant that switches between cloud and local inference, maintains persistent memory, and integrates directly into the development workflow.

TL;DR: JarvisX is my personal AI development assistant. It runs locally, integrates with VS Code, remembers your project context, and can switch between cloud and local models. This post walks through the full architecture, the decisions behind it, and the hard lessons I learned along the way.


Why Build Another AI Coding Assistant?

By the time I started JarvisX, GitHub Copilot and Cursor already existed. But I had specific problems they couldn't solve:

  • No persistent memory — every chat started fresh, so I constantly re-explained my project
  • Cloud dependency — I couldn't use them on restricted networks or offline
  • No workflow integration — they couldn't read my terminal output, inspect my Figma files, or understand my project architecture

So I built my own.


System Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    JarvisX System                            │
│                                                             │
│  ┌─────────────┐    ┌──────────────────┐    ┌───────────┐  │
│  │  VS Code    │    │   Orchestrator   │    │  Memory   │  │
│  │  Extension  │◄──►│    (Node.js)     │◄──►│  Engine   │  │
│  └─────────────┘    └──────┬───────────┘    │ (SQLite + │  │
│                            │                │  Vectors) │  │
│  ┌─────────────┐           │                └───────────┘  │
│  │   Figma     │           ▼                               │
│  │   Plugin    │    ┌──────────────┐    ┌───────────────┐  │
│  └─────────────┘    │ Model Router │    │  Tool Engine  │  │
│                     └──────┬───────┘    └───────────────┘  │
│                            │                               │
│              ┌─────────────┼─────────────┐                 │
│              ▼             ▼             ▼                  │
│        ┌──────────┐ ┌──────────┐ ┌──────────┐             │
│        │  Ollama  │ │ llama.cpp│ │ OpenAI   │             │
│        │  Local   │ │  Local   │ │  Cloud   │             │
│        └──────────┘ └──────────┘ └──────────┘             │
└─────────────────────────────────────────────────────────────┘

There are five core layers:

  1. Interface Layer — VS Code extension + Figma plugin as entry points
  2. Orchestrator — The brain: receives requests, decides routing, manages context
  3. Model Router — Selects the right model and inference backend
  4. Memory Engine — Persistent context across sessions
  5. Tool Engine — Executes actions (file read, terminal, search)
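To make the layer boundaries concrete, here is a sketch of the contracts between them as TypeScript interfaces. These names and shapes are my paraphrase of the description above, not the actual JarvisX source:

```typescript
// Hypothetical layer contracts, paraphrased from the architecture description
// (not the actual JarvisX source).
interface AIRequest {
  query: string;
  userPreference?: 'local' | 'cloud';
}

interface AIResponse {
  content: string;
}

// Memory Engine: persistent context across sessions
interface MemoryLayer {
  retrieve(query: string, opts: { limit: number; threshold: number }): Promise<string[]>;
  store(entry: { query: string; response: string; timestamp: number }): Promise<void>;
}

// Model Router: picks a model and inference backend per request
interface RouterLayer {
  select(criteria: { complexity: number; networkAvailable: boolean }): Promise<{ id: string }>;
}

// Tool Engine: executes actions (file read, terminal, search)
interface ToolLayer {
  getAvailableTools(): string[];
  execute(response: AIResponse): Promise<AIResponse>;
}
```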

The Orchestrator: Request Lifecycle

Every request from the VS Code extension follows this path:

// Simplified orchestrator flow
async function handleRequest(request: AIRequest): Promise<AIResponse> {
  // 1. Retrieve relevant memory
  const context = await memoryEngine.retrieve(request.query, {
    limit: 5,
    threshold: 0.75
  });

  // 2. Classify the request type
  const requestType = await classifyRequest(request.query);
  // returns: 'code' | 'architecture' | 'explanation' | 'generation' | 'tool_use'

  // 3. Route to appropriate model
  const model = await modelRouter.select({
    requestType,
    complexity: estimateComplexity(request.query),
    networkAvailable: await checkNetwork(),
    preferredMode: request.userPreference
  });

  // 4. Build prompt with context
  const prompt = promptEngine.build({
    systemContext: context,
    userQuery: request.query,
    tools: toolEngine.getAvailableTools(),
    model: model.id
  });

  // 5. Run inference
  const response = await model.infer(prompt);

  // 6. Execute any tool calls in the response
  const finalResponse = await toolEngine.execute(response);

  // 7. Store in memory
  await memoryEngine.store({
    query: request.query,
    response: finalResponse.content,
    timestamp: Date.now()
  });

  return finalResponse;
}
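The flow above calls estimateComplexity without showing it. As a rough illustration of what such a heuristic could look like (my assumption, not the actual implementation), a score in [0, 1] can be blended from query length and architecture-flavored keywords:

```typescript
// Hypothetical complexity heuristic: 0 = trivial completion, 1 = deep design question.
// A guess at what estimateComplexity could look like, not the JarvisX source.
function estimateComplexity(query: string): number {
  const words = query.trim().split(/\s+/).length;
  const lengthScore = Math.min(words / 100, 1); // long prompts tend to be harder
  const heavyKeywords = ['architecture', 'refactor', 'design', 'trade-off', 'migrate', 'why'];
  const hits = heavyKeywords.filter(k => query.toLowerCase().includes(k)).length;
  const keywordScore = Math.min(hits / 3, 1);
  // Blend: keywords dominate, length breaks ties
  return Math.min(0.7 * keywordScore + 0.3 * lengthScore, 1);
}
```

With the router thresholds below (local under 0.4, cloud above 0.75), a score like this would send quick fixes to a local model and design questions to the cloud.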

The Model Router: Intelligent Mode Switching

This is the part I'm most proud of. JarvisX doesn't use a single model — it dynamically switches based on:

| Signal | Local Model | Cloud Model |
|--------|-------------|-------------|
| Network unavailable | ✅ | ❌ |
| Quick code completion | ✅ | ❌ |
| Complex architecture question | ❌ | ✅ |
| User explicitly requests offline | ✅ | ❌ |
| Sensitive/private code | ✅ | ❌ |

class ModelRouter {
  private localModels = ['mistral-engineering:q4', 'codellama:7b'];
  private cloudModels = ['gpt-4o', 'claude-3-5-sonnet'];

  async select(criteria: RoutingCriteria): Promise<Model> {
    // Hard constraints
    if (!criteria.networkAvailable) return this.getBestLocal();
    if (criteria.preferredMode === 'local') return this.getBestLocal();
    if (criteria.containsSensitiveData) return this.getBestLocal();

    // Soft routing by complexity
    if (criteria.complexity < 0.4) return this.getBestLocal();
    if (criteria.complexity > 0.75) return this.getBestCloud();

    // Default: local for speed
    return this.getBestLocal();
  }
}
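One gap worth noting in the selection logic: even after the router picks a cloud model, the actual API call can still fail mid-request. A hedged sketch of a fallback wrapper that degrades to local on failure (a hypothetical helper, not shown in the post):

```typescript
// Hypothetical fallback: prefer the routed model, degrade to a local one on failure.
async function inferWithFallback(
  primary: { id: string; infer(prompt: string): Promise<string> },
  fallback: { id: string; infer(prompt: string): Promise<string> },
  prompt: string
): Promise<{ modelId: string; output: string }> {
  try {
    return { modelId: primary.id, output: await primary.infer(prompt) };
  } catch {
    // Cloud call failed (network blip, rate limit): retry locally instead of erroring out
    return { modelId: fallback.id, output: await fallback.infer(prompt) };
  }
}
```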

Memory Engine: Persistent Context

This was the biggest differentiator from existing tools. The memory engine has two layers:

Short-term Memory (Session Context)

Simple in-memory array of recent messages, cleared between sessions.
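A minimal sketch of what that session buffer could look like, assuming a fixed-size ring of recent messages (my own illustration):

```typescript
// Hypothetical session memory: keeps only the last `capacity` messages,
// discarded when the session ends.
class SessionMemory {
  private messages: string[] = [];
  constructor(private capacity = 20) {}

  add(message: string): void {
    this.messages.push(message);
    if (this.messages.length > this.capacity) this.messages.shift(); // evict oldest
  }

  recent(): string[] {
    return [...this.messages]; // copy so callers can't mutate the buffer
  }

  clear(): void {
    this.messages = []; // called when the session ends
  }
}
```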

Long-term Memory (Vector Store)

I used SQLite with the sqlite-vss extension for vector similarity search — keeping everything local without needing a vector database server.

// Storing a memory
async store(entry: MemoryEntry): Promise<void> {
  const embedding = await this.embedder.embed(entry.content);

  await db.run(`
    INSERT INTO memories (content, embedding, timestamp, session_id, tags)
    VALUES (?, ?, ?, ?, ?)
  `, [
    entry.content,
    JSON.stringify(embedding),  // sqlite-vss accepts vectors as JSON arrays
    entry.timestamp,
    entry.sessionId,
    JSON.stringify(entry.tags)  // arrays must be serialized for SQLite
  ]);
}

// Retrieving relevant memories
async retrieve(query: string, options: RetrievalOptions): Promise<MemoryEntry[]> {
  // Serialize the query vector the same way it was stored
  const queryEmbedding = JSON.stringify(await this.embedder.embed(query));

  // Note: options.threshold is an L2 distance bound here, not a similarity score
  return db.all(`
    SELECT content, timestamp,
           vss_distance_l2(embedding, ?) AS distance
    FROM memories
    WHERE vss_distance_l2(embedding, ?) < ?
    ORDER BY distance ASC
    LIMIT ?
  `, [queryEmbedding, queryEmbedding, options.threshold, options.limit]);
}

The VS Code Extension

The extension is the primary interface. Key features:

  • Inline code suggestions — triggered by Ctrl+Shift+J
  • Chat panel — persistent sidebar conversation
  • Context injection — automatically includes current file and selected code
  • Terminal listening — captures error output and adds it to context automatically

// Auto-capturing terminal output for error context.
// Note: onDidWriteTerminalData is a proposed VS Code API and requires
// "enabledApiProposals" in the extension manifest.
vscode.window.onDidWriteTerminalData((event) => {
  if (event.data.includes('Error:') || event.data.includes('TypeError:')) {
    contextManager.addTemporary({
      type: 'terminal_error',
      content: event.data,
      ttl: 60_000 // keep for 60 seconds
    });
  }
});

The Figma Integration

The Figma plugin was an unexpected addition. It reads the current Figma frame and sends the component tree + visual hierarchy to JarvisX, which then generates:

  • React component structure
  • Tailwind CSS class suggestions
  • Accessibility annotations
  • Component naming based on design patterns
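As an illustration of the naming step, a pass over a Figma-style node tree might look like the sketch below. The node shape here is a simplified stand-in; the real plugin consumes Figma's own SceneNode types:

```typescript
// Hypothetical Figma-like node; the real plugin reads Figma's SceneNode types.
interface DesignNode {
  name: string; // e.g. "card / product" as named in Figma
  type: 'FRAME' | 'TEXT' | 'RECTANGLE';
  children?: DesignNode[];
}

// Turn a layer name like "card / product" into a React-style component name
function toComponentName(layerName: string): string {
  return layerName
    .split(/[^a-zA-Z0-9]+/)
    .filter(Boolean)
    .map(w => w[0].toUpperCase() + w.slice(1).toLowerCase())
    .join('');
}

// Walk the tree and collect a component name for each frame
function componentNames(node: DesignNode, out: string[] = []): string[] {
  if (node.type === 'FRAME') out.push(toComponentName(node.name));
  for (const child of node.children ?? []) componentNames(child, out);
  return out;
}
```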

Lessons Learned

1. Context Window is Sacred

I initially stuffed everything into the prompt, but an 8K-token context window fills up fast. I learned to:

  • Score memories by relevance, not recency
  • Summarize older context with a small model before including it
  • Keep tool outputs terse
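The first two bullets amount to budgeted packing: sort candidates by relevance, then include them until a token budget runs out. A sketch of that idea, where the rough 4-characters-per-token estimate is my assumption:

```typescript
interface ScoredMemory { content: string; relevance: number } // relevance in [0, 1]

// Crude token estimate: roughly 4 characters per token for English text (assumption)
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

// Pack the most relevant memories first, stopping when the budget is spent
function packContext(memories: ScoredMemory[], tokenBudget: number): string[] {
  const picked: string[] = [];
  let used = 0;
  for (const m of [...memories].sort((a, b) => b.relevance - a.relevance)) {
    const cost = estimateTokens(m.content);
    if (used + cost > tokenBudget) continue; // skip; a smaller memory may still fit
    picked.push(m.content);
    used += cost;
  }
  return picked;
}
```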

2. Mode Switching UX Matters

Users need to know which model is responding. I added an indicator in the VS Code status bar — a minor change that eliminated a lot of confusion.

3. Local Models Need Warming

llama.cpp takes 2–4 seconds to load a model on first inference. I added a background warm-up on VS Code startup to make the first real request feel instant.
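The warm-up itself can be as simple as firing a throwaway prompt in the background on activation, so the backend loads its weights before the first real request. A hedged sketch (hypothetical helper, not the JarvisX code):

```typescript
// Hypothetical warm-up: a throwaway inference forces the backend to load weights.
async function warmUp(infer: (prompt: string) => Promise<string>): Promise<boolean> {
  try {
    await infer('ping'); // tiny prompt; the response is discarded
    return true;
  } catch {
    return false; // backend not up yet; the first real request pays the load cost
  }
}
```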

4. Tool Calls Break Often

I spent more time making tool execution reliable than on model integration. Structured JSON output from LLMs is still unreliable — always validate it and keep a fallback parser.
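A hedged sketch of the "validate and fall back" idea: try strict JSON.parse first, then salvage the first balanced {...} block from a chatty response. This is an illustration only, not the JarvisX parser:

```typescript
// Try strict parsing first; if the model wrapped JSON in prose,
// salvage the first balanced {...} block.
function parseToolCall(raw: string): Record<string, unknown> | null {
  try {
    return JSON.parse(raw);
  } catch { /* fall through to salvage */ }

  const start = raw.indexOf('{');
  if (start === -1) return null;
  let depth = 0;
  for (let i = start; i < raw.length; i++) {
    if (raw[i] === '{') depth++;
    else if (raw[i] === '}' && --depth === 0) {
      try {
        return JSON.parse(raw.slice(start, i + 1));
      } catch {
        return null; // balanced braces but still not valid JSON
      }
    }
  }
  return null; // braces never balanced
}
```

Note the brace-counting shortcut miscounts braces inside string values; a production parser needs to track quoting as well.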

5. Memory Retrieval Threshold Tuning

Too low a similarity threshold: irrelevant memories pollute context. Too high: nothing gets retrieved. Start at 0.75 cosine similarity and tune from there.
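One subtlety: this lesson is stated in cosine similarity, while the retrieval query earlier filters on L2 distance. For unit-normalized embeddings the two are related by d² = 2(1 − cos θ), so a 0.75 similarity floor corresponds to an L2 distance bound of about 0.707. A quick conversion helper (assumes normalized vectors):

```typescript
// For unit vectors: ||a - b||^2 = 2 - 2·cos(a, b), so distance = sqrt(2(1 - sim)).
// Only valid when embeddings are normalized to unit length.
function cosineSimToL2Distance(sim: number): number {
  return Math.sqrt(2 * (1 - sim));
}
```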


What's Next for JarvisX

  • Agent mode — fully autonomous multi-step task execution
  • GitHub integration — reading PR diffs and suggesting reviews
  • Voice interface — Whisper-based speech-to-text for hands-free coding

Explore the project: JarvisX on GitHub | Portfolio