
February 19, 2026 · 10 min

Offline AI Development with VS Code Extension and Local Inference

Building a fully offline AI-powered development environment using a custom VS Code extension, Ollama, and locally quantized models — zero cloud dependency, full privacy.

TL;DR: What if your entire AI development stack worked without internet? No API keys. No data leaving your machine. No rate limits. Here's how I built exactly that — and what I learned making it production-ready.


The Case for Offline AI

Modern AI coding tools are powerful, but they come with a hidden cost: everything you type goes to someone else's server.

For most developers, that's fine. But there are real scenarios where it isn't:

  • Working on proprietary codebases with NDAs
  • Traveling or in areas with poor connectivity
  • Security-conscious enterprise environments
  • Simply wanting full control over your data

When building JarvisX, I made offline-first a core design principle — not an afterthought.


The Stack

My offline AI development environment consists of four components:

┌─────────────────────────────────────────────────────┐
│                  Developer Machine                   │
│                                                     │
│  ┌──────────────────┐     ┌────────────────────┐   │
│  │  VS Code         │     │  JarvisX Server    │   │
│  │  Extension       │────►│  (Node.js:3721)    │   │
│  │  (TypeScript)    │     └────────┬───────────┘   │
│  └──────────────────┘              │               │
│                                    ▼               │
│                          ┌─────────────────┐       │
│                          │  Ollama Server  │       │
│                          │  (Port: 11434)  │       │
│                          └────────┬────────┘       │
│                                   │                │
│              ┌────────────────────┼──────┐         │
│              ▼                    ▼      ▼         │
│     ┌────────────────┐  ┌──────────┐  ┌────────┐  │
│     │ mistral-eng:q4 │  │codellama │  │nomic-  │  │
│     │ (7B params)    │  │ 7b:q4    │  │embed   │  │
│     └────────────────┘  └──────────┘  └────────┘  │
└─────────────────────────────────────────────────────┘

Setting Up Local Inference with Ollama

Ollama is the simplest way to run LLMs locally. It handles model management and quantization automatically, and it serves an OpenAI-compatible REST API.

# Install Ollama
brew install ollama

# Pull the models I use
ollama pull mistral
ollama pull codellama
ollama pull nomic-embed-text   # for embeddings

# Verify they're running
ollama list
# NAME                    ID              SIZE    MODIFIED
# mistral:latest          61e88e884507    4.1 GB  2 hours ago
# codellama:latest        8fdf8f752f6e    3.8 GB  2 hours ago
# nomic-embed-text        latest          274 MB  2 hours ago

Ollama's API is OpenAI-compatible, so I can use the same code for local and cloud:

import OpenAI from 'openai';

// Local inference - just change baseURL
const localClient = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // Required by SDK, ignored by Ollama
});

// Cloud fallback - same API
const cloudClient = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Same call, different backend
const response = await (isOffline ? localClient : cloudClient).chat.completions.create({
  model: isOffline ? 'mistral' : 'gpt-4o',
  messages: [{ role: 'user', content: prompt }],
});
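One gap in the snippet above: `isOffline` has to come from somewhere. Before routing a request to the local client, it's worth probing whether Ollama is actually listening — a plain GET on Ollama's root URL returns 200 ("Ollama is running") when the server is up. Here's a minimal sketch; it assumes Node 18+'s built-in `fetch`, and the helper name and timeout are my choices, not part of JarvisX:

```typescript
// Probe a local Ollama instance; treat any error or timeout as "not available".
// Ollama answers GET / with 200 "Ollama is running" when the server is up.
async function isOllamaUp(baseUrl = 'http://localhost:11434'): Promise<boolean> {
  try {
    const res = await fetch(baseUrl, { signal: AbortSignal.timeout(1000) });
    return res.ok;
  } catch {
    return false; // connection refused, DNS failure, or timeout
  }
}
```

With this, the router can fall back to the cloud client when Ollama isn't running, instead of failing with a connection error.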

Building the VS Code Extension

The extension is the developer-facing layer. It took about two weeks of iteration to get the UX right.

Extension Manifest (package.json)

{
  "name": "jarvisx",
  "displayName": "JarvisX AI",
  "description": "Local AI-powered development assistant",
  "version": "1.0.0",
  "engines": { "vscode": "^1.85.0" },
  "activationEvents": ["onStartupFinished"],
  "contributes": {
    "commands": [
      {
        "command": "jarvisx.askAboutCode",
        "title": "JarvisX: Ask About Selected Code"
      },
      {
        "command": "jarvisx.explainError",
        "title": "JarvisX: Explain This Error"
      },
      {
        "command": "jarvisx.generateTests",
        "title": "JarvisX: Generate Tests for Function"
      }
    ],
    "keybindings": [
      {
        "command": "jarvisx.askAboutCode",
        "key": "ctrl+shift+j",
        "when": "editorTextFocus"
      }
    ]
  }
}

Capturing Code Context

The most important job of the extension is gathering rich context automatically:

async function captureContext(): Promise<DevContext> {
  const editor = vscode.window.activeTextEditor;
  if (!editor) return {};

  const document = editor.document;
  const selection = editor.selection;

  return {
    // Selected code or current function
    selectedCode: document.getText(selection) || getCurrentFunction(editor),
    
    // File metadata
    language: document.languageId,
    filePath: document.fileName,
    
    // Surrounding context (50 lines above and below)
    surroundingCode: getSurroundingCode(editor, 50),
    
    // Project info from the first workspace folder's package.json
    // (workspace.rootPath is deprecated in recent VS Code APIs)
    projectInfo: await getProjectInfo(vscode.workspace.workspaceFolders?.[0]?.uri.fsPath),
    
    // Recent git changes
    recentChanges: await getGitDiff(),
    
    // Open problems (lint errors, etc.)
    diagnostics: getDiagnostics(document.uri),
  };
}
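`getCurrentFunction` and `getSurroundingCode` are helpers I haven't shown, but the core of the latter is pure string logic: grab N lines either side of the cursor. A sketch of that core — the standalone function name and signature here are illustrative, not the actual JarvisX API:

```typescript
// Extract up to `window` lines above and below `line` (0-based)
// from the full document text, clamped to the document bounds.
function surroundingLines(text: string, line: number, window: number): string {
  const lines = text.split('\n');
  const start = Math.max(0, line - window);
  const end = Math.min(lines.length, line + window + 1);
  return lines.slice(start, end).join('\n');
}
```

In the extension, `getSurroundingCode(editor, 50)` would call this with `editor.document.getText()` and `editor.selection.active.line`.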

The Chat Panel (WebviewPanel)

class JarvisXChatPanel {
  private panel: vscode.WebviewPanel;

  constructor(context: vscode.ExtensionContext) {
    this.panel = vscode.window.createWebviewPanel(
      'jarvisxChat',
      'JarvisX',
      vscode.ViewColumn.Beside,
      { enableScripts: true, retainContextWhenHidden: true }
    );

    // Handle messages from the webview
    this.panel.webview.onDidReceiveMessage(async (message) => {
      if (message.type === 'userMessage') {
        const context = await captureContext();
        const stream = await jarvisxClient.streamChat(message.text, context);
        
        for await (const chunk of stream) {
          this.panel.webview.postMessage({ type: 'aiChunk', content: chunk });
        }
      }
    });
  }
}

Streaming Responses

Nobody wants to stare at a spinner for 10 seconds. Streaming token-by-token from Ollama makes the local model feel much faster:

async function* streamFromOllama(prompt: string): AsyncGenerator<string> {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'mistral',
      prompt,
      stream: true,
    }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = ''; // a JSON line can be split across two reads

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the trailing partial line for the next read

    for (const line of lines.filter(Boolean)) {
      const data = JSON.parse(line);
      if (data.response) yield data.response;
      if (data.done) return;
    }
  }
}

Performance: Local vs Cloud

Running on Apple M2 Pro (16GB RAM):

| Model | Speed (tok/s) | Latency (first token) | Quality |
|-------|---------------|-----------------------|---------|
| mistral:q4_k_m | ~35 | ~1.2s | Good |
| codellama:q4 | ~30 | ~1.4s | Excellent (code) |
| gpt-4o (cloud) | ~80 | ~0.8s | Best |
| claude-3-5-sonnet | ~75 | ~1.0s | Best |

For most daily tasks — quick code completions, explaining functions, generating test stubs — the local models are fast enough. Cloud models are reserved for complex architecture questions.


Handling the Offline/Online Transition

One tricky edge case: what happens when you go offline mid-session?

class ConnectionMonitor {
  private isOnline = true;
  
  constructor() {
    // Poll every 10 seconds
    setInterval(async () => {
      const newStatus = await this.checkConnectivity();
      
      if (this.isOnline !== newStatus) {
        this.isOnline = newStatus;
        vscode.window.showInformationMessage(
          newStatus 
            ? '🟢 JarvisX: Back online — cloud models available'
            : '🔴 JarvisX: Offline — switched to local models'
        );
        modelRouter.notifyConnectivityChange(newStatus);
      }
    }, 10_000);
  }

  private async checkConnectivity(): Promise<boolean> {
    try {
      await fetch('https://1.1.1.1', { signal: AbortSignal.timeout(2000) });
      return true;
    } catch {
      return false;
    }
  }
}
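The `modelRouter` that `ConnectionMonitor` notifies isn't shown above. Here's a minimal sketch of what it might look like — everything in it except `notifyConnectivityChange` (the class name, `pickModel`, `baseURL`, and the specific model ids) is an illustrative assumption:

```typescript
// Routes each request to a local or cloud model based on connectivity.
class ModelRouter {
  private online = true;

  notifyConnectivityChange(online: boolean): void {
    this.online = online;
  }

  // Pick a model id for the current backend and task.
  pickModel(task: 'chat' | 'code' = 'chat'): string {
    if (this.online) return 'gpt-4o';
    return task === 'code' ? 'codellama' : 'mistral';
  }

  // Local requests go through Ollama's OpenAI-compatible endpoint;
  // undefined means "use the SDK's default cloud base URL".
  baseURL(): string | undefined {
    return this.online ? undefined : 'http://localhost:11434/v1';
  }
}
```

Keeping the routing decision in one place means the chat panel and the command handlers never need to know which backend they're talking to.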

The Privacy Guarantee

When running fully local:

  • ✅ No outbound network requests for inference
  • ✅ Models stored locally at ~/.ollama/models/
  • ✅ Memory/context stored in local SQLite
  • ✅ No telemetry (I disabled it in the extension)

I validated this using lsof and Wireshark to confirm zero external connections during a local-only session.


Lessons Learned

  1. Model loading time — on startup, preload the models you use most. Cold start is 2–4 seconds.
  2. Quantization level matters — Q4_K_M is the sweet spot. Don't go below Q4 for quality-sensitive tasks.
  3. Context length vs speed — longer contexts slow inference significantly. Trim aggressively.
  4. Extension UX is hard — VS Code's WebviewPanel API has quirks. Budget extra time for the UI.
  5. Users need status indicators — always show which model is active and whether you're in offline mode.
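On lesson 1: Ollama loads a model into memory when it receives a generate request with no prompt, so startup preloading is only a few lines. A sketch — the `fetchFn` parameter exists purely to make the function easy to unit-test and defaults to the global `fetch`:

```typescript
// Warm up Ollama models at extension startup: a generate request with no
// prompt loads the model into RAM without producing any tokens.
async function preloadModels(
  models: string[],
  baseUrl = 'http://localhost:11434',
  fetchFn: typeof fetch = fetch,
): Promise<void> {
  await Promise.all(
    models.map((model) =>
      fetchFn(`${baseUrl}/api/generate`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ model }),
      }).catch(() => undefined), // ignore failures: Ollama may not be up yet
    ),
  );
}
```

Calling `preloadModels(['mistral', 'codellama'])` from the extension's `activate` hook turns the 2–4 second cold start into a one-time cost paid before the first prompt.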

Want to try it? JarvisX on GitHub | My Portfolio