February 19, 2026 · 10 min
Offline AI Development with VS Code Extension and Local Inference
Building a fully offline AI-powered development environment using a custom VS Code extension, Ollama, and locally quantized models — zero cloud dependency, full privacy.
TL;DR: What if your entire AI development stack worked without internet? No API keys. No data leaving your machine. No rate limits. Here's how I built exactly that — and what I learned making it production-ready.
The Case for Offline AI
Modern AI coding tools are powerful, but they come with a hidden cost: everything you type goes to someone else's server.
For most developers that's fine. But there are real scenarios where it isn't:
- Working on proprietary codebases with NDAs
- Traveling or in areas with poor connectivity
- Security-conscious enterprise environments
- Simply wanting full control over your data
When building JarvisX, I made offline-first a core design principle — not an afterthought.
The Stack
My offline AI development environment consists of four components:
┌─────────────────────────────────────────────────────┐
│ Developer Machine │
│ │
│ ┌──────────────────┐ ┌────────────────────┐ │
│ │ VS Code │ │ JarvisX Server │ │
│ │ Extension │────►│ (Node.js:3721) │ │
│ │ (TypeScript) │ └────────┬───────────┘ │
│ └──────────────────┘ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Ollama Server │ │
│ │ (Port: 11434) │ │
│ └────────┬────────┘ │
│ │ │
│ ┌────────────────────┼──────┐ │
│ ▼ ▼ ▼ │
│ ┌────────────────┐ ┌──────────┐ ┌────────┐ │
│ │ mistral-eng:q4 │ │codellama │ │nomic- │ │
│ │ (7B params) │ │ 7b:q4 │ │embed │ │
│ └────────────────┘ └──────────┘ └────────┘ │
└─────────────────────────────────────────────────────┘
Setting Up Local Inference with Ollama
Ollama is the simplest way to run LLMs locally. It manages model downloads, serves an OpenAI-compatible REST API, and ships models in pre-quantized builds so you don't have to quantize anything yourself.
# Install Ollama
brew install ollama
# Pull the models I use
ollama pull mistral
ollama pull codellama
ollama pull nomic-embed-text # for embeddings
# Verify they're running
ollama list
# NAME ID SIZE MODIFIED
# mistral:latest 61e88e884507 4.1 GB 2 hours ago
# codellama:latest 8fdf8f752f6e 3.8 GB 2 hours ago
# nomic-embed-text latest 274 MB 2 hours ago
Ollama's API is OpenAI-compatible, so I can use the same code for local and cloud:
import OpenAI from 'openai';
// Local inference - just change baseURL
const localClient = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama', // Required by SDK, ignored by Ollama
});
// Cloud fallback - same API
const cloudClient = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
});
// Same call, different backend
const response = await (isOffline ? localClient : cloudClient).chat.completions.create({
model: isOffline ? 'mistral' : 'gpt-4o',
messages: [{ role: 'user', content: prompt }],
});
Building the VS Code Extension
The extension is the developer-facing layer. It took about two weeks of iteration to get the UX right.
Extension Manifest (package.json)
{
"name": "jarvisx",
"displayName": "JarvisX AI",
"description": "Local AI-powered development assistant",
"version": "1.0.0",
"engines": { "vscode": "^1.85.0" },
"activationEvents": ["onStartupFinished"],
"contributes": {
"commands": [
{
"command": "jarvisx.askAboutCode",
"title": "JarvisX: Ask About Selected Code"
},
{
"command": "jarvisx.explainError",
"title": "JarvisX: Explain This Error"
},
{
"command": "jarvisx.generateTests",
"title": "JarvisX: Generate Tests for Function"
}
],
"keybindings": [
{
"command": "jarvisx.askAboutCode",
"key": "ctrl+shift+j",
"when": "editorTextFocus"
}
]
}
}
Capturing Code Context
The most important job of the extension is gathering rich context automatically:
async function captureContext(): Promise<DevContext> {
const editor = vscode.window.activeTextEditor;
if (!editor) return {};
const document = editor.document;
const selection = editor.selection;
return {
// Selected code or current function
selectedCode: document.getText(selection) || getCurrentFunction(editor),
// File metadata
language: document.languageId,
filePath: document.fileName,
// Surrounding context (50 lines above and below)
surroundingCode: getSurroundingCode(editor, 50),
// Project info from package.json (workspace.rootPath is deprecated; use workspaceFolders)
projectInfo: await getProjectInfo(vscode.workspace.workspaceFolders?.[0]?.uri.fsPath),
// Recent git changes
recentChanges: await getGitDiff(),
// Open problems (lint errors, etc.)
diagnostics: getDiagnostics(document.uri),
};
}
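The surrounding-context window has to be clamped to the document's bounds. The `getSurroundingCode` helper itself isn't shown in this post, so here is an editor-independent sketch of the range math a helper like it might use (the function name and shape are assumptions):

```typescript
// Hypothetical range math for a getSurroundingCode-style helper:
// take `radius` lines above and below the cursor, clamped to the document.
function surroundingRange(
  cursorLine: number,  // 0-based line of the cursor
  totalLines: number,  // e.g. document.lineCount
  radius: number       // e.g. 50, as in the snippet above
): { start: number; end: number } {
  return {
    start: Math.max(0, cursorLine - radius),
    end: Math.min(totalLines - 1, cursorLine + radius),
  };
}
```

The clamped range can then be handed to `document.getText` with a `vscode.Range` built from those line numbers.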
The Chat Panel (WebviewPanel)
class JarvisXChatPanel {
private panel: vscode.WebviewPanel;
constructor(context: vscode.ExtensionContext) {
this.panel = vscode.window.createWebviewPanel(
'jarvisxChat',
'JarvisX',
vscode.ViewColumn.Beside,
{ enableScripts: true, retainContextWhenHidden: true }
);
// Handle messages from the webview
this.panel.webview.onDidReceiveMessage(async (message) => {
if (message.type === 'userMessage') {
const context = await captureContext();
const stream = await jarvisxClient.streamChat(message.text, context);
for await (const chunk of stream) {
this.panel.webview.postMessage({ type: 'aiChunk', content: chunk });
}
}
});
}
}
Streaming Responses
Nobody wants to stare at a spinner for 10 seconds. Streaming token-by-token from Ollama makes the local model feel much faster:
async function* streamFromOllama(prompt: string): AsyncGenerator<string> {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'mistral',
      prompt,
      stream: true,
    }),
  });
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // A network chunk can end mid-line, so buffer the trailing partial
    // and only parse complete NDJSON lines.
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop()!;
    for (const line of lines.filter(Boolean)) {
      const data = JSON.parse(line);
      if (data.response) yield data.response;
      if (data.done) return;
    }
  }
}
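Ollama's streaming endpoint returns newline-delimited JSON, and a network chunk can end in the middle of a line. The chunk-splitting logic can be isolated as a small pure function (a sketch for illustration, not part of the actual extension):

```typescript
// Split buffered NDJSON input into complete lines plus a trailing partial.
function splitNdjson(
  buffer: string,
  chunk: string
): { lines: string[]; rest: string } {
  const parts = (buffer + chunk).split('\n');
  const rest = parts.pop()!; // may be '' or an incomplete JSON line
  return { lines: parts.filter(Boolean), rest };
}
```

Each complete line is then safe to `JSON.parse`, while `rest` is carried into the next read.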
Performance: Local vs Cloud
Running on Apple M2 Pro (16GB RAM):
| Model | Speed (tok/s) | Latency (first token) | Quality |
|-------|---------------|-----------------------|---------|
| mistral:q4_k_m | ~35 | ~1.2 s | Good |
| codellama:q4 | ~30 | ~1.4 s | Excellent (code) |
| gpt-4o (cloud) | ~80 | ~0.8 s | Best |
| claude-3-5-sonnet | ~75 | ~1.0 s | Best |
For most daily tasks — quick code completions, explaining functions, generating test stubs — the local models are fast enough. Cloud models are reserved for complex architecture questions.
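Those throughput numbers translate directly into wall-clock time, which is why streaming matters so much. A back-of-envelope estimate using the table's figures:

```typescript
// Estimated total seconds for a response of `tokens` length, given
// first-token latency (s) and sustained throughput (tok/s).
function estimatedResponseSeconds(
  tokens: number,
  firstTokenS: number,
  tokPerS: number
): number {
  return firstTokenS + tokens / tokPerS;
}
```

A 300-token answer from mistral lands around 1.2 + 300/35 ≈ 9.8 s end to end; with streaming, the first token still appears after ~1.2 s.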
Handling the Offline/Online Transition
One tricky edge case: what happens when you go offline mid-session?
class ConnectionMonitor {
private isOnline = true;
constructor() {
// Poll every 10 seconds
setInterval(async () => {
const newStatus = await this.checkConnectivity();
if (this.isOnline !== newStatus) {
this.isOnline = newStatus;
vscode.window.showInformationMessage(
newStatus
? '🟢 JarvisX: Back online — cloud models available'
: '🔴 JarvisX: Offline — switched to local models'
);
modelRouter.notifyConnectivityChange(newStatus);
}
}, 10_000);
}
private async checkConnectivity(): Promise<boolean> {
try {
await fetch('https://1.1.1.1', { signal: AbortSignal.timeout(2000) });
return true;
} catch {
return false;
}
}
}
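The `modelRouter` referenced above isn't shown in this post; a minimal sketch of the shape it might take (the model names follow the rest of the stack, everything else is an assumption):

```typescript
// Hypothetical ModelRouter: remembers connectivity and answers
// which model the next request should use.
class ModelRouter {
  private online = true;

  notifyConnectivityChange(online: boolean): void {
    this.online = online;
  }

  currentModel(): string {
    return this.online ? 'gpt-4o' : 'mistral';
  }
}
```

Keeping the routing decision in one place means the chat panel and command handlers never need to know whether they're talking to Ollama or the cloud.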
The Privacy Guarantee
When running fully local:
- ✅ No outbound network requests for inference
- ✅ Models stored locally at ~/.ollama/models/
- ✅ Memory/context stored in local SQLite
- ✅ No telemetry (I disabled it in the extension)
I validated this with `lsof` and Wireshark, confirming zero external connections during a local-only session.
Lessons Learned
- Model loading time — on startup, preload the models you use most. Cold start is 2–4 seconds.
- Quantization level matters — Q4_K_M is the sweet spot. Don't go below Q4 for quality-sensitive tasks.
- Context length vs speed — longer contexts slow inference significantly. Trim aggressively.
- Extension UX is hard — VS Code's WebviewPanel API has quirks. Budget extra time for the UI.
- Users need status indicators — always show which model is active and whether you're in offline mode.
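The "trim aggressively" lesson can be made concrete. One simple policy, sketched here under a crude ~4-characters-per-token heuristic (not the extension's actual implementation), is to keep only the most recent lines that fit a token budget:

```typescript
// Keep the most recent lines that fit within a rough token budget.
function trimContext(lines: string[], tokenBudget: number): string[] {
  const charBudget = tokenBudget * 4; // crude chars-per-token heuristic
  const kept: string[] = [];
  let used = 0;
  for (let i = lines.length - 1; i >= 0; i--) {
    used += lines[i].length + 1; // +1 for the newline
    if (used > charBudget) break;
    kept.unshift(lines[i]);
  }
  return kept;
}
```

Recency-based trimming works well for surrounding-code context; for chat history you'd likely want to pin the system prompt before trimming.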
Want to try it? JarvisX on GitHub | My Portfolio