February 19, 2026 · 10 min
Lessons in Building Scalable AI SaaS Platforms
Key architectural and product lessons from building AutomateLanka — an all-in-one automation SaaS platform built on n8n workflow foundations with multi-tenant isolation and AI search.
TL;DR: Building AutomateLanka — a full automation SaaS platform similar to n8n — taught me more about AI SaaS architecture in 6 months than I could have learned from any course. Here are the lessons that matter most.
What "Scalable AI SaaS" Actually Means
Before diving in, let's define the term. A scalable AI SaaS platform needs to:
- Execute AI workflows reliably at volume, not just one-at-a-time
- Isolate tenant data and compute — one customer's heavy workflow shouldn't starve others
- Handle failure gracefully — AI calls fail; workflow execution must recover
- Stay cost-efficient — AI inference is expensive; you need smart optimization
- Operate transparently — customers need visibility into what's running
These requirements drive completely different architecture decisions than a standard CRUD SaaS.
Lesson 1: Queue Everything That Runs AI
The biggest mistake in AI SaaS: calling AI APIs directly from HTTP request handlers.
// ❌ WRONG: Direct AI API call in the request handler
router.post('/workflows/:id/run', async (req, res) => {
  const result = await runWorkflow(req.params.id); // Could take 10+ seconds
  return res.json(result); // HTTP timeout if the AI call is slow
});
// ✅ RIGHT: Queue the work, return immediately
router.post('/workflows/:id/run', async (req, res) => {
  const job = await workflowQueue.add('execute', {
    workflowId: req.params.id,
    tenantId: req.tenant.id,
    triggeredBy: req.user.id,
  });
  return res.json({
    jobId: job.id,
    status: 'queued',
    trackingUrl: `/api/jobs/${job.id}/status`,
  });
});
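The `trackingUrl` above implies a status endpoint. Here's one way it could look — the BullMQ-facing types are reduced to structural stand-ins so the sketch stays self-contained, and `getJobStatus` / `toPublicStatus` are illustrative names, not part of the original code:

```typescript
// Minimal structural types standing in for BullMQ's Queue/Job classes
interface JobLike {
  id: string;
  progress: number | object;
  failedReason?: string;
  getState(): Promise<string>;
}
interface QueueLike {
  getJob(id: string): Promise<JobLike | undefined>;
}

// Map BullMQ job states onto the API's public status vocabulary
export function toPublicStatus(
  state: string
): 'queued' | 'running' | 'completed' | 'failed' | 'unknown' {
  switch (state) {
    case 'completed': return 'completed';
    case 'failed':    return 'failed';
    case 'active':    return 'running';
    case 'waiting':
    case 'delayed':   return 'queued';
    default:          return 'unknown';
  }
}

// Handler body for GET /api/jobs/:id/status
export async function getJobStatus(queue: QueueLike, jobId: string) {
  const job = await queue.getJob(jobId);
  if (!job) return { error: 'job not found' as const };
  return {
    jobId: job.id,
    status: toPublicStatus(await job.getState()),
    progress: job.progress,
    failedReason: job.failedReason ?? null,
  };
}
```

Clients poll this endpoint (or subscribe over WebSocket) instead of holding an HTTP connection open for the whole workflow run.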
I use BullMQ on top of Redis for workflow execution queuing:
// Queue setup
import { Queue, Worker } from 'bullmq';

export const workflowQueue = new Queue('workflows', {
  connection: redisConnection,
  defaultJobOptions: {
    attempts: 3,
    backoff: { type: 'exponential', delay: 5000 },
    removeOnComplete: { count: 1000 },
    removeOnFail: { count: 5000 },
  },
});

// Worker: separate process, scales independently
const worker = new Worker('workflows', async (job) => {
  const { workflowId, tenantId } = job.data;
  return await executeWorkflow(workflowId, tenantId, {
    onProgress: (progress) => job.updateProgress(progress),
  });
}, {
  connection: redisConnection,
  concurrency: 10, // Process 10 workflows simultaneously per worker
  limiter: {
    max: 100,        // Max 100 jobs per 60 seconds
    duration: 60000, // (tenant rate limiting handled separately)
  },
});
Lesson 2: Tenant-Level Rate Limiting from Day One
Without per-tenant rate limiting, one power user can exhaust your AI API budget and degrade service for everyone.
// Tenant-aware rate limiter with BullMQ
async function addToTenantQueue(
  tenantId: string,
  workflowId: string,
  priority: number
) {
  // Rate limit: cap concurrent executions per tenant, by plan
  const tenant = await getTenant(tenantId);
  const tenantConcurrency = await getTenantConcurrency(tenantId);
  if (tenantConcurrency >= TENANT_LIMITS[tenant.plan].maxConcurrent) {
    throw new PlanLimitError('Maximum concurrent workflows reached');
  }
  return workflowQueue.add('execute', { workflowId, tenantId }, {
    priority, // Enterprise plans get higher priority
    jobId: `${tenantId}-${workflowId}-${Date.now()}`,
  });
}

// Plan limits (-1 = unlimited)
const TENANT_LIMITS = {
  FREE: { maxConcurrent: 1, monthlyRuns: 100, aiTokens: 50_000 },
  PRO: { maxConcurrent: 5, monthlyRuns: 5_000, aiTokens: 1_000_000 },
  ENTERPRISE: { maxConcurrent: 20, monthlyRuns: -1, aiTokens: -1 },
};
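`getTenantConcurrency` is left undefined above. One plausible implementation keeps a per-tenant counter in Redis that the worker increments when a job starts and decrements when it settles. This is a sketch — `RedisLike` is a structural stand-in for an ioredis client, and `markStarted` / `markSettled` are illustrative names:

```typescript
// Structural stand-in for the subset of a Redis client this sketch needs
interface RedisLike {
  incr(key: string): Promise<number>;
  decr(key: string): Promise<number>;
  get(key: string): Promise<string | null>;
}

const keyFor = (tenantId: string) => `tenant:${tenantId}:running`;

// How many of this tenant's workflows are executing right now
export async function getTenantConcurrency(
  redis: RedisLike,
  tenantId: string
): Promise<number> {
  return Number((await redis.get(keyFor(tenantId))) ?? 0);
}

// Call from the worker just before executeWorkflow()
export async function markStarted(redis: RedisLike, tenantId: string) {
  return redis.incr(keyFor(tenantId));
}

// Call in a finally block so crashes and failures still decrement
export async function markSettled(redis: RedisLike, tenantId: string) {
  const n = await redis.decr(keyFor(tenantId));
  return Math.max(n, 0);
}
```

In production the decrement belongs in a `finally` (and the key should carry a TTL) so a crashed worker can't permanently inflate a tenant's count.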
Lesson 3: Design for Partial Failure
AI workflows have many failure points. Each step can fail independently:
Node 1 (HTTP Request) → ✅ Success
Node 2 (AI Enrichment) → ❌ OpenAI Rate Limit
Node 3 (Database Write) → Never executed
Node 4 (Notification) → Never executed
The system must:
- Record exactly which step failed and why
- Allow resuming from the failure point after fixing the issue
- Not double-execute successfully completed steps
// Checkpoint-based execution
async function executeWorkflow(workflowId: string, tenantId: string) {
  const workflow = await getWorkflow(workflowId);
  const execution = await createExecution(workflowId);
  let previousOutput: unknown;

  for (const node of workflow.nodes) {
    // Skip already-completed nodes (supports resume)
    const checkpoint = await getCheckpoint(execution.id, node.id);
    if (checkpoint?.status === 'completed') {
      previousOutput = checkpoint.output;
      continue;
    }
    try {
      const output = await executeNode(node, previousOutput, tenantId);
      // Save checkpoint immediately after success
      await saveCheckpoint(execution.id, node.id, 'completed', output);
      previousOutput = output;
    } catch (error) {
      await saveCheckpoint(execution.id, node.id, 'failed', null, error.message);
      await updateExecution(execution.id, 'failed', { failedAt: node.id, error: error.message });
      throw error; // Let BullMQ handle the retry
    }
  }
  await updateExecution(execution.id, 'completed');
}
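To see why the checkpoint loop makes resume safe, here is a compact in-memory model of the same logic (the real version persists checkpoints via the database helpers above; `runWithCheckpoints` is an illustrative name):

```typescript
type Checkpoint = { status: 'completed' | 'failed'; output?: unknown };

// Same shape as the loop above, with a Map standing in for the checkpoint table
export async function runWithCheckpoints(
  nodes: { id: string; run: (input: unknown) => Promise<unknown> }[],
  store: Map<string, Checkpoint>,
) {
  let previousOutput: unknown;
  for (const node of nodes) {
    const cp = store.get(node.id);
    if (cp?.status === 'completed') {
      previousOutput = cp.output; // resume: skip finished work
      continue;
    }
    try {
      previousOutput = await node.run(previousOutput);
      store.set(node.id, { status: 'completed', output: previousOutput });
    } catch (err) {
      store.set(node.id, { status: 'failed' });
      throw err; // let the queue retry the whole job
    }
  }
  return previousOutput;
}
```

Run it twice against the same store — once where a node fails, once after the fault clears — and each successful node executes exactly once; only the failed node is retried.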
Lesson 4: AI Token Usage is Your COGS — Track It Religiously
Unlike compute costs that scale predictably, AI token costs are chaotic. One poorly controlled workflow can incur $50 in API costs overnight.
// Token usage tracking wrapper
async function trackableOpenAICall(
  tenantId: string,
  params: OpenAI.ChatCompletionCreateParams
): Promise<OpenAI.ChatCompletion> {
  // Pre-flight: check the tenant has remaining token budget
  const remaining = await getRemainingTokenBudget(tenantId);
  const estimated = estimateTokens(params);
  if (estimated > remaining) {
    throw new PlanLimitError('AI token budget exhausted for this billing period');
  }

  const start = Date.now();
  const response = await openai.chat.completions.create(params);
  const latency = Date.now() - start;

  // Track actual usage
  const { prompt_tokens, completion_tokens, total_tokens } = response.usage!;
  await prisma.aiUsageLog.create({
    data: {
      tenantId,
      model: params.model,
      promptTokens: prompt_tokens,
      completionTokens: completion_tokens,
      totalTokens: total_tokens,
      estimatedCost: calculateCost(params.model, total_tokens),
      latencyMs: latency,
      timestamp: new Date(),
    },
  });

  // Update tenant running total
  await incrementTokenUsage(tenantId, total_tokens);
  return response;
}
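`calculateCost` in the snippet takes only a total token count, but providers usually price prompt and completion tokens differently, so a more accurate variant tracks both. A sketch — the rates below are placeholders, not real OpenAI prices; in production they belong in config, since providers change pricing:

```typescript
// PLACEHOLDER rates per 1K tokens — illustrative only, load real prices from config
const PRICE_PER_1K: Record<string, { prompt: number; completion: number }> = {
  'gpt-4o':      { prompt: 0.0050, completion: 0.0150 },
  'gpt-4o-mini': { prompt: 0.0002, completion: 0.0006 },
};

// Cost in dollars for one call, pricing prompt and completion tokens separately
export function calculateCost(
  model: string,
  promptTokens: number,
  completionTokens: number
): number {
  const rate = PRICE_PER_1K[model];
  if (!rate) throw new Error(`No pricing configured for model: ${model}`);
  return (
    (promptTokens / 1000) * rate.prompt +
    (completionTokens / 1000) * rate.completion
  );
}
```

Failing loudly on an unknown model is deliberate: silently logging a cost of zero for a new model is exactly the kind of bug that makes COGS dashboards lie.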
Lesson 5: Semantic Search Changes How Users Interact
AutomateLanka has 200+ built-in workflow templates. With keyword search, users struggle to find what they need. Semantic search made this dramatically better:
User types: "when a customer pays, send them a welcome email and update spreadsheet"
Keyword search: 0 results matching all keywords
Semantic search: Finds "Payment Confirmation → CRM Update → Email Notification" template
The semantic search index refreshes incrementally as new templates are added:
// Index workflow templates for semantic search (Transformers.js / Xenova — runs locally in Node.js, no API needed)
import { pipeline } from '@xenova/transformers';

const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function indexWorkflowTemplate(template: WorkflowTemplate) {
  const text = `${template.name} ${template.description} ${template.tags.join(' ')}
Use cases: ${template.useCases.join('. ')}`;

  const output = await embedder(text, { pooling: 'mean', normalize: true });
  const embedding = Array.from(output.data);

  // pgvector expects the '[x,y,z]' text form; a raw JS array won't cast cleanly
  await prisma.$executeRaw`
    UPDATE workflow_templates
    SET embedding = ${`[${embedding.join(',')}]`}::vector
    WHERE id = ${template.id}
  `;
}
Important: I used Xenova (a JS/Node.js port of Hugging Face Transformers) to run embeddings locally — zero API cost, no network latency, no API key needed. For a SaaS platform serving millions of embedding calls, this saves significant money.
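The query side is the mirror image: embed the user's search text with the same model, then rank templates by pgvector's cosine distance operator (`<=>`). This is a sketch, not the original code — `searchTemplates` takes the embedder and Prisma client as parameters to stay self-contained, and `toVectorLiteral` builds the `'[x,y,z]'` text form pgvector accepts:

```typescript
// Serialize an embedding into pgvector's text literal form, e.g. '[0.1,0.2]'
export function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(',')}]`;
}

// Hypothetical query function; embedder/prisma are the instances from the
// indexing code above, passed in here so the sketch compiles on its own
export async function searchTemplates(
  embedder: any,
  prisma: any,
  query: string,
  limit = 10
) {
  const output = await embedder(query, { pooling: 'mean', normalize: true });
  const literal = toVectorLiteral(Array.from(output.data as Float32Array));
  // <=> is cosine distance, so 1 - distance gives a similarity score
  return prisma.$queryRaw`
    SELECT id, name, description,
           1 - (embedding <=> ${literal}::vector) AS similarity
    FROM workflow_templates
    ORDER BY embedding <=> ${literal}::vector
    LIMIT ${limit}
  `;
}
```

Because `all-MiniLM-L6-v2` produces normalized vectors, cosine distance ordering here matches what an inner-product index would give, so either pgvector index type works.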
Lesson 6: Observability is Non-Negotiable
Without good observability, debugging production issues in AI workflows is a nightmare.
Everything goes to a structured log:
// Structured log for every workflow step
await logger.info('workflow.node.executed', {
  tenantId,
  workflowId,
  executionId,
  nodeId,
  nodeType,
  durationMs,
  tokenUsage: { prompt, completion, total },
  inputSize: JSON.stringify(input).length,
  outputSize: JSON.stringify(output).length,
  success: true,
});

await logger.error('workflow.node.failed', {
  tenantId,
  workflowId,
  nodeId,
  error: error.message,
  errorCode: error.code,
  willRetry: attempt < maxAttempts,
  attempt,
});
The tenant-facing execution dashboard shows exactly which nodes ran, how long each took, and what failed — giving customers visibility and reducing support tickets dramatically.
Lesson 7: Build the Admin Panel Early
As a multi-tenant platform, you need operational tooling for yourself:
- View all tenant executions and search by error
- Manually retry failed jobs
- Adjust plan limits without code changes
- View AI cost breakdown by tenant
- Kill runaway workflows
I spent two weeks on the admin panel early and recovered those two weeks within a month through faster debugging.
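As a sketch of the "kill runaway workflows" action (illustrative names, with structural stand-ins for the BullMQ types): an actively running BullMQ job can't simply be removed — the worker has to observe a cancellation signal and abort cooperatively — so the admin action should report that case honestly rather than pretend it succeeded.

```typescript
// Structural stand-ins for the BullMQ Job/Queue methods this sketch touches
interface AdminJobLike {
  getState(): Promise<string>;
  remove(): Promise<void>;
}
interface AdminQueueLike {
  getJob(id: string): Promise<AdminJobLike | undefined>;
}

// Admin "kill job" action; the return value drives the admin UI's message
export async function killJob(
  queue: AdminQueueLike,
  jobId: string
): Promise<'removed' | 'not-found' | 'active'> {
  const job = await queue.getJob(jobId);
  if (!job) return 'not-found';
  const state = await job.getState();
  if (state === 'active') {
    // Can't remove an in-flight job; set a cancellation flag (e.g. in Redis)
    // that the worker checks between nodes, then abort cooperatively
    return 'active';
  }
  await job.remove(); // queued/delayed/settled jobs can be removed directly
  return 'removed';
}
```

The checkpoint loop from Lesson 3 is a natural place for the worker to poll that cancellation flag: between nodes, where aborting is always safe.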
The Architecture That Emerged
Next.js Frontend (Vercel Edge)
            │
     REST + WebSocket
            │
Node.js API Server (Railway)
    │            │
    │            └── BullMQ + Redis (Upstash)
    │                        │
    │                 Worker Processes
    │                 (AI execution)
    │
PostgreSQL + pgvector (Supabase)
Critically: the API server and the workers are separate processes. Workers scale independently based on queue depth — you don't need to scale the API server when workflow volume peaks.
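That scale-by-queue-depth decision reduces to a small pure function. A sketch with illustrative thresholds — the waiting count would come from something like BullMQ's `getJobCounts()`, and the result would drive whatever autoscaler the platform uses:

```typescript
// Decide how many worker processes to run for a given backlog.
// All thresholds are illustrative, not values from the original system.
export function desiredWorkerCount(
  waitingJobs: number,
  jobsPerWorker = 50, // backlog one worker can drain at acceptable latency
  minWorkers = 1,     // keep one warm worker even when the queue is empty
  maxWorkers = 10,    // cap spend and downstream AI API pressure
): number {
  const needed = Math.ceil(waitingJobs / jobsPerWorker);
  return Math.max(minWorkers, Math.min(maxWorkers, needed));
}
```

The `maxWorkers` cap matters more than it looks: scaling workers without bound just moves the bottleneck to the AI provider's rate limits.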
Summary: The Key Decisions That Mattered
| Decision | Why It Mattered |
|----------|-----------------|
| Queue-first execution | Eliminated timeouts, enabled retry logic |
| Per-tenant rate limiting | Prevented cost overruns and abuse |
| Checkpoint-based execution | Enabled resume and prevented double-execution |
| Token usage tracking | Made costs visible and controllable |
| Local embedding model | Zero API cost for semantic search |
| Workers as separate processes | Independent scaling, better resource control |