Guide
AI | June 23, 2026 | 9 min read

Building a Cost-Efficient AI-Powered SaaS MVP: Next.js 16 Edge Functions with Local Model Inference

Slash AI API costs for your SaaS MVP. Learn to leverage Next.js Edge Functions with local model inference (Ollama, Transformers.js) for faster, cheaper, and more private AI features.


TL;DR

  • Traditional LLM API costs (OpenAI, Anthropic) kill MVP margins at scale. Stop bleeding cash.
  • Leverage Next.js Edge Functions with local inference (e.g., ollama, or Transformers.js via the @xenova/transformers package) for massive cost savings.
  • This architecture delivers lower latency, better data privacy, and full control over your AI stack.
  • Use Edge Functions to proxy requests to a dedicated ollama server or run smaller models directly on the edge.
  • Build powerful, profitable AI features for your SaaS MVP without relying on expensive third-party tokens.

Why Your AI MVP Is Bleeding Cash (and How to Fix It)

You've got a brilliant AI-powered SaaS idea. Great. Now check your anticipated API bill. For many founders, especially early-stage MVPs, per-token pricing from major LLM providers becomes a silent killer. Every user interaction, every content generation, every API call has a price tag. And it adds up fast.

Scaling an AI product on third-party APIs means your costs grow at least linearly with usage, and often faster once longer prompts, growing conversation context, and retries inflate the tokens billed per request. This hits your margins hard, making profitability an uphill battle. You're paying for someone else's infrastructure, someone else's data, and someone else's business model.

The fix isn't complicated: take control.

Next.js Edge: The Playground for Local AI Inference

Next.js App Router, combined with Edge Functions, offers a potent environment for deploying AI capabilities efficiently. Edge Functions are serverless, distributed globally, and execute close to your users, offering minimal latency. They're perfect for lightweight compute tasks, and with a clever architecture, they can front-end even complex AI operations.

The core idea is to move from "pay-per-token" to "pay-for-compute." Instead of sending your data to an external API and paying for every token, you run the models yourself.

We're exploring two primary strategies for integrating local AI inference with Next.js Edge:

  1. Direct Edge Inference: Running smaller, specialized models directly within the Edge Function runtime.
  2. Edge as a Proxy: Using Edge Functions to securely proxy requests to a dedicated inference server running larger local models.

Option 1: Direct Edge Inference with Transformers.js

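For tasks like sentiment analysis, basic text classification, summarization of short texts, or generating embeddings, you don't always need a massive LLM. Hugging Face's Transformers.js (published on npm as @xenova/transformers) runs smaller, optimized ONNX models in the browser and in other JavaScript runtimes. Edge runtimes such as Vercel's Edge Functions impose tight bundle-size, memory, and CPU-time limits, so treat the example below as a starting point and confirm that your chosen model and the library's WASM backend fit within your platform's limits.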

This means zero external API calls for these specific tasks.

// app/api/sentiment/route.ts
import { pipeline } from '@xenova/transformers';
import { NextResponse } from 'next/server';

export const runtime = 'edge'; // Crucial for Edge Function deployment

let classifier: any; // Cache the pipeline for subsequent requests

export async function POST(request: Request) {
  try {
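    const { text } = await request.json();

    // Reject missing, non-string, or whitespace-only input before loading the model
    if (typeof text !== 'string' || !text.trim()) {
      return NextResponse.json({ error: 'Text is required' }, { status: 400 });
    }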

    if (!classifier) {
      // Initialize the pipeline only once on cold start
      // This model is small enough for Edge deployment
      classifier = await pipeline('sentiment-analysis', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english');
    }

    const output = await classifier(text);
    return NextResponse.json({ sentiment: output[0].label, score: output[0].score });

  } catch (error) {
    console.error('Sentiment analysis failed:', error);
    return NextResponse.json({ error: 'Internal Server Error' }, { status: 500 });
  }
}

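When it fits within your platform's limits, this setup is very cost-efficient. You pay only for Edge Function execution time, with no per-token fees. Expect model inference to take longer than a typical Edge request, and measure it for your model and input sizes.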
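One detail in the route above is worth a closer look: the cached classifier is checked before an await, so two requests that arrive during the same cold start could both start loading the model. A minimal sketch of one way around that is to cache the promise instead of the resolved value. The helper name memoizeAsync is hypothetical and is not part of Next.js or Transformers.js.

```typescript
// Hypothetical helper: wraps an async factory so it runs at most once per
// runtime instance, even when several callers arrive before it resolves.
function memoizeAsync<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) {
      cached = factory().catch((err) => {
        // Clear the cache on failure so a later request can retry the load.
        cached = undefined;
        throw err;
      });
    }
    return cached;
  };
}
```

In the sentiment route this could wrap the pipeline(...) call once at module level, with the handler awaiting the returned getter on every request.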

Option 2: Next.js Edge as a Proxy to a Local Inference Server

Edge Functions have resource limits (memory, CPU). You won't be running Llama 3 70B directly on them. This is where a dedicated inference server comes in. Tools like ollama (which leverages llama.cpp) make it incredibly simple to run large language models on your own hardware (a VPS, a dedicated server, or even a robust cloud instance with GPUs).

The architecture looks like this:

Client (Browser) -> Next.js Edge Function (API Route) -> Your Ollama Server -> Local LLM

Why this proxy?

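  • Security: Clients never talk to your ollama server directly. A deployed Edge Function runs on your hosting provider's network, so the server must still be reachable from it; restrict access with a firewall allowlist, a private tunnel, or a shared secret that only the Edge Function knows.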
  • Authentication & Authorization: Implement your own logic within the Edge Function before hitting the LLM.
  • Orchestration: Pre-process prompts, post-process responses, or chain multiple local models.
  • Cost Control: You own the compute. Pay for the server instance, not per token.
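As a rough illustration of the authentication point, the Edge Function could check a bearer token before forwarding anything to the model. This is a minimal sketch under the assumption of a single shared secret; the helper name isAuthorized and the API_TOKEN variable mentioned below are hypothetical, and a real SaaS would more likely validate a user session or a signed token.

```typescript
// Hypothetical helper: validates an "Authorization: Bearer <token>" header
// against a secret held by the Edge Function.
function isAuthorized(authorizationHeader: string | null, expectedToken: string): boolean {
  if (!authorizationHeader || expectedToken.length === 0) return false;
  const [scheme, token] = authorizationHeader.split(' ');
  if (scheme !== 'Bearer' || !token) return false;
  if (token.length !== expectedToken.length) return false;
  // Compare every character instead of exiting on the first mismatch,
  // so response timing reveals less about the secret.
  let diff = 0;
  for (let i = 0; i < token.length; i++) {
    diff |= token.charCodeAt(i) ^ expectedToken.charCodeAt(i);
  }
  return diff === 0;
}
```

Inside a route handler, the check would run first, for example against request.headers.get('authorization') and an environment variable such as API_TOKEN, returning a 401 response when it fails.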

Here's an example of an Edge Function proxying a request to an ollama server:

// app/api/ollama-proxy/route.ts
import { NextResponse } from 'next/server';

export const runtime = 'edge';

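// URL of your Ollama server. The localhost fallback only works in local development;
// a deployed Edge Function needs an address it can actually reach.
const OLLAMA_SERVER_URL = process.env.OLLAMA_SERVER_URL || 'http://localhost:11434';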

export async function POST(request: Request) {
  try {
    const { prompt, model = 'llama3' } = await request.json();

    if (!prompt) {
      return NextResponse.json({ error: 'Prompt is required' }, { status: 400 });
    }

    const ollamaResponse = await fetch(`${OLLAMA_SERVER_URL}/api/generate`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        model,
        prompt,
        stream: false, // For simpler responses, set to true for streaming
      }),
    });

    if (!ollamaResponse.ok) {
      const errorText = await ollamaResponse.text();
      console.error('Ollama server error:', ollamaResponse.status, errorText);
      return NextResponse.json({ error: `Ollama server error: ${errorText}` }, { status: ollamaResponse.status });
    }

    const data = await ollamaResponse.json();
    return NextResponse.json(data);

  } catch (error) {
    console.error('Edge function proxy failed:', error);
    return NextResponse.json({ error: 'Internal Server Error' }, { status: 500 });
  }
}

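This Edge route acts as a gateway. It adds little overhead, gives you a single place to enforce authentication, and keeps clients from talking to your ollama endpoint directly. As written, the example does not authenticate callers, so add that check before exposing it. This architecture is a game-changer for building AI features at scale within your SaaS MVP. Need help architecting this for your specific use case? Makershot offers dedicated AI Feature Integration services.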
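The proxy code sets stream to false and notes that it can be set to true. With streaming enabled, Ollama's /api/generate endpoint returns newline-delimited JSON, where each line carries a partial response string and a done flag. A minimal sketch of a parser that tolerates chunks ending mid-line might look like this; createNdjsonAccumulator is a hypothetical helper, and decoding bytes to strings is assumed to happen before chunks are pushed in.

```typescript
// Only the fields this sketch reads from each streamed line.
interface GenerateChunk {
  response?: string;
  done?: boolean;
}

// Hypothetical helper: buffers raw text chunks and collects the generated
// text from every complete line seen so far.
function createNdjsonAccumulator() {
  let buffer = '';
  let text = '';
  let done = false;
  return {
    push(chunk: string): void {
      buffer += chunk;
      const lines = buffer.split('\n');
      // The last element is an incomplete line (or empty); keep it for later.
      buffer = lines.pop() ?? '';
      for (const line of lines) {
        if (line.trim() === '') continue;
        const parsed = JSON.parse(line) as GenerateChunk;
        text += parsed.response ?? '';
        if (parsed.done) done = true;
      }
    },
    get text(): string {
      return text;
    },
    get done(): boolean {
      return done;
    },
  };
}
```

The Edge route itself could then pass the upstream body through unchanged, leaving this parsing to the client as it reads the stream.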

Building a Practical Example: AI Content Rewriter

Let's tie it all together with a practical use case: an AI-powered content rewriter. This could be a core feature for a content marketing SaaS or a writing assistant.

Setting up Your Edge Route for Ollama Proxy

First, define an Edge API route that takes user input, crafts a prompt for your local LLM, and proxies the request to your ollama server.

// app/api/rewrite/route.ts
import { NextResponse } from 'next/server';

export const runtime = 'edge';
const OLLAMA_SERVER_URL = process.env.OLLAMA_SERVER_URL || 'http://localhost:11434'; // Ensure this points to your Ollama server

export async function POST(request: Request) {
  try {
    const { text, tone = 'professional' } = await request.json();

    if (!text) {
      return NextResponse.json({ error: 'Text to rewrite is required' }, { status: 400 });
    }

    const systemPrompt = `You are a helpful AI assistant. Rewrite the following text in a ${tone} tone. Focus on clarity, conciseness, and impact.`;
    const userPrompt = text;

    const ollamaPayload = {
      model: 'llama3', // Or any other model you've loaded in Ollama
      prompt: `${systemPrompt}\n\n${userPrompt}`,
      stream: false,
    };

    const ollamaResponse = await fetch(`${OLLAMA_SERVER_URL}/api/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(ollamaPayload),
    });

    if (!ollamaResponse.ok) {
      const errorBody = await ollamaResponse.text();
      console.error('Ollama rewrite failed:', ollamaResponse.status, errorBody);
      return NextResponse.json({ error: `Failed to rewrite text: ${errorBody}` }, { status: ollamaResponse.status });
    }

    const data = await ollamaResponse.json();
    // Ollama's /api/generate usually returns an object with a 'response' field
    const rewrittenText = data.response;

    return NextResponse.json({ original: text, rewritten: rewrittenText });

  } catch (error) {
    console.error('Content rewrite edge function error:', error);
    return NextResponse.json({ error: 'Internal server error during rewrite' }, { status: 500 });
  }
}
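The route above interpolates the client-supplied tone straight into the prompt, which lets a caller smuggle arbitrary instructions in through that field. One possible hardening, sketched here, restricts tone to the values the UI offers and sends the instructions through the system field that /api/generate accepts. The names ALLOWED_TONES, normalizeTone, and buildRewritePayload are hypothetical.

```typescript
// Tones the UI is assumed to offer; anything else falls back to the default.
const ALLOWED_TONES = ['professional', 'casual', 'academic', 'concise'] as const;
type Tone = (typeof ALLOWED_TONES)[number];

// Hypothetical helper: maps untrusted input onto a known tone.
function normalizeTone(value: unknown): Tone {
  const match = ALLOWED_TONES.find((t) => t === value);
  return match ?? 'professional';
}

// Hypothetical helper: builds the request body for Ollama's /api/generate,
// keeping instructions (system) separate from user text (prompt).
function buildRewritePayload(text: string, tone: unknown, model = 'llama3') {
  const safeTone = normalizeTone(tone);
  return {
    model,
    system: `You are a helpful AI assistant. Rewrite the user's text in a ${safeTone} tone. Focus on clarity, conciseness, and impact.`,
    prompt: text,
    stream: false,
  };
}
```

The allowlist does not stop prompt injection through the text itself, so treat the model's output as untrusted content when rendering it.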

Client-Side Interaction

Now, a simple React component can consume this Edge API route.

// app/page.tsx (or a component)
'use client';

import { useState } from 'react';

export default function ContentRewriter() {
  const [inputText, setInputText] = useState('');
  const [rewrittenText, setRewrittenText] = useState('');
  const [tone, setTone] = useState('professional');
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState('');

  const handleRewrite = async () => {
    setLoading(true);
    setError('');
    setRewrittenText('');

    try {
      const response = await fetch('/api/rewrite', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ text: inputText, tone }),
      });

      if (!response.ok) {
        const errorData = await response.json();
        throw new Error(errorData.error || 'Failed to rewrite content');
      }

      const data = await response.json();
      setRewrittenText(data.rewritten);

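    } catch (err: unknown) {
      // Anything can be thrown, so only read .message from real Error objects
      setError(err instanceof Error ? err.message : 'Failed to rewrite content');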
    } finally {
      setLoading(false);
    }
  };

  return (
    <div className="max-w-2xl mx-auto p-4">
      <h2 className="text-2xl font-bold mb-4">AI Content Rewriter</h2>
      <div className="mb-4">
        <label htmlFor="tone-select" className="block text-sm font-medium text-gray-700">Tone:</label>
        <select
          id="tone-select"
          className="mt-1 block w-full rounded-md border-gray-300 shadow-sm focus:border-indigo-500 focus:ring-indigo-500 sm:text-sm p-2"
          value={tone}
          onChange={(e) => setTone(e.target.value)}
          disabled={loading}
        >
          <option value="professional">Professional</option>
          <option value="casual">Casual</option>
          <option value="academic">Academic</option>
          <option value="concise">Concise</option>
        </select>
      </div>
      <textarea
        className="w-full p-3 border rounded-md shadow-sm mb-4"
        rows={6}
        placeholder="Enter text to rewrite..."
        value={inputText}
        onChange={(e) => setInputText(e.target.value)}
        disabled={loading}
      />
      <button
        onClick={handleRewrite}
        className="px-6 py-3 bg-blue-600 text-white font-semibold rounded-md shadow hover:bg-blue-700 disabled:opacity-50"
        disabled={loading || !inputText.trim()}
      >
        {loading ? 'Rewriting...' : 'Rewrite Content'}
      </button>

      {error && <p className="text-red-500 mt-4">{error}</p>}

      {rewrittenText && (
        <div className="mt-6 p-4 bg-gray-50 border rounded-md">
          <h3 className="text-lg font-semibold mb-2">Rewritten Content:</h3>
          <p className="whitespace-pre-wrap">{rewrittenText}</p>
        </div>
      )}
    </div>
  );
}

This simple application demonstrates how to create a complete AI feature using Next.js Edge and a local LLM, all within your control.

Cost-Efficiency Deep Dive: The Numbers Don't Lie

Let's get specific. A representative price for a hosted LLM might be $0.0005 per 1,000 input tokens and $0.0015 per 1,000 output tokens. A typical rewrite or summarization might use ~2,000 input tokens and ~1,000 output tokens. That's (2 * 0.0005) + (1 * 0.0015) = $0.001 + $0.0015 = $0.0025 per request.

Sounds small, right? Now, scale it.

  • 10,000 requests/day: $25/day or $750/month.
  • 100,000 requests/day: $250/day or $7,500/month.

For an MVP, $7,500/month just for AI inference is a massive burn.
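To rerun these numbers with your own prices and token counts, the arithmetic can be wrapped in two small functions. This is only a sketch of the calculation above; the function and field names are made up for illustration, and a 30-day month is assumed as in the figures above.

```typescript
interface TokenPricing {
  inputPer1k: number; // USD per 1,000 input tokens
  outputPer1k: number; // USD per 1,000 output tokens
}

// Cost of one request in USD, given its token counts.
function costPerRequest(inputTokens: number, outputTokens: number, pricing: TokenPricing): number {
  return (inputTokens / 1000) * pricing.inputPer1k + (outputTokens / 1000) * pricing.outputPer1k;
}

// Monthly API bill in USD for a steady daily request volume.
function monthlyApiCost(requestsPerDay: number, perRequest: number, daysPerMonth = 30): number {
  return requestsPerDay * perRequest * daysPerMonth;
}

const pricing: TokenPricing = { inputPer1k: 0.0005, outputPer1k: 0.0015 };
const perRequest = costPerRequest(2000, 1000, pricing); // about 0.0025
const at10k = monthlyApiCost(10_000, perRequest); // about 750
const at100k = monthlyApiCost(100_000, perRequest); // about 7,500
```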

Compare that to self-hosting: A cloud GPU instance capable of running llama3 8B might cost $0.50 - $1.00/hour. Let's say $0.75/hour.

  • 24/7 operation: $0.75 * 24 * 30 = $540/month.

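You pay a flat fee for the server, irrespective of how many tokens it processes, so the cost per request falls as usage rises, up to the point where the instance runs out of throughput. At $0.0025 per API request, $540/month buys 216,000 requests, or about 7,200 per day; beyond that volume, self-hosting is cheaper as long as the instance can keep up. At 100,000 requests/day with ~1,000 output tokens each, keeping up means sustaining roughly 1,160 output tokens per second, which will likely take more than one GPU instance. Budget for that, and for the engineering time to run the servers. For high-volume AI features, this is usually the more sustainable model.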
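The flat-fee comparison can be checked the same way. The sketch below computes the self-hosted monthly cost, the daily request volume at which it matches the API bill, and the sustained token throughput a given volume implies. All function names are hypothetical, and the inputs are the example figures from this section, not measurements.

```typescript
// Monthly cost in USD of an instance billed by the hour.
function selfHostedMonthlyCost(hourlyRate: number, hoursPerDay = 24, daysPerMonth = 30): number {
  return hourlyRate * hoursPerDay * daysPerMonth;
}

// Daily request volume at which the API bill equals the server bill.
function breakEvenRequestsPerDay(monthlyServerCost: number, apiCostPerRequest: number, daysPerMonth = 30): number {
  return monthlyServerCost / (apiCostPerRequest * daysPerMonth);
}

// Average output tokens per second the server must sustain.
function requiredOutputTokensPerSecond(requestsPerDay: number, outputTokensPerRequest: number): number {
  const secondsPerDay = 24 * 60 * 60;
  return (requestsPerDay * outputTokensPerRequest) / secondsPerDay;
}

const serverCost = selfHostedMonthlyCost(0.75); // 540
const breakEven = breakEvenRequestsPerDay(serverCost, 0.0025); // about 7,200 requests/day
const throughput = requiredOutputTokensPerSecond(100_000, 1000); // about 1,157 tokens/s
```

Comparing the throughput figure with what your hardware actually delivers under concurrent load shows how many instances the flat fee really has to cover.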

Performance & Scalability: Don't Compromise

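Running models locally or at the edge can reduce latency when the model runs close to your users or your inference server sits near your Edge region, but the proxy approach still adds a network hop to that server. Scalability also requires thought: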

  • Edge Function Cold Starts: A cached transformers.js pipeline only helps warm instances. The first request to a new Edge instance must fetch and initialize the model, which can add a noticeable delay, so measure cold-start latency for your model.
  • Ollama Server Scalability: Your dedicated ollama server will eventually become a bottleneck. Scale horizontally by running multiple ollama instances behind a load balancer. Each instance handles requests for a subset of your users.
  • Model Optimization: Use quantized models (e.g., GGUF versions for llama.cpp compatible tools like ollama) to reduce memory footprint and increase inference speed.
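For the horizontal-scaling point, a dedicated load balancer is the usual answer, but the idea can be illustrated inside the proxy itself. The sketch below rotates through a list of upstream URLs; createRoundRobin is a hypothetical helper, the URLs are placeholders, and because module state lives per runtime instance the rotation only spreads load approximately.

```typescript
// Hypothetical helper: returns a function that cycles through the upstreams.
function createRoundRobin(upstreams: readonly string[]): () => string {
  if (upstreams.length === 0) {
    throw new Error('At least one upstream URL is required');
  }
  let next = 0;
  return () => {
    const url = upstreams[next] as string;
    next = (next + 1) % upstreams.length;
    return url;
  };
}

// Placeholder addresses for two Ollama instances.
const pickUpstream = createRoundRobin([
  'http://ollama-1.internal:11434',
  'http://ollama-2.internal:11434',
]);
```

Each request would then call pickUpstream() and send its fetch to the returned base URL instead of a single fixed server address.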

This isn't about cutting corners; it's about building a robust, cost-effective AI foundation from day one.

Want this built for you?

Architecting and implementing performant, cost-efficient AI features for your SaaS MVP is complex. If you're looking to integrate AI capabilities without the heavy lifting or crippling API costs, Makershot specializes in AI Feature Integration. We build these systems from the ground up, ensuring your product is scalable, secure, and profitable. Let's make your AI vision a reality, efficiently.