
Self-Hosting Llama 3 and Mistral LLMs on DomainIndia VPS (Ollama + vLLM)

By Domain India Team · DomainIndia Engineering
6 min read · Published 24 Apr 2026 · Updated 4 Jun 2026 · 170 views

In this article

  1. Why self-host an LLM
  2. What can a DomainIndia VPS run?
  3. Option A: Ollama (easiest)
  4. systemd + nginx front
  5. Option B: vLLM (production throughput)


TL;DR
Self-host open-source LLMs (Llama 3, Mistral, Qwen) on a DomainIndia VPS — keep data private, no per-token costs, tune models to your domain. This guide covers VPS sizing, Ollama for easy setup, vLLM for production throughput, and the CPU vs GPU trade-off for small models.

Why self-host an LLM

Paid APIs (OpenAI, Claude) are fast and capable, but self-hosted models win when:

  • Privacy: customer data, medical records, and legal contracts never leave your VPS
  • Cost at scale: past ~1M tokens/day, self-hosted wins even with GPU rental
  • Customisation: fine-tune on your domain vocabulary
  • Offline / sovereign: no internet required, no US-company dependency
  • Latency: local requests skip the network round trip to a remote API (roughly 50 ms instead of 500 ms of overhead), although token generation on CPU is slower

Trade-offs:

  • Quality gap: open models trail frontier models such as GPT-5 and Claude Opus by roughly 6–12 months
  • Ops burden: you manage updates, downtime, and scaling
  • Upfront work: you have to size the VPS and tune the model yourself

What can a DomainIndia VPS run?

| VPS | RAM | CPU inference model | Speed | Use case |
|---|---|---|---|---|
| Starter (2 GB) | 2 GB | Phi-3-mini (3.8B, 4-bit) | 8–12 tok/sec | Chatbot, classification |
| Business (4 GB) | 4 GB | Mistral 7B (4-bit) | 4–6 tok/sec | Summarisation, Q&A |
| Enterprise (8 GB) | 8 GB | Llama 3 8B (4-bit), Qwen 2.5 7B | 3–5 tok/sec | Production RAG |
| 16 GB+ | 16 GB | Mistral Small 24B (4-bit), Llama 3 70B (2-bit) | 1–3 tok/sec | Advanced reasoning |

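Rule of thumb: a 4-bit quantized model needs roughly (parameter count in billions × 0.6) GB of RAM for its weights. Llama 3 8B fits in ~5 GB; a 70B model needs ~42 GB at 4-bit. Leave headroom on top of that for the OS and the context cache, because a model whose weights fill all of the plan's RAM will swap and slow to a crawl.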
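The rule of thumb above can be turned into a quick sizing check. This is only a sketch: the 0.6 factor comes from the rule itself, while the `headroom_gb` allowance for the OS and context cache is an assumed value you should adjust for your own workload.

```python
def estimate_ram_gb(params_billion: float, gb_per_billion: float = 0.6) -> float:
    """Approximate RAM for a 4-bit quantized model (params in billions x 0.6)."""
    return round(params_billion * gb_per_billion, 1)


def fits(params_billion: float, vps_ram_gb: float, headroom_gb: float = 1.0) -> bool:
    """True if the estimate plus headroom fits in the plan's RAM.

    headroom_gb is an assumed allowance for the OS and context cache,
    not a measured figure.
    """
    return estimate_ram_gb(params_billion) + headroom_gb <= vps_ram_gb


if __name__ == "__main__":
    for name, size in [("Mistral 7B", 7), ("Llama 3 8B", 8), ("Llama 3 70B", 70)]:
        print(f"{name}: ~{estimate_ram_gb(size)} GB of RAM at 4-bit")
```

Treat the result as a first filter only; actual usage depends on the quantization variant and the context length you configure.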

Insight

CPU inference is slower but completely viable for many use cases. A chatbot that responds in 3 seconds instead of 1 second is still usable. For batch workloads (nightly summarisation, classification), speed matters less.

Option A — Ollama (easiest)

Ollama wraps model downloading, serving, and a REST API in one binary.

  1. SSH into the VPS as root
  2. Install:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

  3. Download a model:

```bash
ollama pull llama3.2:3b   # 2 GB, fast
ollama pull mistral:7b    # 4 GB, quality
ollama pull qwen2.5:14b   # ~9 GB, strong (4-bit quantized by default)
```

  4. Test:

```bash
ollama run mistral "Explain how TCP handshake works in 3 sentences."
```

  5. Expose as REST API (Ollama auto-serves on localhost:11434):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Summarise: ..."
}'
```
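Unless the request sets `"stream": false`, `/api/generate` replies with newline-delimited JSON objects, each carrying a fragment of text in its `response` field and a `done` flag on the last one. A minimal sketch of reading that stream from Python with only the standard library follows; the function names are illustrative, and the default model and URL simply mirror the curl call above.

```python
import json
import urllib.request
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the 'response' fragments of a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(prompt: str, model: str = "mistral",
             url: str = "http://localhost:11434/api/generate") -> str:
    """POST a prompt to a local Ollama server and return the full text."""
    body = json.dumps({"model": model, "prompt": prompt}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        return join_stream(resp)  # the response object iterates line by line
```

Calling `generate("Summarise: ...")` against a running Ollama instance returns the whole completion as one string; for a chat UI you would instead yield each fragment as it arrives.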

systemd + nginx front

The install script sets up /etc/systemd/system/ollama.service automatically. To expose the API externally, put nginx in front:

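```nginx
server {
    listen 443 ssl;
    server_name llm.yourcompany.com;

    # TLS certificate paths are placeholders; point them at your own files
    ssl_certificate     /etc/letsencrypt/live/llm.yourcompany.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.yourcompany.com/privkey.pem;

    # Auth: protect the endpoint!
    location / {
        auth_basic "LLM";
        auth_basic_user_file /etc/nginx/.htpasswd-llm;
        proxy_pass http://127.0.0.1:11434;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;        # pass streamed tokens through as they arrive
        proxy_read_timeout 300s;
    }
}
```
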
Warning

Never expose Ollama to the internet without authentication. Bots scan for open endpoints, run their own prompts on your CPU or GPU, and drain your bandwidth. Always put nginx with basic auth or JWT validation in front.
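Once basic auth is in place, clients have to send an `Authorization` header with every request. The sketch below builds such a request with the standard library; the hostname is the placeholder from the nginx config, and the function names and credentials are illustrative assumptions.

```python
import base64
import json
import urllib.request


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"


def build_request(prompt: str, user: str, password: str,
                  base_url: str = "https://llm.yourcompany.com") -> urllib.request.Request:
    """Prepare an authenticated, non-streaming /api/generate request."""
    body = json.dumps({"model": "mistral", "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth_header(user, password),
        },
    )
```

Pass the result to `urllib.request.urlopen(...)` to send it. Basic auth credentials are only base64-encoded, not encrypted, so this is safe only over the HTTPS listener shown above.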

Option B — vLLM (production throughput)

vLLM is built to serve many concurrent requests efficiently using continuous batching, which gives it much higher throughput than Ollama at scale.

A GPU is strongly recommended: vLLM has a CPU backend, but it is rarely used in production.

```bash
pip install vllm

# Start server: OpenAI-compatible API
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --port 8000 \
    --max-model-len 8192 \
    --dtype auto
```

Then call with any OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```

Drop-in compatible: the rest of your app doesn't need to know the model is self-hosted.
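Continuous batching only pays off when the client actually sends requests concurrently. One way to do that, sketched here under the assumption that you supply your own async `send` function, is to fan prompts out behind a semaphore so the server sees a bounded number of in-flight requests. The `fan_out` helper and its `max_in_flight` default are illustrative, not part of vLLM.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fan_out(prompts: List[str],
                  send: Callable[[str], Awaitable[str]],
                  max_in_flight: int = 16) -> List[str]:
    """Send all prompts concurrently, at most max_in_flight at a time.

    Results come back in the same order as the prompts.
    """
    gate = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> str:
        async with gate:
            return await send(prompt)

    return list(await asyncio.gather(*(one(p) for p in prompts)))


# Example `send` using the OpenAI SDK's async client (requires `pip install openai`):
#
#   from openai import AsyncOpenAI
#   client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
#
#   async def send(prompt: str) -> str:
#       resp = await client.chat.completions.create(
#           model="meta-llama/Llama-3.2-3B-Instruct",
#           messages=[{"role": "user", "content": prompt}],
#       )
#       return resp.choices[0].message.content
```

Keeping `send` injectable means the same helper works against Ollama or a paid API, and it can be exercised with a stub in tests.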

Option C — llama.cpp (bare metal, CPU-optimised)

For squeezing every bit of CPU performance:

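```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Current llama.cpp builds with CMake; native CPU features (AVX2 etc.) are detected by default
cmake -B build
cmake --build build --config Release -j "$(nproc)"

# Download a GGUF model
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Serve on localhost only; put nginx or a gateway in front for outside access
./build/bin/llama-server -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -c 4096 --host 127.0.0.1 --port 8080
```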

Lowest overhead, highest tokens/sec on CPU. Ollama actually wraps llama.cpp — use Ollama for convenience, bare llama.cpp when you need max performance.

Picking a model (2026 options)

| Model | Size | Strengths | Weaknesses |
|---|---|---|---|
| Llama 3.2 3B | 2 GB (4-bit) | Fast, good English | Weaker at code, non-English |
| Mistral 7B | 4 GB | Balanced, good reasoning | Mediocre Indian languages |
| Qwen 2.5 7B | 5 GB | Excellent multilingual, code | Larger than Mistral |
| Llama 3 8B | 5 GB | Best general-purpose | Same tier as Mistral+Qwen |
| Phi-3.5 mini 3.8B | 2.5 GB | Tiny + punches above weight | Limited factual knowledge |
| Nemotron 70B (2-bit) | 16 GB | GPT-4-class quality | Slow, large |

For Indian language support: Qwen 2.5 or Sarvam models (built in India, with native Hindi/Tamil/Telugu support).

Production patterns

Pattern 1 — API key gateway

Don't expose your LLM server directly. Put your own API in front that:

  • Validates API keys (not basic auth)
  • Rate limits per key
  • Logs usage for billing
  • Falls back to paid API if self-hosted is down
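
```python
# FastAPI gateway example
import time

import httpx
import redis.asyncio as aioredis
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()
# Assumes a Redis instance on localhost holding a set named 'valid_keys'
redis = aioredis.Redis(host="localhost", port=6379, decode_responses=True)

async def validate_key(x_api_key: str = Header()):
    if not await redis.sismember("valid_keys", x_api_key):
        raise HTTPException(401, "Invalid key")
    return x_api_key

@app.post("/v1/chat/completions")
async def chat(req: dict, api_key: str = Depends(validate_key)):
    # Rate limit: 60 requests per key per minute (fixed window)
    now_minute = int(time.time() // 60)
    bucket = f"ratelimit:{api_key}:{now_minute}"
    count = await redis.incr(bucket)
    if count == 1:
        await redis.expire(bucket, 120)  # let old windows age out
    if count > 60:
        raise HTTPException(429, "Rate limit")

    # Forward to local LLM (Ollama's OpenAI-compatible endpoint)
    async with httpx.AsyncClient(timeout=300) as client:
        resp = await client.post("http://localhost:11434/v1/chat/completions", json=req)
    return resp.json()
```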
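The gateway example covers key validation and rate limiting; the "logs usage for billing" item can be sketched along the same lines. This assumes the backend returns an OpenAI-style `usage` object with a `total_tokens` field, and the in-memory `ledger` dictionary is a stand-in for whatever database or Redis hash you would really use.

```python
from typing import Dict


def record_usage(ledger: Dict[str, int], api_key: str, response_json: dict) -> int:
    """Add the tokens used by one response to the running total for api_key.

    Returns the number of tokens recorded for this response (0 if the
    backend did not report usage).
    """
    usage = response_json.get("usage") or {}
    tokens = int(usage.get("total_tokens", 0))
    ledger[api_key] = ledger.get(api_key, 0) + tokens
    return tokens


if __name__ == "__main__":
    ledger: Dict[str, int] = {}
    fake_response = {"usage": {"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42}}
    record_usage(ledger, "customer-a", fake_response)
    print(ledger)
```

In the gateway you would call `record_usage(...)` on the parsed JSON just before returning it to the caller.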

Pattern 2 — Model warmup

Cold start on a model is 10–30 seconds. Keep it loaded:

```bash
# Ping every 2 minutes to keep model in RAM
*/2 * * * * curl -s http://localhost:11434/api/generate -d '{"model":"mistral","prompt":"hi","stream":false}' > /dev/null
```
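An alternative to pinging with a dummy prompt is to send a request with no prompt at all, which asks Ollama to load the model without generating anything, and to set `keep_alive` on that request so the model stays resident longer. The sketch below only builds the request; the helper name and the 30-minute default are assumptions.

```python
import json
import urllib.request


def warmup_request(model: str, keep_alive: str = "30m",
                   base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a prompt-less /api/generate request that loads a model into RAM.

    keep_alive controls how long Ollama keeps the model loaded after this call.
    """
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode()
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
```

Send it with `urllib.request.urlopen(warmup_request("mistral"), timeout=120)` from a startup script or a less frequent cron job.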

Pattern 3 — Fallback chain

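```python
import httpx

async def chat_with_fallback(prompt: str) -> str:
    # ollama_chat / openai_chat are your own async wrappers around each API
    try:
        return await ollama_chat(prompt)   # self-hosted, free
    except (TimeoutError, httpx.HTTPError):
        return await openai_chat(prompt)   # paid fallback
```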

Free by default, pay only during peaks or outages.
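A fallback is only useful if the self-hosted call is cut off when it hangs. The following sketch adds an explicit timeout and keeps both backends injectable; `with_fallback`, its 30-second default, and the default exception tuple are illustrative choices, and you would extend `retry_on` with your HTTP client's own error types.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type


async def with_fallback(
    prompt: str,
    primary: Callable[[str], Awaitable[str]],
    fallback: Callable[[str], Awaitable[str]],
    timeout_s: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (asyncio.TimeoutError, OSError),
) -> str:
    """Try the primary backend; on timeout or a listed error, use the fallback."""
    try:
        return await asyncio.wait_for(primary(prompt), timeout=timeout_s)
    except retry_on:
        return await fallback(prompt)
```

Errors outside `retry_on` (for example a bug in your own code) still propagate, so a broken primary does not silently shift all traffic to the paid API.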

Common pitfalls

FAQ

Q Do I need a GPU?

No — CPU works for small models (3B–7B). Speed is 3–12 tok/sec which is usable for most chatbots. GPU only if you need >20 tok/sec or big models (>13B).

Q How much does a self-hosted LLM cost vs OpenAI?

Break-even for Mistral 7B + typical usage is around 500K–1M requests/month. Below that, paid APIs win. Above, self-hosted wins — especially with Enterprise VPS (₹10,000/mo flat fee vs $500+ OpenAI spend).
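The break-even point is simple arithmetic: divide the flat monthly fee by what one request would cost on a paid API. The sketch below uses the ₹10,000 flat fee from the answer above; the per-request API cost is a made-up input for illustration, so substitute your own measured figure.

```python
def break_even_requests(flat_monthly_fee: float, api_cost_per_request: float) -> float:
    """Requests per month above which the flat fee is cheaper than per-request billing."""
    if api_cost_per_request <= 0:
        raise ValueError("api_cost_per_request must be positive")
    return flat_monthly_fee / api_cost_per_request


if __name__ == "__main__":
    fee = 10_000.0          # flat VPS fee in rupees per month (from the article)
    cost_per_request = 0.02  # hypothetical paid-API cost in rupees per request
    print(f"Break-even at about {break_even_requests(fee, cost_per_request):,.0f} requests/month")
```

Because the result scales inversely with the per-request cost, longer prompts and responses move the break-even point lower.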

Q Can I fine-tune on my data?

Yes — LoRA fine-tuning runs on CPU (slow, ~days) or GPU (hours). Start with RAG before fine-tuning — cheaper, often enough.

Q Llama 3 or Mistral or Qwen?

Llama 3 for general English. Mistral 7B for balanced. Qwen 2.5 for multilingual (including Hindi, Tamil). All open-weights.

Q Can Ollama serve multiple models?

Yes — pre-pull them, keep the one you use most warm. Ollama auto-unloads idle models after 5 min. Tune via OLLAMA_KEEP_ALIVE=30m.

