Self-Hosting Llama 3 and Mistral LLMs on DomainIndia VPS (Ollama + vLLM)

By Domain India Team · DomainIndia Engineering

6 min read · Published 24 Apr 2026 · Updated 14 Jul 2026 · 592 views


TL;DR

Self-host open-source LLMs (Llama 3, Mistral, Qwen) on a DomainIndia VPS — keep data private, no per-token costs, tune models to your domain. This guide covers VPS sizing, Ollama for easy setup, vLLM for production throughput, and the CPU vs GPU trade-off for small models.

Why self-host an LLM

Paid APIs (OpenAI, Claude) are fast and capable, but self-hosted models win when:

Privacy — customer data, medical records, legal contracts never leave your VPS
Cost at scale — past ~1M tokens/day, self-hosted wins even with GPU rental
Customisation — fine-tune on your domain vocabulary
Offline / sovereign — no internet required, no US-company dependency
Latency — no network round trip to an external API (roughly 50 ms locally versus 500 ms), although generation speed still depends on your hardware

Trade-offs:

Quality gap — open models are 6–12 months behind GPT-5 / Claude Opus
Ops burden — you manage updates, downtime, scaling
Upfront work to size + tune

What can a DomainIndia VPS run?

VPS	RAM	CPU inference model	Speed	Use case
Starter (2 GB)	2 GB	Phi-3-mini (3.8B, 4-bit)	8–12 tok/sec	Chatbot, classification
Business (4 GB)	4 GB	Mistral 7B (4-bit)	4–6 tok/sec	Summarisation, Q&A
Enterprise (8 GB)	8 GB	Llama 3 8B (4-bit), Qwen 2.5 7B	3–5 tok/sec	Production RAG
16 GB+	16 GB	Mistral Small 24B (4-bit), Llama 3 70B (2-bit)	1–3 tok/sec	Advanced reasoning

Rule of thumb: a 4-bit quantized model needs roughly (parameter count in billions × 0.6) GB of RAM. Llama 3 8B fits in ~5 GB; a 70B model needs about 42 GB.
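The rule of thumb above can be turned into a quick sizing check. This is a minimal sketch: the function names and the 1 GB headroom for the operating system and context cache are assumptions for illustration, not measured values.

```python
def estimate_ram_gb(params_billion: float, gb_per_billion: float = 0.6) -> float:
    """Approximate RAM for a 4-bit quantized model (rule of thumb: 0.6 GB per billion params)."""
    return params_billion * gb_per_billion


def fits(params_billion: float, vps_ram_gb: float, headroom_gb: float = 1.0) -> bool:
    """True if the model plus an assumed headroom fits in the VPS RAM."""
    return estimate_ram_gb(params_billion) + headroom_gb <= vps_ram_gb


# Print the estimate for a few model sizes (in billions of parameters)
for name, size in [("Llama 3.2 3B", 3), ("Mistral 7B", 7), ("Llama 3 8B", 8), ("Llama 3 70B", 70)]:
    print(f"{name}: ~{estimate_ram_gb(size):.1f} GB")
```

Treat the result as a lower bound: long context windows and concurrent requests add memory on top of the model weights.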

Insight

CPU inference is slower but completely viable for many use cases. A chatbot that responds in 3 seconds instead of 1 second is still usable. For batch workloads (nightly summarisation, classification), speed matters less.
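To judge whether CPU speed is acceptable for a workload, a rough back-of-envelope calculation helps. The sketch below only divides output tokens by generation speed and ignores prompt processing time; the function names and the example workload (500 summaries of 150 tokens each) are hypothetical.

```python
def response_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate one reply, ignoring prompt processing."""
    return output_tokens / tokens_per_sec


def batch_hours(documents: int, tokens_per_document: int, tokens_per_sec: float) -> float:
    """Total hours for a sequential batch job at a steady generation speed."""
    return documents * tokens_per_document / tokens_per_sec / 3600


# A short 12-token chatbot reply at 4 tok/sec
print(response_seconds(12, 4))
# A hypothetical nightly batch: 500 summaries of 150 tokens at 4 tok/sec
print(round(batch_hours(500, 150, 4), 2))
```

If the batch estimate fits inside your overnight window, CPU inference is probably good enough for that job.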

Option A — Ollama (easiest)

Ollama wraps model downloading, serving, and a REST API in one binary.

SSH into VPS as root
Install:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

Download a model:

```bash
ollama pull llama3.2:3b   # ~2 GB, fast
ollama pull mistral:7b    # ~4 GB, quality
ollama pull qwen2.5:14b   # ~9 GB, strong (default tag is 4-bit quantized)
```

Test:

```bash
ollama run mistral "Explain how TCP handshake works in 3 sentences."
```

Expose as REST API (Ollama auto-serves on localhost:11434):

```bash
# "stream": false returns one JSON object instead of a stream of chunks
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Summarise: ...",
  "stream": false
}'
```
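The same call can be made from application code. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it assumes Ollama is listening on its default local port with the `mistral` model already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the generated text from the 'response' field."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage (needs a running Ollama server):
#   print(generate("mistral", "Summarise: ..."))
```

The long default timeout is deliberate, since CPU inference can take a while on longer prompts.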

systemd + nginx front

/etc/systemd/system/ollama.service is auto-installed by the setup script. To expose externally:

nginx

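server {
    listen 443 ssl;
    server_name llm.yourcompany.com;

    # TLS certificate paths are placeholders; point them at your own certificate
    ssl_certificate     /etc/letsencrypt/live/llm.yourcompany.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.yourcompany.com/privkey.pem;

    # Auth: protect the endpoint!
    location / {
        auth_basic "LLM";
        auth_basic_user_file /etc/nginx/.htpasswd-llm;
        proxy_pass http://127.0.0.1:11434;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
    }
}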

Warning

Never expose Ollama to the internet without authentication. Bots scan for open endpoints, run their own prompts on your CPU or GPU, and drain your bandwidth. Always put nginx with basic auth or JWT validation in front.
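Once basic auth is in place, clients must send an `Authorization` header with every request. Here is a small sketch of how that header is built; the function name and the example credentials are hypothetical.

```python
import base64


def basic_auth_header(user: str, password: str) -> dict:
    """Return the HTTP Basic Authorization header for the given credentials."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Example with placeholder credentials; use the ones stored in your htpasswd file
print(basic_auth_header("user", "pass"))
```

Basic auth only encodes the credentials, it does not encrypt them, so it is safe only over the HTTPS listener.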

Option B — vLLM (production throughput)

vLLM is built for serving many concurrent requests efficiently using continuous batching, and it delivers much higher throughput than Ollama at scale.

A GPU is strongly recommended: vLLM has a CPU backend, but it is rarely used in production.

bash

pip install vllm

# Start server — OpenAI-compatible API
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --port 8000 \
    --max-model-len 8192 \
    --dtype auto

Then call with any OpenAI SDK:

python

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)

Drop-in compatible — rest of your app doesn't know it's self-hosted.

Option C — llama.cpp (bare metal, CPU-optimised)

For squeezing every bit of CPU performance:

bash

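git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Current llama.cpp builds with CMake; native CPU features such as AVX2 are detected automatically
cmake -B build
cmake --build build --config Release -j $(nproc)

# Download a GGUF model
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Serve on localhost only; put nginx with auth in front for external access
./build/bin/llama-server -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -c 4096 --host 127.0.0.1 --port 8080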

Lowest overhead, highest tokens/sec on CPU. Ollama actually wraps llama.cpp — use Ollama for convenience, bare llama.cpp when you need max performance.

Picking a model (2026 options)

Model	Size	Strengths	Weaknesses
Llama 3.2 3B	2 GB (4-bit)	Fast, good English	Weaker at code, non-English
Mistral 7B	4 GB	Balanced, good reasoning	Mediocre Indian languages
Qwen 2.5 7B	5 GB	Excellent multilingual, code	Larger than Mistral
Llama 3 8B	5 GB	Best general-purpose	Same tier as Mistral+Qwen
Phi-3.5 mini 3.8B	2.5 GB	Tiny + punches above weight	Limited factual knowledge
Nemotron 70B (2-bit)	16 GB	GPT-4-class quality	Slow, large

For Indian language support: Qwen 2.5 or Sarvam models (built by the Indian startup Sarvam AI with native Hindi/Tamil/Telugu support).

Production patterns

Pattern 1 — API key gateway

Don't expose your LLM server directly. Put your own API in front that:

Validates API keys (not basic auth)
Rate limits per key
Logs usage for billing
Falls back to paid API if self-hosted is down

python

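# FastAPI gateway example
import time

import httpx
import redis.asyncio as aioredis
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()
# Assumes a local Redis instance holding a set named 'valid_keys'
redis = aioredis.from_url("redis://localhost:6379", decode_responses=True)

async def validate_key(x_api_key: str = Header()):
    if not await redis.sismember('valid_keys', x_api_key):
        raise HTTPException(401, 'Invalid key')
    return x_api_key

@app.post("/v1/chat/completions")
async def chat(req: dict, api_key: str = Depends(validate_key)):
    # Rate limit: at most 60 requests per key per minute
    now_minute = int(time.time() // 60)
    key = f'ratelimit:{api_key}:{now_minute}'
    count = await redis.incr(key)
    if count == 1:
        await redis.expire(key, 120)  # let old counters expire on their own
    if count > 60:
        raise HTTPException(429, 'Rate limit')

    # Forward to local LLM (long timeout because CPU inference is slow)
    async with httpx.AsyncClient(timeout=300) as client:
        resp = await client.post("http://localhost:11434/v1/chat/completions", json=req)
    return resp.json()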

Pattern 2 — Model warmup

Cold start on a model is 10–30 seconds. Keep it loaded:

bash

# Ping every 2 minutes to keep model in RAM
*/2 * * * * curl -s http://localhost:11434/api/generate -d '{"model":"mistral","prompt":"hi","stream":false}' > /dev/null
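As an alternative to generating a throwaway reply, Ollama's generate endpoint accepts a `keep_alive` field, and a request without a prompt loads the model without producing text. The sketch below builds such a request; the helper name and the 30-minute value are assumptions, and the behaviour is worth verifying against the Ollama version you run.

```python
import json
import urllib.request


def build_warmup_request(model: str, keep_alive: str = "30m",
                         base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a request that loads `model` and asks Ollama to keep it in RAM for `keep_alive`."""
    payload = {"model": model, "keep_alive": keep_alive}  # no prompt: load only
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (needs a running Ollama server):
#   urllib.request.urlopen(build_warmup_request("mistral"), timeout=120).read()
```

This avoids spending CPU time on a dummy generation every two minutes.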

Pattern 3 — Fallback chain

python

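import httpx

async def chat_with_fallback(prompt: str) -> str:
    # ollama_chat and openai_chat are your own async helper functions
    try:
        return await ollama_chat(prompt)   # self-hosted, free
    except (TimeoutError, httpx.HTTPError):
        return await openai_chat(prompt)   # paid fallback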

Free by default, pay only during peaks or outages.
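The fallback chain can be exercised end to end with stand-in backends. In this sketch the three stub coroutines are hypothetical placeholders for real clients, and the timeout value is an assumption to tune for your own latency budget.

```python
import asyncio


async def with_fallback(primary, fallback, prompt: str, timeout: float = 30.0) -> str:
    """Try the primary backend; on timeout or connection failure, use the fallback."""
    try:
        return await asyncio.wait_for(primary(prompt), timeout=timeout)
    except (asyncio.TimeoutError, OSError):  # OSError covers ConnectionError
        return await fallback(prompt)


# Stand-in backends for demonstration only
async def healthy_local(prompt: str) -> str:
    return f"[local] {prompt}"


async def broken_local(prompt: str) -> str:
    raise ConnectionError("local model is down")


async def paid_api(prompt: str) -> str:
    return f"[paid] {prompt}"


print(asyncio.run(with_fallback(healthy_local, paid_api, "hi")))
print(asyncio.run(with_fallback(broken_local, paid_api, "hi")))
```

Logging every fallback is worthwhile in practice, because a silently failing local model turns "free by default" into a steady paid bill.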


FAQ

Q Do I need a GPU?

No — CPU works for small models (3B–7B). Speed is 3–12 tok/sec which is usable for most chatbots. GPU only if you need >20 tok/sec or big models (>13B).

Q How much does a self-hosted LLM cost vs OpenAI?

Break-even for Mistral 7B + typical usage is around 500K–1M requests/month. Below that, paid APIs win. Above, self-hosted wins — especially with Enterprise VPS (₹10,000/mo flat fee vs $500+ OpenAI spend).
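The break-even point is simple arithmetic once you know your own API cost per request. The sketch below uses the ₹10,000/mo flat fee from the answer above; the per-request API cost of ₹0.02 is a hypothetical figure, so substitute a number from your own billing data.

```python
def break_even_requests(flat_monthly_fee: float, api_cost_per_request: float) -> float:
    """Monthly request count at which a flat-fee VPS and a paid API cost the same."""
    return flat_monthly_fee / api_cost_per_request


def cheaper_option(requests_per_month: int, flat_monthly_fee: float,
                   api_cost_per_request: float) -> str:
    """Compare the monthly API bill against the flat VPS fee."""
    api_total = requests_per_month * api_cost_per_request
    return "self-hosted" if api_total > flat_monthly_fee else "paid API"


VPS_FEE_INR = 10_000          # flat monthly fee quoted above
ASSUMED_API_COST_INR = 0.02   # hypothetical average cost per request

print(break_even_requests(VPS_FEE_INR, ASSUMED_API_COST_INR))
print(cheaper_option(1_000_000, VPS_FEE_INR, ASSUMED_API_COST_INR))
```

The comparison leaves out engineering time for operating the server, which pushes the real break-even point higher.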

Q Can I fine-tune on my data?

Yes — LoRA fine-tuning runs on CPU (slow, ~days) or GPU (hours). Start with RAG before fine-tuning — cheaper, often enough.

Q Llama 3 or Mistral or Qwen?

Llama 3 for general English work, Mistral 7B for a balance of speed and reasoning quality, Qwen 2.5 for multilingual use (including Hindi and Tamil). All three are open-weights.

Q Can Ollama serve multiple models?

Yes — pre-pull them, keep the one you use most warm. Ollama auto-unloads idle models after 5 min. Tune via OLLAMA_KEEP_ALIVE=30m.
