Self-Hosting Llama 3 and Mistral LLMs on DomainIndia VPS (Ollama + vLLM)
Why self-host an LLM
Paid APIs (OpenAI, Claude) are fast and capable, but self-hosted models win on:
- Privacy — customer data, medical records, legal contracts never leave your VPS
- Cost at scale — past ~1M tokens/day, self-hosted wins even with GPU rental
- Customisation — fine-tune on your domain vocabulary
- Offline / sovereign — no internet required, no US-company dependency
- Latency — no network round trip to a third-party API, although on CPU the generation time itself usually dominates
Trade-offs:
- Quality gap — open models are 6–12 months behind GPT-5 / Claude Opus
- Ops burden — you manage updates, downtime, scaling
- Upfront work to size the VPS and tune the model
What can a DomainIndia VPS run?
| VPS | RAM | CPU inference model | Speed | Use case |
|---|---|---|---|---|
| Starter (2 GB) | 2 GB | Phi-3-mini (3.8B, 4-bit) | 8–12 tok/sec | Chatbot, classification |
| Business (4 GB) | 4 GB | Mistral 7B (4-bit) | 4–6 tok/sec | Summarisation, Q&A |
| Enterprise (8 GB) | 8 GB | Llama 3 8B (4-bit), Qwen 2.5 7B | 3–5 tok/sec | Production RAG |
| 16 GB+ | 16 GB | Mistral Small 24B (4-bit), Llama 3 70B (2-bit) | 1–3 tok/sec | Advanced reasoning |
Rule of thumb: a 4-bit quantized model needs roughly (parameter count in billions × 0.6) GB of RAM. By that rule, Llama 3 8B fits in about 5 GB, while a 4-bit 70B model needs around 42 GB.
CPU inference is slower but completely viable for many use cases. A chatbot that responds in 3 seconds instead of 1 second is still usable. For batch workloads (nightly summarisation, classification), speed matters less.
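The rule of thumb above is easy to turn into a quick sizing check. The sketch below is only that arithmetic: the function names and the 1 GB of headroom reserved for the operating system are assumptions made for illustration, not measured values.

```python
GB_PER_BILLION_PARAMS_4BIT = 0.6  # rule of thumb quoted in this article


def estimate_ram_gb(params_billion: float) -> float:
    """Approximate RAM needed by a 4-bit quantized model."""
    return params_billion * GB_PER_BILLION_PARAMS_4BIT


def fits(params_billion: float, vps_ram_gb: float, headroom_gb: float = 1.0) -> bool:
    """True if the model fits while leaving headroom for the OS (assumed 1 GB)."""
    return estimate_ram_gb(params_billion) + headroom_gb <= vps_ram_gb


for name, size in [("Llama 3 8B", 8), ("Llama 3 70B", 70)]:
    print(f"{name}: ~{estimate_ram_gb(size):.1f} GB")
```

Treat the result as a first filter only. Context length and concurrent requests add memory on top of the model weights.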
Option A — Ollama (easiest)
Ollama wraps model downloading, serving, and a REST API in one binary.
- SSH into VPS as root
- Install:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
- Download a model:
```bash
ollama pull llama3.2:3b # 2 GB, fast
ollama pull mistral:7b # 4 GB, quality
ollama pull qwen2.5:14b # ~9 GB, strong
```
- Test:
```bash
ollama run mistral "Explain how TCP handshake works in 3 sentences."
```
- Call the REST API (Ollama serves on localhost:11434 automatically):
```bash
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Summarise: ..."
}'
```
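The same call can be made from application code. This is a minimal sketch using only the Python standard library; the helper names `build_request` and `generate` are illustrative, and it assumes the Ollama service is running locally with the `mistral` model already pulled. Setting `"stream": false` asks Ollama for one JSON object instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the service up, `generate("mistral", "Summarise: ...")` returns the model's answer as a string. The generous timeout is there because CPU inference can take a while.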
systemd + nginx front
/etc/systemd/system/ollama.service is auto-installed by the setup script. To expose externally:
```nginx
server {
    listen 443 ssl;
    server_name llm.yourcompany.com;
    # ssl_certificate and ssl_certificate_key directives for this host go here

    # Auth: protect the endpoint!
    location / {
        auth_basic "LLM";
        auth_basic_user_file /etc/nginx/.htpasswd-llm;
        proxy_pass http://127.0.0.1:11434;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
    }
}
```
Never expose Ollama unauthenticated to the internet. Bots scan for open endpoints, run their own workloads on your hardware, and drain your bandwidth. Always put nginx with basic auth or JWT validation in front.
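Once basic auth sits in front of Ollama, every client has to send an `Authorization` header. The sketch below shows one way to build that request with the standard library. The helper names are hypothetical and the credentials are placeholders; the hostname is the example one used in the nginx config.

```python
import base64
import json
import urllib.request


def basic_auth_header(user: str, password: str) -> str:
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_request(base_url: str, user: str, password: str, prompt: str) -> urllib.request.Request:
    """Build an authenticated, non-streaming request to /api/generate."""
    body = json.dumps({"model": "mistral", "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth_header(user, password),
        },
        method="POST",
    )
```

A call such as `urllib.request.urlopen(build_request("https://llm.yourcompany.com", "user", "pass", "Hello"), timeout=300)` would then go through nginx. Basic auth only protects the credentials when the connection is HTTPS.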
Option B — vLLM (production throughput)
vLLM is built for serving many concurrent requests efficiently with continuous batching. At scale it delivers much higher throughput than Ollama.
Ideally it runs on a GPU (vLLM has CPU support, but that is rarely used in production).
```bash
pip install vllm
# Start server (OpenAI-compatible API)
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --port 8000 \
  --max-model-len 8192 \
  --dtype auto
```
Then call with any OpenAI SDK:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
```
Drop-in compatible: the rest of your app doesn't know it's self-hosted.
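Continuous batching only pays off when requests actually arrive concurrently. The sketch below shows one way a client might fan out several prompts in parallel so the server has something to batch. The helper names are hypothetical, and `send` is any function you supply that takes a request body and returns the reply text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def run_concurrently(
    prompts: List[str],
    send: Callable[[dict], str],
    model: str = "meta-llama/Llama-3.2-3B-Instruct",
    max_workers: int = 8,
) -> List[str]:
    """Send all prompts in parallel; results come back in prompt order."""
    payloads = [chat_payload(model, p) for p in prompts]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, payloads))
```

With the OpenAI client shown earlier, `send` could be `lambda body: client.chat.completions.create(**body).choices[0].message.content`. The worker count of 8 is an arbitrary starting point, not a tuned value.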
Option C — llama.cpp (bare metal, CPU-optimised)
For squeezing every bit of CPU performance:
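```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Recent releases build with CMake; AVX2 is detected automatically on native builds
cmake -B build
cmake --build build --config Release -j $(nproc)
# Download a GGUF model
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
# Serve on localhost only; put an authenticated proxy in front for external access
./build/bin/llama-server -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -c 4096 --host 127.0.0.1 --port 8080
```
This gives the lowest overhead and the highest tokens/sec on CPU. Ollama itself wraps llama.cpp, so use Ollama for convenience and bare llama.cpp when you need maximum performance.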
Picking a model (2026 options)
| Model | Size | Strengths | Weaknesses |
|---|---|---|---|
| Llama 3.2 3B | 2 GB (4-bit) | Fast, good English | Weaker at code, non-English |
| Mistral 7B | 4 GB | Balanced, good reasoning | Mediocre Indian languages |
| Qwen 2.5 7B | 5 GB | Excellent multilingual, code | Larger than Mistral |
| Llama 3 8B | 5 GB | Best general-purpose | Same tier as Mistral+Qwen |
| Phi-3.5 mini 3.8B | 2.5 GB | Tiny + punches above weight | Limited factual knowledge |
| Nemotron 70B (2-bit) | 16 GB | GPT-4-class quality | Slow, large |
For Indian language support: Qwen 2.5, or the models from Sarvam AI, which are built for Indian languages such as Hindi, Tamil, and Telugu.
Production patterns
Pattern 1 — API key gateway
Don't expose your LLM server directly. Put your own API in front that:
- Validates API keys (not basic auth)
- Rate limits per key
- Logs usage for billing
- Falls back to paid API if self-hosted is down
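To make the per-key rate limit in the list above concrete, here is a minimal in-memory sketch of a fixed-window counter. It is an illustration only: the class name and defaults are hypothetical, old buckets are never cleaned up, and a real gateway would more likely keep the counters in a shared store so they survive restarts and work across several worker processes.

```python
import time
from collections import defaultdict
from typing import Optional


class FixedWindowLimiter:
    """Allow at most `limit` requests per key in each `window`-second bucket."""

    def __init__(self, limit: int = 60, window: int = 60):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (key, bucket index) -> request count

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        """Count this request and report whether it is within the limit."""
        now = time.time() if now is None else now
        bucket = int(now // self.window)
        self.counts[(key, bucket)] += 1
        return self.counts[(key, bucket)] <= self.limit
```

A fixed window is the simplest scheme, but it lets a client send up to twice the limit across a window boundary. If that matters, a sliding window or token bucket is the usual next step.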
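```python
# FastAPI gateway example
import time

import httpx
import redis.asyncio as aioredis
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()
# Assumes a local Redis instance holding a set named 'valid_keys'
redis = aioredis.Redis(host="localhost", port=6379, decode_responses=True)


async def validate_key(x_api_key: str = Header()):
    if not await redis.sismember("valid_keys", x_api_key):
        raise HTTPException(401, "Invalid key")
    return x_api_key


@app.post("/v1/chat/completions")
async def chat(req: dict, api_key: str = Depends(validate_key)):
    # Rate limit: 60 requests per key per minute
    now_minute = int(time.time() // 60)
    bucket = f"ratelimit:{api_key}:{now_minute}"
    count = await redis.incr(bucket)
    if count == 1:
        await redis.expire(bucket, 120)  # let old buckets expire
    if count > 60:
        raise HTTPException(429, "Rate limit")
    # Forward to local LLM (long timeout, since CPU inference is slow)
    async with httpx.AsyncClient(timeout=300) as client:
        resp = await client.post("http://localhost:11434/v1/chat/completions", json=req)
    return resp.json()
```
Pattern 2 — Model warmup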
Cold start on a model is 10–30 seconds. Keep it loaded:
```
# Ping every 2 minutes to keep model in RAM
*/2 * * * * curl -s http://localhost:11434/api/generate -d '{"model":"mistral","prompt":"hi","stream":false}' > /dev/null
```
Pattern 3 — Fallback chain
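```python
import httpx

# ollama_chat and openai_chat are your own async helpers
async def chat(prompt: str) -> str:
    try:
        return await ollama_chat(prompt)  # self-hosted, free
    except (TimeoutError, httpx.HTTPError):
        return await openai_chat(prompt)  # paid fallback
```
Free by default, pay only during peaks or outages.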
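A slightly fuller sketch of the same idea, with an explicit timeout so a stalled local model does not hold the request open indefinitely. The function name, the `ChatFn` type alias, and the 30-second default are assumptions for illustration; the two backends are passed in as async callables, so the snippet does not depend on any particular client library.

```python
import asyncio
from typing import Awaitable, Callable

ChatFn = Callable[[str], Awaitable[str]]


async def chat_with_fallback(
    prompt: str,
    primary: ChatFn,
    fallback: ChatFn,
    timeout: float = 30.0,
) -> str:
    """Try the self-hosted model first; use the paid API on timeout or I/O error."""
    try:
        return await asyncio.wait_for(primary(prompt), timeout=timeout)
    except (asyncio.TimeoutError, OSError):
        # OSError covers refused or reset connections to the local server
        return await fallback(prompt)
```

If your HTTP client raises its own exception types for connection failures, add them to the `except` clause. Logging each fallback is also worth doing, so that a silently failing local model does not turn into an unexpected API bill.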
Common pitfalls
FAQ
A GPU is not required for small models (3B–7B). CPU inference runs at 3–12 tok/sec, which is usable for most chatbots. A GPU is only needed for more than 20 tok/sec or for big models (>13B).
Break-even for Mistral 7B with typical usage is around 500K–1M requests/month. Below that, paid APIs win. Above it, self-hosting wins, especially on an Enterprise VPS (₹10,000/mo flat fee vs $500+ of OpenAI spend).
Fine-tuning is possible: LoRA fine-tuning runs on CPU (slow, on the order of days) or GPU (hours). Start with RAG before fine-tuning; it is cheaper and often enough.
For model choice: Llama 3 for general English, Mistral 7B for a balanced option, Qwen 2.5 for multilingual work (including Hindi and Tamil). All are open-weights.
Several models can share one VPS: pre-pull them and keep the one you use most warm. Ollama unloads idle models after 5 minutes by default; tune this via OLLAMA_KEEP_ALIVE=30m.
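The break-even estimate in the FAQ above comes down to comparing a flat monthly fee with per-request API spend. The sketch below shows that arithmetic. The ₹10,000 flat fee is the figure quoted in this article; the per-request API cost of ₹0.02 is a purely hypothetical placeholder, so substitute your own measured cost.

```python
def break_even_requests(vps_cost_per_month: float, api_cost_per_request: float) -> float:
    """Monthly request volume at which a flat-fee VPS matches per-request API spend."""
    if api_cost_per_request <= 0:
        raise ValueError("api_cost_per_request must be positive")
    return vps_cost_per_month / api_cost_per_request


def cheaper_option(
    requests_per_month: float,
    vps_cost_per_month: float,
    api_cost_per_request: float,
) -> str:
    """Name the cheaper option at a given monthly volume."""
    api_spend = requests_per_month * api_cost_per_request
    return "self-hosted" if api_spend > vps_cost_per_month else "paid API"


# Hypothetical inputs: Rs 10,000/month VPS, Rs 0.02 per API request
print(f"Break-even: about {break_even_requests(10_000, 0.02):,.0f} requests/month")
```

This ignores engineering time and any quality difference between models, both of which push the real break-even point higher.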