# Self-Hosting Llama 3 and Mistral LLMs on DomainIndia VPS (Ollama + vLLM)
**TL;DR:** Self-host open-source LLMs (Llama 3, Mistral, Qwen) on a DomainIndia VPS — keep data private, no per-token costs, tune models to your domain. This guide covers VPS sizing, Ollama for easy setup, vLLM for production throughput, and the CPU vs GPU trade-off for small models.
## Why self-host an LLM
Paid APIs (OpenAI, Claude) are fast and capable, but self-hosted models win when:
- **Privacy** — customer data, medical records, legal contracts never leave your VPS
- **Cost at scale** — past ~1M tokens/day, self-hosted wins even with GPU rental
- **Customisation** — fine-tune on your domain vocabulary
- **Offline / sovereign** — no internet required, no US-company dependency
- **Latency** — local inference = 50ms not 500ms round trip
Trade-offs:
- Quality gap — open models are 6–12 months behind GPT-5 / Claude Opus
- Ops burden — you manage updates, downtime, scaling
- Upfront work to size + tune
## What can a DomainIndia VPS run?
| VPS | RAM | CPU inference model | Speed | Use case |
|---|---|---|---|---|
| Starter (2 GB) | 2 GB | Phi-3-mini (3.8B, 4-bit) | 8–12 tok/sec | Chatbot, classification |
| Business (4 GB) | 4 GB | Mistral 7B (4-bit) | 4–6 tok/sec | Summarisation, Q&A |
| Enterprise (8 GB) | 8 GB | Llama 3 8B (4-bit), Qwen 2.5 7B | 3–5 tok/sec | Production RAG |
| 16 GB+ | 16 GB | Mistral Small 24B (4-bit), Llama 3 70B (2-bit) | 1–3 tok/sec | Advanced reasoning |
**Rule of thumb:** 4-bit quantized model needs ~(model size in B params × 0.6) GB RAM. Llama 3 8B fits in ~5 GB; 70B needs 36+ GB.
**Insight:** CPU inference is slower but completely viable for many use cases. A chatbot that responds in 3 seconds instead of 1 second is still usable. For batch workloads (nightly summarisation, classification), speed matters less.
## Option A — Ollama (easiest)
Ollama wraps model downloading, serving, and a REST API in one binary.
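A minimal setup sketch, using the official install script and the `mistral` model from the Ollama library (swap in any model you prefer):

```bash
# Install Ollama via the official install script
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (~4 GB download for the 4-bit Mistral 7B) and run a quick test
ollama pull mistral
ollama run mistral "Explain GST in two sentences."
```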
Expose as REST API (Ollama auto-serves on localhost:11434):
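For example, a quick call against the local API (a minimal sketch using Ollama's `/api/chat` endpoint; assumes the `mistral` model is already pulled):

```bash
# Non-streaming chat request against the local Ollama server
curl -s http://localhost:11434/api/chat -d '{
  "model": "mistral",
  "messages": [{"role": "user", "content": "Summarise this invoice in one line."}],
  "stream": false
}'
```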
### systemd + nginx front
The Ollama install script creates `/etc/systemd/system/ollama.service` automatically. To expose the API outside the VPS, put nginx in front:
```nginx
server {
    listen 443 ssl;
    server_name llm.yourcompany.com;

    # TLS certs (paths assume certbot defaults for this domain)
    ssl_certificate     /etc/letsencrypt/live/llm.yourcompany.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.yourcompany.com/privkey.pem;

    # Auth — protect the endpoint!
    location / {
        auth_basic "LLM";
        auth_basic_user_file /etc/nginx/.htpasswd-llm;

        proxy_pass http://127.0.0.1:11434;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
    }
}
```
**Warning:** Never expose Ollama unauthenticated to the internet. Bots scan for open endpoints, run free inference on your hardware, and drain your bandwidth. Always front it with nginx plus basic auth or JWT.
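To create the basic-auth credentials file referenced in the nginx config above (a sketch assuming `htpasswd` from the `apache2-utils` package; `llm-user` is just an example username):

```bash
# Install htpasswd and create the credentials file used by auth_basic_user_file
sudo apt install -y apache2-utils
sudo htpasswd -c /etc/nginx/.htpasswd-llm llm-user   # -c creates the file; prompts for a password
sudo systemctl reload nginx
```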
## Option B — vLLM (production throughput)
vLLM is built for serving many concurrent requests efficiently with continuous batching. Much higher throughput than Ollama at scale.
It ideally needs a GPU (vLLM does have CPU support, but it is rarely used in production).
```bash
pip install vllm

# Start server — OpenAI-compatible API
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --port 8000 \
  --max-model-len 8192 \
  --dtype auto
```
Then call with any OpenAI SDK:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```
Drop-in compatible — rest of your app doesn't know it's self-hosted.
## Option C — llama.cpp (bare metal, CPU-optimised)
For squeezing every bit of CPU performance:
```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with CMake (native CPU optimisations such as AVX2 are enabled by default)
cmake -B build
cmake --build build --config Release -j "$(nproc)"

# Download a GGUF model
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Serve
./build/bin/llama-server -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -c 4096 --host 0.0.0.0 --port 8080
```
Lowest overhead, highest tokens/sec on CPU. Ollama actually wraps llama.cpp — use Ollama for convenience, bare llama.cpp when you need max performance.
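A quick smoke test of the server started above (a sketch; llama.cpp's built-in server exposes an OpenAI-compatible chat endpoint alongside its native one, though paths have shifted between versions):

```bash
# OpenAI-compatible chat completion against llama.cpp's server on port 8080
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'
```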
## Picking a model (2026 options)
| Model | Size | Strengths | Weaknesses |
|---|---|---|---|
| Llama 3.2 3B | 2 GB (4-bit) | Fast, good English | Weaker at code, non-English |
| Mistral 7B | 4 GB | Balanced, good reasoning | Mediocre Indian languages |
| Qwen 2.5 7B | 5 GB | Excellent multilingual, code | Larger than Mistral |
| Llama 3 8B | 5 GB | Best general-purpose | Same tier as Mistral+Qwen |
| Phi-3.5 mini 3.8B | 2.5 GB | Tiny + punches above weight | Limited factual knowledge |
| Nemotron 70B (2-bit) | 16 GB | GPT-4-class quality | Slow, large |
For Indian language support: **Qwen 2.5** or **Sarvam's models** (built in India, with native Hindi/Tamil/Telugu support).
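For example, with Ollama (the exact model tag is an assumption; check the Ollama library for current names):

```bash
# Pull Qwen 2.5 7B — strong multilingual coverage, including Hindi and Tamil
ollama pull qwen2.5:7b
```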
## Production patterns
### Pattern 1 — API key gateway
Don't expose your LLM server directly. Put your own API in front that:
- Validates API keys (not basic auth)
- Rate limits per key
- Logs usage for billing
- Falls back to paid API if self-hosted is down
```python
# FastAPI gateway example
import time

import httpx
import redis.asyncio as aioredis
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()
redis = aioredis.Redis()  # local Redis holds valid keys and rate-limit counters

async def validate_key(x_api_key: str = Header()):
    if not await redis.sismember("valid_keys", x_api_key):
        raise HTTPException(401, "Invalid key")
    return x_api_key

@app.post("/v1/chat/completions")
async def chat(req: dict, api_key: str = Depends(validate_key)):
    # Rate limit: 60 requests per key per minute
    now_minute = int(time.time() // 60)
    count = await redis.incr(f"ratelimit:{api_key}:{now_minute}")
    if count > 60:
        raise HTTPException(429, "Rate limit")
    # Forward to the local LLM (Ollama's OpenAI-compatible endpoint)
    async with httpx.AsyncClient(timeout=300) as client:
        resp = await client.post("http://localhost:11434/v1/chat/completions", json=req)
        return resp.json()
```
### Pattern 2 — Model warmup
Cold start on a model is 10–30 seconds. Keep it loaded:
```bash
# Ping every 2 minutes to keep model in RAM
*/2 * * * * curl -s http://localhost:11434/api/generate -d '{"model":"mistral","prompt":"hi","stream":false}' > /dev/null
```
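Alternatively, raise Ollama's idle unload timeout with the `OLLAMA_KEEP_ALIVE` environment variable (a sketch using a systemd drop-in; the 30-minute value is just an example):

```bash
# Keep models in RAM for 30 minutes of idle instead of the default 5
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="OLLAMA_KEEP_ALIVE=30m"\n' | \
  sudo tee /etc/systemd/system/ollama.service.d/keepalive.conf
sudo systemctl daemon-reload && sudo systemctl restart ollama
```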
### Pattern 3 — Fallback chain
```python
import httpx

# ollama_chat / openai_chat are your own thin async client wrappers
async def chat_with_fallback(prompt: str) -> str:
    try:
        return await ollama_chat(prompt)          # self-hosted, free
    except (TimeoutError, httpx.HTTPError):
        return await openai_chat(prompt)          # paid fallback
```
Free by default, pay only during peaks or outages.
## Common pitfalls
- **Exposing the API unauthenticated.** Bots will find port 11434 and burn your compute and bandwidth (see the warning above).
- **Undersizing RAM.** If the quantized model doesn't fit, the OS swaps and throughput collapses; use the sizing rule of thumb above.
- **Ignoring cold starts.** The first request after idle reloads the model (10–30 seconds); keep it warm (Pattern 2).
- **No fallback.** A single VPS is a single point of failure; keep a paid-API fallback in place (Pattern 3).
## FAQ
**Q: Do I need a GPU?**
No — CPU works for small models (3B–7B). Speed is 3–12 tok/sec which is usable for most chatbots. GPU only if you need >20 tok/sec or big models (>13B).
**Q: How much does a self-hosted LLM cost vs OpenAI?**
Break-even for Mistral 7B + typical usage is around 500K–1M requests/month. Below that, paid APIs win. Above, self-hosted wins — especially with Enterprise VPS (₹10,000/mo flat fee vs $500+ OpenAI spend).
**Q: Can I fine-tune on my data?**
Yes — LoRA fine-tuning runs on CPU (slow, ~days) or GPU (hours). Start with RAG before fine-tuning — cheaper, often enough.
**Q: Llama 3, Mistral, or Qwen?**
Llama 3 for general English. Mistral 7B for balanced. Qwen 2.5 for multilingual (including Hindi, Tamil). All open-weights.
**Q: Can Ollama serve multiple models?**
Yes — pre-pull them, keep the one you use most warm. Ollama auto-unloads idle models after 5 min. Tune via OLLAMA_KEEP_ALIVE=30m.
Self-host your own LLM on a DomainIndia VPS — privacy + cost control.
View VPS plans