
By DomainIndia Team · DomainIndia Engineering
6 min read · 24 Apr 2026
# Self-Hosting Llama 3 and Mistral LLMs on DomainIndia VPS (Ollama + vLLM)
TL;DR
Self-host open-source LLMs (Llama 3, Mistral, Qwen) on a DomainIndia VPS — keep data private, no per-token costs, tune models to your domain. This guide covers VPS sizing, Ollama for easy setup, vLLM for production throughput, and the CPU vs GPU trade-off for small models.
## Why self-host an LLM

Paid APIs (OpenAI, Claude) are fast and capable, but self-hosted models win when:

- **Privacy** — customer data, medical records, and legal contracts never leave your VPS
- **Cost at scale** — past ~1M tokens/day, self-hosted wins even with GPU rental
- **Customisation** — fine-tune on your domain vocabulary
- **Offline / sovereign** — no internet required, no dependency on a US company
- **Latency** — local inference is a ~50 ms round trip, not ~500 ms

Trade-offs:

- Quality gap — open models run 6–12 months behind GPT-5 / Claude Opus
- Ops burden — you manage updates, downtime, and scaling
- Upfront work to size and tune the deployment

## What can a DomainIndia VPS run?
| VPS | RAM | CPU inference model | Speed | Use case |
|---|---|---|---|---|
| Starter (2 GB) | 2 GB | Phi-3-mini (3.8B, 4-bit) | 8–12 tok/sec | Chatbot, classification |
| Business (4 GB) | 4 GB | Mistral 7B (4-bit) | 4–6 tok/sec | Summarisation, Q&A |
| Enterprise (8 GB) | 8 GB | Llama 3 8B (4-bit), Qwen 2.5 7B | 3–5 tok/sec | Production RAG |
| 16 GB+ | 16 GB | Mistral Small 24B (4-bit), Llama 3 70B (2-bit) | 1–3 tok/sec | Advanced reasoning |
**Rule of thumb:** a 4-bit quantized model needs ~(model size in B params × 0.6) GB RAM. Llama 3 8B fits in ~5 GB; 70B needs 40+ GB.
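The rule of thumb above can be written as a quick calculator. This is a rough sketch — real usage also varies with context length and the inference runtime:

```python
def estimate_ram_gb(params_billion: float, gb_per_param_b: float = 0.6) -> float:
    """Rough RAM needed for a 4-bit quantized model: ~0.6 GB per billion
    parameters (quantized weights plus KV cache and runtime overhead)."""
    return params_billion * gb_per_param_b

print(estimate_ram_gb(8))   # Llama 3 8B -> 4.8 GB, fits an 8 GB VPS
print(estimate_ram_gb(70))  # Llama 3 70B -> 42.0 GB, needs a big host
```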
Insight

CPU inference is slower but completely viable for many use cases. A chatbot that responds in 3 seconds instead of 1 second is still usable. For batch workloads (nightly summarisation, classification), speed matters less.

## Option A — Ollama (easiest)

Ollama wraps model downloading, serving, and a REST API in one binary.

1. SSH into the VPS as root
2. Install Ollama
3. Download a model
4. Test it from the shell
5. Expose it as a REST API (Ollama auto-serves on localhost:11434)
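A minimal sketch of steps 2–5, using Ollama's published one-line installer and CLI (`mistral` here is just an example model tag):

```shell
# 2. Install Ollama (official install script)
curl -fsSL https://ollama.com/install.sh | sh

# 3. Download a model (4-bit Mistral 7B, ~4 GB on disk)
ollama pull mistral

# 4. Test interactively from the shell
ollama run mistral "Summarise this in one line: VPS hosting keeps data in-country."

# 5. The REST API is already listening on localhost:11434
curl http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "Hello", "stream": false}'
```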
### systemd + nginx front

`/etc/systemd/system/ollama.service` is auto-installed by the setup script. To expose externally:

```nginx
server {
    listen 443 ssl;
    server_name llm.yourcompany.com;

    # Auth — protect the endpoint!
    location / {
        auth_basic "LLM";
        auth_basic_user_file /etc/nginx/.htpasswd-llm;
        proxy_pass http://127.0.0.1:11434;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
    }
}
```
Warning

Never expose Ollama unauthenticated to the internet. Bots scrape models, abuse your compute, and drain your bandwidth. Always front it with nginx plus basic auth or JWT.
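Creating the credentials file that the nginx config references is a one-liner with `htpasswd` from apache2-utils (the username `llmuser` is just an example):

```shell
# Create a bcrypt-hashed basic-auth credentials file
apt-get install -y apache2-utils
htpasswd -cB /etc/nginx/.htpasswd-llm llmuser

# Validate and reload nginx
nginx -t && systemctl reload nginx

# Authenticated request through the proxy
curl -u llmuser:yourpassword https://llm.yourcompany.com/api/tags
```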

## Option B — vLLM (production throughput)

vLLM is built for serving many concurrent requests efficiently with continuous batching, and delivers much higher throughput than Ollama at scale. It ideally wants a GPU (vLLM has CPU support, but it is rarely used in production).

```bash
pip install vllm

# Start server — OpenAI-compatible API
vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8000 --max-model-len 8192 --dtype auto
```

Then call it with any OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
```

It is drop-in compatible — the rest of your app doesn't know it's self-hosted.

## Option C — llama.cpp (bare metal, CPU-optimised)

For squeezing every bit of CPU performance:

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j $(nproc) LLAMA_AVX2=1

# Download a GGUF model
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Serve
./server -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -c 4096 --host 0.0.0.0 --port 8080
```

This has the lowest overhead and the highest tokens/sec on CPU. Ollama actually wraps llama.cpp — use Ollama for convenience, and bare llama.cpp when you need maximum performance.

## Picking a model (2026 options)
| Model | Size | Strengths | Weaknesses |
|---|---|---|---|
| Llama 3.2 3B | 2 GB (4-bit) | Fast, good English | Weaker at code, non-English |
| Mistral 7B | 4 GB | Balanced, good reasoning | Mediocre Indian languages |
| Qwen 2.5 7B | 5 GB | Excellent multilingual, code | Larger than Mistral |
| Llama 3 8B | 5 GB | Best general-purpose | Same tier as Mistral/Qwen |
| Phi-3.5-mini 3.8B | 2.5 GB | Tiny, punches above its weight | Limited factual knowledge |
| Nemotron 70B (2-bit) | 16 GB | GPT-4-class quality | Slow, large |
For Indian language support: **Qwen 2.5** or **Sarvam models** (Indian-built, Hindi/Tamil/Telugu native).

## Production patterns

### Pattern 1 — API key gateway

Don't expose your LLM server directly. Put your own API in front that:

- Validates API keys (not basic auth)
- Rate limits per key
- Logs usage for billing
- Falls back to a paid API if self-hosted is down

```python
# FastAPI gateway example
from datetime import datetime, timezone

import httpx
import redis.asyncio as aioredis
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()
redis = aioredis.Redis()  # async Redis client for keys + rate limits

async def validate_key(x_api_key: str = Header()):
    if not await redis.sismember("valid_keys", x_api_key):
        raise HTTPException(401, "Invalid key")
    return x_api_key

@app.post("/v1/chat/completions")
async def chat(req: dict, api_key: str = Depends(validate_key)):
    # Rate limit: 60 requests per key per minute
    now_minute = datetime.now(timezone.utc).strftime("%Y%m%d%H%M")
    count = await redis.incr(f"ratelimit:{api_key}:{now_minute}")
    if count > 60:
        raise HTTPException(429, "Rate limit")
    # Forward to the local LLM
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://localhost:11434/v1/chat/completions", json=req)
    return resp.json()
```

### Pattern 2 — Model warmup

Cold start on a model is 10–30 seconds. Keep it loaded:

```bash
# Ping every 2 minutes to keep the model in RAM
*/2 * * * * curl -s http://localhost:11434/api/generate -d '{"model":"mistral","prompt":"hi","stream":false}' > /dev/null
```

### Pattern 3 — Fallback chain

```python
try:
    return await ollama_chat(prompt)   # self-hosted, free
except (TimeoutError, HTTPError):
    return await openai_chat(prompt)   # paid fallback
```

Free by default; pay only during peaks or outages.

## FAQ
**Q: Do I need a GPU?**

No — CPU works for small models (3B–7B) at 3–12 tok/sec, which is usable for most chatbots. A GPU is only needed if you want >20 tok/sec or bigger models (>13B).

**Q: How much does a self-hosted LLM cost vs OpenAI?**

Break-even for Mistral 7B + typical usage is around 500K–1M requests/month. Below that, paid APIs win. Above, self-hosted wins — especially with Enterprise VPS (₹10,000/mo flat fee vs $500+ OpenAI spend).
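A rough sanity check on that break-even figure — the ₹10,000/mo flat fee matches the Enterprise plan above, but the exchange rate, blended API price, and tokens-per-request are assumptions you should replace with your own numbers:

```python
def breakeven_requests_per_month(
    vps_monthly_inr: float = 10_000,    # flat VPS fee (from the plan above)
    inr_per_usd: float = 85,            # assumed exchange rate
    usd_per_1k_tokens: float = 0.0005,  # assumed blended API price
    tokens_per_request: int = 700,      # assumed prompt + completion size
) -> float:
    """Requests/month at which a flat-fee VPS beats a per-token API."""
    inr_per_request = usd_per_1k_tokens * tokens_per_request / 1000 * inr_per_usd
    return vps_monthly_inr / inr_per_request

print(f"{breakeven_requests_per_month():,.0f}")  # ~336,000 with these assumptions
```

Cheaper API pricing or longer prompts push the break-even point up into the 500K–1M range cited above.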

**Q: Can I fine-tune on my data?**

Yes — LoRA fine-tuning runs on CPU (slow, ~days) or GPU (hours). Start with RAG before fine-tuning — cheaper, often enough.

**Q: Llama 3, Mistral, or Qwen?**

Llama 3 for general English. Mistral 7B for balanced. Qwen 2.5 for multilingual (including Hindi, Tamil). All open-weights.

**Q: Can Ollama serve multiple models?**

Yes — pre-pull them and keep the one you use most warm. Ollama auto-unloads idle models after 5 minutes by default; tune this via `OLLAMA_KEEP_ALIVE=30m`.
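One way to set that variable persistently, assuming the systemd service the Ollama installer creates, is a drop-in override (created with `systemctl edit ollama`):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
```

Then `systemctl daemon-reload && systemctl restart ollama` to apply.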

Self-host your own LLM on a DomainIndia VPS — privacy + cost control. View VPS plans
