Building a RAG System on DomainIndia VPS: Vector Databases, Embeddings, and Retrieval
What RAG solves
A raw LLM only knows what it learned at training. It doesn't know:
- Your product manual written last month
- Your company's internal policies
- Your customer's support history
- Today's pricing
RAG fixes this in three steps:
- Index — split your docs into chunks, compute embeddings, store in a vector DB
- Retrieve — on user question, find the most relevant chunks by vector similarity
- Generate — send question + retrieved chunks to the LLM, which answers grounded in your data
End result: AI that speaks in your voice, from your source of truth.
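The three steps can be sketched end to end without any external service. This is a toy illustration only: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the three sample chunks are invented, and a Python list stands in for the vector DB.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Index: store each chunk next to its vector (sample chunks are made up).
docs = [
    "SSL certificates are issued from the control panel under SSL Certificates.",
    "Daily backups are kept for seven days.",
    "Email accounts are created from the E-mail Manager.",
]
index = [(chunk, toy_embed(chunk)) for chunk in docs]

# 2. Retrieve: rank stored chunks by similarity to the question.
def retrieve(question, k=2):
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 3. Generate: build the grounded prompt that would be sent to the LLM.
def build_prompt(question, chunks):
    context = "\n---\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

question = "How long are backups kept?"
top = retrieve(question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system, step 1 uses an embedding API and pgvector, and step 3 ends with an LLM call; the shape of the pipeline stays the same.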
The RAG stack on DomainIndia
| Component | Shared hosting | VPS | Recommendation |
|---|---|---|---|
| LLM API (OpenAI/Claude) | Yes | Yes | Any plan — API is remote |
| Embedding API (OpenAI/Voyage) | Yes | Yes | Any plan |
| Vector DB (pgvector) | No (needs PG extension) | Yes | VPS |
| Vector DB (Qdrant/Weaviate) | No | Yes | VPS |
| Chunking + orchestration (LangChain, LlamaIndex) | Limited | Full | VPS for production |
For anything beyond a prototype, use a DomainIndia VPS — you need persistent processes and a database with vector support.
Step 1 — Choose a vector database
| Vector DB | Setup | Best for | RAM |
|---|---|---|---|
| pgvector | PostgreSQL extension | You already use Postgres | +50 MB over Postgres |
| Qdrant | Single binary | Medium-large projects | 500 MB+ |
| Weaviate | Docker container | Larger scale, hybrid search | 1 GB+ |
| Chroma | Python lib, embedded | Prototyping, <100K docs | 100 MB |
For most DomainIndia customers, pgvector is the winner — you already have PostgreSQL, one extension install, no new service.
Step 2 — Install pgvector on DomainIndia VPS
- SSH in as root
- Install PostgreSQL (skip if already installed):
```bash
sudo dnf install -y postgresql-server postgresql-contrib postgresql-devel
sudo postgresql-setup --initdb
sudo systemctl enable --now postgresql
```
- Install pgvector from source:
```bash
# Build tools first: git for the clone, gcc and make for the build
sudo dnf install -y git gcc make
cd /tmp
git clone --branch v0.7.0 https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
```
- Enable in your database:
```bash
# Run as the postgres OS user so default peer authentication succeeds
sudo -u postgres psql -d your_db -c "CREATE EXTENSION vector;"
```
- Verify:
SELECT '[1,2,3]'::vector; should return [1,2,3].
Step 3 — Design your schema
```sql
CREATE TABLE documents (
    id          bigserial PRIMARY KEY,
    source      text NOT NULL,      -- "docs/setup.md" or URL
    chunk_index int NOT NULL,       -- 0, 1, 2 ...
    content     text NOT NULL,      -- the chunk itself
    embedding   vector(1536),       -- text-embedding-3-small and ada-002 are 1536-dim; text-embedding-3-large is 3072
    metadata    jsonb,              -- {category, author, date, ...}
    created_at  timestamptz DEFAULT now()
);

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON documents USING gin (metadata);
```
HNSW is pgvector's approximate nearest-neighbour index: fast at scale. It can index vector columns of up to 2,000 dimensions, so a 3072-dimension model needs either the halfvec type or a reduced output dimension.
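The embedding column only accepts vectors of exactly the declared dimension, and pgvector rejects NaN or infinite components. A minimal sketch of a client-side check that also produces pgvector's text input form (`[x,y,z]`); the helper name `to_pgvector_literal` is an assumption for illustration, not part of any library.

```python
import math

def to_pgvector_literal(values, expected_dim=1536):
    """Validate an embedding and format it as pgvector text input, e.g. '[0.1,0.2]'."""
    if len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(values)}")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("vector components must be finite numbers")
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

print(to_pgvector_literal([1, 2, 3], expected_dim=3))
```

The resulting string can be passed as a query parameter and cast with `%s::vector`; failing early in Python gives a clearer error than a rejected INSERT.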
Step 4 — Chunk your documents
Naive approach: split by character count. Better: split by headings/paragraphs with overlap.
Python example using LangChain:
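```python
# pip install langchain-text-splitters
# (older LangChain releases expose the same class as langchain.text_splitter)
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    # Tried in order: headings first, then paragraphs, lines, sentences, words
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)

with open("docs/setup.md", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

print(f"Got {len(chunks)} chunks")
```
Rule of thumb: chunk size 500–1500 characters, overlap 10–20%. Smaller chunks give more precise retrieval but produce more rows; bigger chunks carry more context each but are less focused.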
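To make chunk size and overlap concrete, here is a minimal sketch of the naive character-count approach with overlap. It is not how LangChain's splitter works internally (that one prefers separator boundaries); `chunk_text` is a hypothetical helper that only shows what the two numbers mean.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size windows; each window repeats the last
    `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

With a 4-character window and 1 character of overlap, each chunk starts on the last character of the previous one, so a sentence cut at a boundary still appears whole in at least one neighbour when the overlap is large enough.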
Step 5 — Compute embeddings
OpenAI's embedding API is the most widely used. It is cheap: text-embedding-3-small costs about $0.02 per 1M tokens.
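```python
from openai import OpenAI
import psycopg2

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text):
    """Return the embedding of one string as a list of floats."""
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    ).data[0].embedding

conn = psycopg2.connect("dbname=your_db")
cur = conn.cursor()

for i, chunk in enumerate(chunks):
    vec = embed(chunk)
    cur.execute(
        "INSERT INTO documents (source, chunk_index, content, embedding, metadata) "
        # psycopg2 sends the list as a Postgres array; ::vector casts it
        "VALUES (%s, %s, %s, %s::vector, %s)",
        ("docs/setup.md", i, chunk, vec, '{"category": "setup"}'),
    )
conn.commit()
```
Batch your embedding calls. OpenAI accepts up to 2048 inputs per request, so one batched call replaces up to 2048 round trips. The per-token price is the same, but indexing is much faster than embedding one chunk at a time: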
```python
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks_batch,  # list of strings, max 2048
)
vecs = [item.embedding for item in response.data]
```
Don't re-embed unchanged text. Store a hash of the source text alongside each embedding and re-embed only when the hash changes; re-embedding everything on every update wastes money.
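A minimal sketch of both ideas, under stated assumptions: `stored_hashes` is a hypothetical dict of chunk index to the hash saved at the last embedding run (the schema in Step 3 has no hash column, so you would need to add one or keep the hash in metadata), and `batched` simply slices a list to respect the 2048-input limit.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_to_embed(chunks, stored_hashes):
    """Return (index, chunk, hash) for chunks that are new or whose text changed.

    stored_hashes: {chunk_index: hash} from the previous run (hypothetical storage).
    """
    stale = []
    for i, chunk in enumerate(chunks):
        h = content_hash(chunk)
        if stored_hashes.get(i) != h:
            stale.append((i, chunk, h))
    return stale

def batched(items, size=2048):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

previous = {0: content_hash("intro"), 1: content_hash("old pricing")}
stale = chunks_to_embed(["intro", "new pricing", "faq"], previous)
print([i for i, _, _ in stale])
```

Only the stale chunks then go through `batched(...)` and the embeddings call; unchanged rows keep their existing vectors.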
Step 6 — Retrieve
On user question, embed the question and find similar chunks:
```python
def retrieve(question, k=5):
    """Return the k closest chunks as (content, metadata, similarity) rows."""
    q_vec = embed(question)
    cur.execute(
        """
        SELECT content, metadata,
               1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (q_vec, q_vec, k),
    )
    return cur.fetchall()

chunks = retrieve("How do I set up SSL on DirectAdmin?")
for chunk, meta, sim in chunks:
    print(f"[{sim:.3f}] {chunk[:100]}...")
```
<=> is pgvector's cosine distance operator. Lower = more similar.
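For intuition, the distance the query orders by can be reproduced in plain Python. This is an illustrative sketch of cosine distance and an exact top-k scan, not pgvector's implementation (which uses the HNSW index to avoid scanning every row); the tiny 2-dimensional vectors are made up.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, rows, k=5):
    """rows: list of (content, vector). Returns (content, similarity), best first."""
    scored = [(content, 1.0 - cosine_distance(query, vec)) for content, vec in rows]
    scored.sort(key=lambda row: row[1], reverse=True)
    return scored[:k]

rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
for content, sim in top_k([1.0, 0.0], rows, k=2):
    print(f"[{sim:.3f}] {content}")
```

Because cosine distance ignores vector length, only the direction of the embedding matters, which is why the similarity column is simply 1 minus the distance.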
Step 7 — Generate the answer
Send question + retrieved chunks to the LLM:
```python
question = "How do I set up SSL on DirectAdmin?"
chunks = retrieve(question)

# Join the retrieved chunks with a visible separator
context = "\n---\n".join(chunk for chunk, _, _ in chunks)

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": (
            "You are a helpful assistant answering questions from the provided context. "
            "If the answer isn't in the context, say 'I don't know'. Don't make things up."
        )},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
    temperature=0.1,
)
print(completion.choices[0].message.content)
```
Hybrid search: better than pure vector
Pure vector retrieval misses exact matches (product codes, SKUs). Combine with keyword search:
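```sql
-- Parameters, in order: question vector, question text, question vector, question text
SELECT content, metadata
FROM documents
WHERE (
    embedding <=> %s::vector < 0.3                                        -- semantic match
    OR to_tsvector('english', content) @@ plainto_tsquery('english', %s)  -- keyword match
)
ORDER BY (
    0.7 * (1 - (embedding <=> %s::vector))
    -- plainto_tsquery accepts raw user text; to_tsquery rejects multi-word input without operators
    + 0.3 * ts_rank(to_tsvector('english', content), plainto_tsquery('english', %s))
) DESC
LIMIT 5;
```
Evaluating retrieval quality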
Build a small eval set — 20–50 question-answer pairs from your actual docs. Measure:
- Recall@5 — is the right chunk in the top 5? Target >80%.
- Precision@5 — of the top 5, how many are relevant?
- Answer accuracy — does the LLM give correct answers?
Tools: RAGAS (Python lib) measures all of these automatically.
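The two retrieval metrics are simple enough to compute by hand before reaching for a library. A minimal sketch, assuming each eval question is labelled with the IDs of the chunks that answer it; the IDs below are made up.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant chunks that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top k results (or fewer, if fewer were returned) that are relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for r in top if r in relevant) / len(top)

# One eval question: the retriever returned these chunk IDs, best first
retrieved = [10, 4, 7, 2, 9]
relevant = [7, 3]
print(recall_at_k(retrieved, relevant), precision_at_k(retrieved, relevant))
```

When every question has exactly one correct chunk, recall@5 per question is 0 or 1, and its average over the eval set is the "right chunk in the top 5" rate described above.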
Common pitfalls
FAQ
Choose pgvector if you already use Postgres and have fewer than 5M chunks. Choose Qdrant if you need >50K queries per second, need multi-tenancy, or prefer a specialised DB. For most DomainIndia customers, pgvector wins.
pgvector on a 4 GB VPS handles ~1M vectors comfortably. For 10M+, go to 8 GB or use Qdrant with disk-based storage.
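A rough way to sanity-check these sizes for your own model: pgvector stores each `vector` value in 4 bytes per dimension plus 8 bytes of overhead. The sketch below computes raw vector storage only; it deliberately ignores the chunk text, row overhead, and the HNSW index, so treat the result as a lower bound.

```python
def vector_storage_bytes(num_vectors, dimensions):
    """Raw storage for pgvector `vector` values: 4 bytes per dimension + 8 bytes each."""
    return num_vectors * (4 * dimensions + 8)

for dims in (384, 1536, 3072):
    gb = vector_storage_bytes(1_000_000, dims) / 1e9
    print(f"{dims} dims: {gb:.2f} GB per 1M vectors")
```

Not all of this has to sit in RAM, but HNSW queries are fastest when the index fits in memory, so the embedding dimension matters as much as the row count when picking a plan.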
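You can also compute embeddings without an API: sentence-transformers runs locally. all-MiniLM-L6-v2 is about 80 MB, gives decent quality at zero API cost, and produces 384-dimension vectors, so declare the column as vector(384). It needs a VPS with ~2 GB RAM for inference.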
To reduce hallucinations: (a) use a strict system prompt ("If not in context, say 'I don't know'."); (b) keep the temperature low (0–0.2); (c) return citations (the chunk source) so users can verify the answer.
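Point (c) can be sketched as a prompt builder that numbers each chunk and shows its source, so the model can cite and the user can check. This assumes the retrieval query also selects the `source` column (the `retrieve` function in Step 6 does not); `build_messages` is a hypothetical helper and the sample hits are invented.

```python
def build_messages(question, hits):
    """hits: list of (source, content) tuples, best match first."""
    blocks = [
        f"[{n}] (source: {source})\n{content}"
        for n, (source, content) in enumerate(hits, start=1)
    ]
    context = "\n---\n".join(blocks)
    system = (
        "Answer only from the numbered context and cite the numbers you used, e.g. [1]. "
        "If the answer isn't in the context, say 'I don't know'."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

messages = build_messages(
    "How do I enable SSL?",
    [("docs/ssl.md", "Open SSL Certificates in the control panel."),
     ("docs/setup.md", "Point your domain at the server first.")],
)
print(messages[1]["content"])
```

The returned list can be passed as `messages=` to the chat completion call, and the bracketed numbers in the answer map back to the sources you display.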
With Claude as the generation model, the pipeline works identically. Embed with Voyage AI (Anthropic's partner) or with OpenAI, then pass the retrieved chunks to Claude; the embedding model and the generation model are independent.
RAG needs a VPS for the vector DB + embedding pipeline.