# Building a RAG System on DomainIndia VPS: Vector Databases, Embeddings, and Retrieval
> **TL;DR:** Retrieval-Augmented Generation (RAG) combines an LLM with your own data — product docs, knowledge base, legal contracts — so the AI answers from your content instead of hallucinating. This guide walks through the full RAG stack: chunking, embeddings, vector databases (pgvector, Qdrant, Weaviate), and retrieval on a DomainIndia VPS.
## What RAG solves
A raw LLM only knows what it learned at training. It doesn't know:
- Your product manual written last month
- Your company's internal policies
- Your customer's support history
- Today's pricing
RAG fixes this in three steps:
1. **Index** — split your docs into chunks, compute embeddings, store in a vector DB
2. **Retrieve** — on user question, find the most relevant chunks by vector similarity
3. **Generate** — send question + retrieved chunks to the LLM, which answers grounded in your data
End result: AI that speaks in your voice, from your source of truth.
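The three steps can be sketched in a few lines of plain Python. This is a toy: the letter-frequency "embedding" and the two hard-coded documents stand in for a real embedding model and vector DB, purely to show the index → retrieve → generate flow:

```python
import math

def embed(text):
    # Stand-in "embedding": letter-frequency vector (real systems use a model)
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Index: store (chunk, embedding) pairs
docs = ["SSL setup uses DirectAdmin", "Pricing starts at 499 INR"]
index = [(d, embed(d)) for d in docs]

# 2. Retrieve: rank chunks by similarity to the question
def retrieve(question, k=1):
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 3. Generate: in a real system, send question + retrieved chunks to the LLM
context = retrieve("How do I set up SSL?")[0]
```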
## The RAG stack on DomainIndia
| Component | Shared hosting | VPS | Recommendation |
|---|---|---|---|
| LLM API (OpenAI/Claude) | Yes | Yes | Any plan — API is remote |
| Embedding API (OpenAI/Voyage) | Yes | Yes | Any plan |
| Vector DB (pgvector) | No (needs PG extension) | Yes | VPS |
| Vector DB (Qdrant/Weaviate) | No | Yes | VPS |
| Chunking + orchestration (LangChain, LlamaIndex) | Limited | Full | VPS for production |
For anything beyond a prototype, use a **DomainIndia VPS** — you need persistent processes and a database with vector support.
## Step 1 — Choose a vector database
| Vector DB | Setup | Best for | RAM |
|---|---|---|---|
| pgvector | PostgreSQL extension | You already use Postgres | +50 MB over Postgres |
| Qdrant | Single binary | Medium-large projects | 500 MB+ |
| Weaviate | Docker container | Larger scale, hybrid search | 1 GB+ |
| Chroma | Python lib, embedded | Prototyping, <100K docs | 100 MB |
**For most DomainIndia customers, pgvector is the winner** — you already have PostgreSQL, one extension install, no new service.
## Step 2 — Install pgvector on DomainIndia VPS
Install PostgreSQL (skip if already installed):
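On an Ubuntu/Debian VPS (assumed here; RHEL-family distros use `dnf` and different package names):

```shell
sudo apt update
sudo apt install -y postgresql postgresql-contrib
sudo systemctl enable --now postgresql
```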
Install pgvector from source:
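A typical source build looks like this (assuming Ubuntu/Debian; pin whichever pgvector release is current — v0.7.4 is used here as an example):

```shell
# Build tools plus the Postgres server headers pgvector compiles against
sudo apt install -y git build-essential postgresql-server-dev-all
git clone --branch v0.7.4 https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
```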
Enable the extension in your database (`CREATE EXTENSION vector;`), then verify: `SELECT '[1,2,3]'::vector;` should return `[1,2,3]`.
## Step 3 — Design your schema
```sql
CREATE TABLE documents (
    id          bigserial PRIMARY KEY,
    source      text NOT NULL,   -- "docs/setup.md" or URL
    chunk_index int NOT NULL,    -- 0, 1, 2 ...
    content     text NOT NULL,   -- the chunk itself
    embedding   vector(1536),    -- 1536 dims (text-embedding-3-small / ada-002); use 3072 for text-embedding-3-large
    metadata    jsonb,           -- {category, author, date, ...}
    created_at  timestamptz DEFAULT now()
);

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON documents USING gin (metadata);
```
HNSW is pgvector's approximate nearest-neighbour index — fast at scale. (Note: pgvector's HNSW index supports at most 2,000 dimensions, so 3,072-dim embeddings need the `halfvec` type or dimensionality reduction.)
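The recall/speed trade-off is also tunable per session: pgvector exposes `hnsw.ef_search` (default 40), the number of candidates examined per query:

```sql
-- Higher ef_search = better recall, slower queries (pgvector default: 40)
SET hnsw.ef_search = 100;
```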
## Step 4 — Chunk your documents
Naive approach: split by character count. Better: split by headings/paragraphs with overlap.
Python example using LangChain:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)

with open("docs/setup.md") as f:
    chunks = splitter.split_text(f.read())
print(f"Got {len(chunks)} chunks")
```
Rule of thumb: **chunk size 500–1500 characters, overlap 10–20%**. Smaller = more precise retrieval but more chunks; bigger = more context per chunk but less focused.
## Step 5 — Compute embeddings
OpenAI has the most widely-used embedding API, and it's cheap: text-embedding-3-small costs ~$0.02 per 1M tokens.
```python
from openai import OpenAI
import psycopg2

client = OpenAI()

def embed(text):
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    ).data[0].embedding

conn = psycopg2.connect("dbname=your_db")
cur = conn.cursor()

for i, chunk in enumerate(chunks):
    # str(vec) produces the '[...]' text form that pgvector parses on insert;
    # a bare Python list would be sent as a Postgres array and fail
    vec = str(embed(chunk))
    cur.execute(
        "INSERT INTO documents (source, chunk_index, content, embedding, metadata) "
        "VALUES (%s, %s, %s, %s, %s)",
        ("docs/setup.md", i, chunk, vec, '{"category": "setup"}'),
    )
conn.commit()
```
**Batch embeddings** — OpenAI accepts up to 2048 inputs per call. The per-token price is the same, but batching collapses thousands of API round-trips into a handful, so indexing runs far faster:
```python
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks_batch,  # list of strings, max 2048
)
vecs = [item.embedding for item in response.data]
```
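To stay under that cap, slice the chunk list into groups first (a small helper; 2048 is OpenAI's documented per-request limit):

```python
def batched(items, n=2048):
    # Yield successive slices of at most n items per API call
    for i in range(0, len(items), n):
        yield items[i:i + n]

sizes = [len(b) for b in batched(list(range(5000)))]
# three batches: 2048, 2048, 904
```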
> **Warning:** cache your embeddings. Re-embedding on every change wastes money. Store each embedding alongside a hash of its source text, and only re-embed a chunk when that hash changes.
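A minimal sketch of that source-hash check (hypothetical helper names; adapt to your schema):

```python
import hashlib

def source_hash(text):
    # Stable fingerprint of a chunk's text; store it next to the embedding
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reembed(chunk, stored_hash):
    # Call the embedding API only when the text actually changed
    return source_hash(chunk) != stored_hash
```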
## Step 6 — Retrieve
On user question, embed the question and find similar chunks:
```python
def retrieve(question, k=5):
    # str() gives the '[...]' text form, which the ::vector cast parses
    q_vec = str(embed(question))
    cur.execute(
        """
        SELECT content, metadata,
               1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (q_vec, q_vec, k),
    )
    return cur.fetchall()

chunks = retrieve("How do I set up SSL on DirectAdmin?")
for chunk, meta, sim in chunks:
    print(f"[{sim:.3f}] {chunk[:100]}...")
```
`<=>` is pgvector's cosine distance operator. Lower = more similar.
## Step 7 — Generate the answer
Send question + retrieved chunks to the LLM:
```python
context = "\n---\n".join(chunk for chunk, _, _ in chunks)

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": (
            "You are a helpful assistant answering questions from the provided context. "
            "If the answer isn't in the context, say 'I don't know'. Don't make things up."
        )},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
    temperature=0.1,
)
print(completion.choices[0].message.content)
```
## Hybrid search — better than pure vector
Pure vector retrieval misses exact matches (product codes, SKUs). Combine with keyword search:
```sql
SELECT content, metadata
FROM documents
WHERE (
    embedding <=> %s::vector < 0.3                                        -- semantic match
    OR to_tsvector('english', content) @@ plainto_tsquery('english', %s)  -- keyword match
)
ORDER BY (
    0.7 * (1 - (embedding <=> %s::vector))
    + 0.3 * ts_rank(to_tsvector('english', content), plainto_tsquery('english', %s))
) DESC
LIMIT 5;
```
(`plainto_tsquery` is used instead of `to_tsquery` because it tolerates raw user input; `to_tsquery` errors on anything that isn't valid operator syntax.)
## Evaluating retrieval quality
Build a small eval set — 20–50 question-answer pairs from your actual docs. Measure:
- **Recall@5** — is the right chunk in the top 5? Target >80%.
- **Precision@5** — of the top 5, how many are relevant?
- **Answer accuracy** — does the LLM give correct answers?
Tooling: [RAGAS](https://github.com/explodinggradients/ragas), a Python library, measures all of these automatically.
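Recall@5 itself takes only a few lines (sketch with made-up chunk IDs and stand-in retrieval output):

```python
def recall_at_k(retrieved_ids, correct_id, k=5):
    # 1 if the known-correct chunk appears in the top-k results, else 0
    return int(correct_id in retrieved_ids[:k])

# eval set: (question, ID of the chunk that answers it) pairs, made up here
eval_set = [
    ("how do I enable SSL?", "doc-3"),
    ("what does the VPS cost?", "doc-7"),
]
# stand-in retrieval output keyed by question
retrieved = {
    "how do I enable SSL?": ["doc-3", "doc-1"],
    "what does the VPS cost?": ["doc-2", "doc-4"],
}

recall = sum(recall_at_k(retrieved[q], cid) for q, cid in eval_set) / len(eval_set)
# here: 1 hit out of 2 questions, so recall = 0.5
```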
## Common pitfalls
- **Chunks too large:** the LLM gets diluted context; stay in the 500–1500 character range.
- **No overlap:** answers that straddle a chunk boundary get lost; keep 10–20% overlap.
- **Re-embedding everything on every change:** cache embeddings keyed by a source hash.
- **Pure vector search only:** exact identifiers (SKUs, error codes) need keyword search too.
- **No eval set:** without Recall@5 numbers you can't tell whether a tweak actually helped.
## FAQ
**Q: pgvector or Qdrant?**
pgvector if you already use Postgres and have <5M chunks. Qdrant if you need >50K queries per second, need multi-tenancy, or prefer a specialised DB. For most DomainIndia customers, pgvector wins.

**Q: How much RAM for a vector DB?**
pgvector on a 4 GB VPS handles ~1M vectors comfortably. For 10M+, go to 8 GB or use Qdrant with disk-based storage.

**Q: Can I use open-source embeddings instead of OpenAI?**
Yes. sentence-transformers runs locally — all-MiniLM-L6-v2 is 80 MB, decent quality, zero API cost. Needs a VPS with ~2 GB RAM for inference.

**Q: How do I stop the LLM from hallucinating even with RAG?**
(a) Strict system prompt: "If not in context, say 'I don't know'." (b) Low temperature (0–0.2). (c) Return citations (chunk source) so users can verify.

**Q: What about Claude instead of OpenAI?**
It works identically. Use Voyage AI embeddings (Anthropic's partner) or OpenAI embeddings with Claude for generation — they interoperate.