# Building a RAG System on DomainIndia VPS: Vector Databases, Embeddings, and Retrieval
> **TL;DR:** Retrieval-Augmented Generation (RAG) combines an LLM with your own data — product docs, knowledge base, legal contracts — so the AI answers from your content instead of hallucinating. This guide walks through the full RAG stack: chunking, embeddings, vector databases (pgvector, Qdrant, Weaviate), and retrieval on a DomainIndia VPS.
## What RAG solves
A raw LLM only knows what it learned at training. It doesn't know:
- Your product manual written last month
- Your company's internal policies
- Your customer's support history
- Today's pricing
RAG fixes this in three steps:
1. **Index** — split your docs into chunks, compute embeddings, store in a vector DB
2. **Retrieve** — on user question, find the most relevant chunks by vector similarity
3. **Generate** — send question + retrieved chunks to the LLM, which answers grounded in your data
End result: AI that speaks in your voice, from your source of truth.
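The three steps can be sketched in a few lines of plain Python. This is a toy: the letter-frequency "embedding" and the two hard-coded documents stand in for a real embedding model and vector DB, purely to show the index → retrieve → generate flow:

```python
import math

def embed(text):
    # Stand-in "embedding": letter-frequency vector (real systems use a model)
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Index: store (chunk, embedding) pairs
docs = ["SSL setup uses DirectAdmin", "Pricing starts at 499 INR"]
index = [(d, embed(d)) for d in docs]

# 2. Retrieve: rank chunks by similarity to the question
def retrieve(question, k=1):
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 3. Generate: in a real system, send question + retrieved chunks to the LLM
context = retrieve("How do I set up SSL?")[0]
```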
## The RAG stack on DomainIndia
| Component | Shared hosting | VPS | Recommendation |
|---|---|---|---|
| LLM API (OpenAI/Claude) | Yes | Yes | Any plan — API is remote |
| Embedding API (OpenAI/Voyage) | Yes | Yes | Any plan |
| Vector DB (pgvector) | No (needs PG extension) | Yes | VPS |
| Vector DB (Qdrant/Weaviate) | No | Yes | VPS |
| Chunking + orchestration (LangChain, LlamaIndex) | Limited | Full | VPS for production |
For anything beyond a prototype, use a **DomainIndia VPS** — you need persistent processes and a database with vector support.
## Step 1 — Choose a vector database
| Vector DB | Setup | Best for | RAM |
|---|---|---|---|
| pgvector | PostgreSQL extension | You already use Postgres | +50 MB over Postgres |
| Qdrant | Single binary | Medium-large projects | 500 MB+ |
| Weaviate | Docker container | Larger scale, hybrid search | 1 GB+ |
| Chroma | Python lib, embedded | Prototyping, <100K docs | 100 MB |
**For most DomainIndia customers, pgvector is the winner** — you already have PostgreSQL, one extension install, no new service.
## Step 2 — Install pgvector on DomainIndia VPS
Install PostgreSQL (skip if already installed):
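On an Ubuntu/Debian VPS (assumed here; RHEL-family distros use `dnf` and different package names):

```shell
sudo apt update
sudo apt install -y postgresql postgresql-contrib
sudo systemctl enable --now postgresql
```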
Install pgvector from source:
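A typical source build looks like this (assuming Ubuntu/Debian; pin whichever pgvector release is current — v0.7.4 is used here as an example):

```shell
# Build tools plus the Postgres server headers pgvector compiles against
sudo apt install -y git build-essential postgresql-server-dev-all
git clone --branch v0.7.4 https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
```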
Enable the extension in your database (`CREATE EXTENSION vector;`), then verify: `SELECT '[1,2,3]'::vector;` should return `[1,2,3]`.
## Step 3 — Design your schema
```sql
CREATE TABLE documents (
    id          bigserial PRIMARY KEY,
    source      text NOT NULL,   -- "docs/setup.md" or URL
    chunk_index int NOT NULL,    -- 0, 1, 2 ...
    content     text NOT NULL,   -- the chunk itself
    embedding   vector(1536),    -- 1536 dims (text-embedding-3-small / ada-002); use 3072 for text-embedding-3-large
    metadata    jsonb,           -- {category, author, date, ...}
    created_at  timestamptz DEFAULT now()
);

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON documents USING gin (metadata);
```
HNSW is pgvector's approximate nearest-neighbour index — fast at scale. (Note: pgvector's HNSW index supports at most 2,000 dimensions, so 3,072-dim embeddings need the `halfvec` type or dimensionality reduction.)
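The recall/speed trade-off is also tunable per session: pgvector exposes `hnsw.ef_search` (default 40), the number of candidates examined per query:

```sql
-- Higher ef_search = better recall, slower queries (pgvector default: 40)
SET hnsw.ef_search = 100;
```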
## Step 4 — Chunk your documents
Naive approach: split by character count. Better: split by headings/paragraphs with overlap.
Python example using LangChain:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)

with open("docs/setup.md") as f:
    chunks = splitter.split_text(f.read())
print(f"Got {len(chunks)} chunks")
```
Rule of thumb: **chunk size 500–1500 characters, overlap 10–20%**. Smaller = more precise retrieval but more chunks; bigger = more context per chunk but less focused.
## Step 5 — Compute embeddings
OpenAI has the most widely-used embedding API, and it's cheap: text-embedding-3-small costs ~$0.02 per 1M tokens.
```python
from openai import OpenAI
import psycopg2

client = OpenAI()

def embed(text):
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    ).data[0].embedding

conn = psycopg2.connect("dbname=your_db")
cur = conn.cursor()

for i, chunk in enumerate(chunks):
    # str(vec) produces the '[...]' text form that pgvector parses on insert;
    # a bare Python list would be sent as a Postgres array and fail
    vec = str(embed(chunk))
    cur.execute(
        "INSERT INTO documents (source, chunk_index, content, embedding, metadata) "
        "VALUES (%s, %s, %s, %s, %s)",
        ("docs/setup.md", i, chunk, vec, '{"category": "setup"}'),
    )
conn.commit()
```
**Batch embeddings** — OpenAI accepts up to 2048 inputs per call. The per-token price is the same, but batching collapses thousands of API round-trips into a handful, so indexing runs far faster:
```python
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks_batch,  # list of strings, max 2048
)
vecs = [item.embedding for item in response.data]
```
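To stay under that cap, slice the chunk list into groups first (a small helper; 2048 is OpenAI's documented per-request limit):

```python
def batched(items, n=2048):
    # Yield successive slices of at most n items per API call
    for i in range(0, len(items), n):
        yield items[i:i + n]

sizes = [len(b) for b in batched(list(range(5000)))]
# three batches: 2048, 2048, 904
```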
> **Warning:** cache your embeddings. Re-embedding on every change wastes money. Store each embedding alongside a hash of its source text, and only re-embed a chunk when that hash changes.
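A minimal sketch of that source-hash check (hypothetical helper names; adapt to your schema):

```python
import hashlib

def source_hash(text):
    # Stable fingerprint of a chunk's text; store it next to the embedding
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reembed(chunk, stored_hash):
    # Call the embedding API only when the text actually changed
    return source_hash(chunk) != stored_hash
```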
## Step 6 — Retrieve
On user question, embed the question and find similar chunks:
```python
def retrieve(question, k=5):
    # str() gives the '[...]' text form, which the ::vector cast parses
    q_vec = str(embed(question))
    cur.execute(
        """
        SELECT content, metadata,
               1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (q_vec, q_vec, k),
    )
    return cur.fetchall()

chunks = retrieve("How do I set up SSL on DirectAdmin?")
for chunk, meta, sim in chunks:
    print(f"[{sim:.3f}] {chunk[:100]}...")
```
`<=>` is pgvector's cosine distance operator. Lower = more similar.
## Step 7 — Generate the answer
Send question + retrieved chunks to the LLM:
```python
context = "\n---\n".join(chunk for chunk, _, _ in chunks)

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": (
            "You are a helpful assistant answering questions from the provided context. "
            "If the answer isn't in the context, say 'I don't know'. Don't make things up."
        )},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
    temperature=0.1,
)
print(completion.choices[0].message.content)
```
## Hybrid search — better than pure vector
Pure vector retrieval misses exact matches (product codes, SKUs). Combine with keyword search:
```sql
SELECT content, metadata
FROM documents
WHERE (
    embedding <=> %s::vector < 0.3                                        -- semantic match
    OR to_tsvector('english', content) @@ plainto_tsquery('english', %s)  -- keyword match
)
ORDER BY (
    0.7 * (1 - (embedding <=> %s::vector))
    + 0.3 * ts_rank(to_tsvector('english', content), plainto_tsquery('english', %s))
) DESC
LIMIT 5;
```
(`plainto_tsquery` is used instead of `to_tsquery` because it tolerates raw user input; `to_tsquery` errors on anything that isn't valid operator syntax.)
## Evaluating retrieval quality
Build a small eval set — 20–50 question-answer pairs from your actual docs. Measure:
- **Recall@5** — is the right chunk in the top 5? Target >80%.
- **Precision@5** — of the top 5, how many are relevant?
- **Answer accuracy** — does the LLM give correct answers?
Tooling: [RAGAS](https://github.com/explodinggradients/ragas), a Python library, measures all of these automatically.
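Recall@5 itself takes only a few lines (sketch with made-up chunk IDs and stand-in retrieval output):

```python
def recall_at_k(retrieved_ids, correct_id, k=5):
    # 1 if the known-correct chunk appears in the top-k results, else 0
    return int(correct_id in retrieved_ids[:k])

# eval set: (question, ID of the chunk that answers it) pairs, made up here
eval_set = [
    ("how do I enable SSL?", "doc-3"),
    ("what does the VPS cost?", "doc-7"),
]
# stand-in retrieval output keyed by question
retrieved = {
    "how do I enable SSL?": ["doc-3", "doc-1"],
    "what does the VPS cost?": ["doc-2", "doc-4"],
}

recall = sum(recall_at_k(retrieved[q], cid) for q, cid in eval_set) / len(eval_set)
# here: 1 hit out of 2 questions, so recall = 0.5
```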
## Common pitfalls
- **Chunks too large:** the LLM gets diluted context; stay in the 500–1500 character range.
- **No overlap:** answers that straddle a chunk boundary get lost; keep 10–20% overlap.
- **Re-embedding everything on every change:** cache embeddings keyed by a source hash.
- **Pure vector search only:** exact identifiers (SKUs, error codes) need keyword search too.
- **No eval set:** without Recall@5 numbers you can't tell whether a tweak actually helped.
## FAQ
**Q: pgvector or Qdrant?**
pgvector if you already use Postgres and have <5M chunks. Qdrant if you need >50K queries per second, need multi-tenancy, or prefer a specialised DB. For most DomainIndia customers, pgvector wins.

**Q: How much RAM for a vector DB?**
pgvector on a 4 GB VPS handles ~1M vectors comfortably. For 10M+, go to 8 GB or use Qdrant with disk-based storage.

**Q: Can I use open-source embeddings instead of OpenAI?**
Yes. sentence-transformers runs locally — all-MiniLM-L6-v2 is 80 MB, decent quality, zero API cost. Needs a VPS with ~2 GB RAM for inference.

**Q: How do I stop the LLM from hallucinating even with RAG?**
(a) Strict system prompt: "If not in context, say 'I don't know'." (b) Low temperature (0–0.2). (c) Return citations (chunk source) so users can verify.

**Q: What about Claude instead of OpenAI?**
It works identically. Use Voyage AI embeddings (Anthropic's partner) or OpenAI embeddings with Claude for generation — they interoperate.