

By DomainIndia Team · DomainIndia Engineering
6 min read · 24 Apr 2026
# Building a RAG System on DomainIndia VPS: Vector Databases, Embeddings, and Retrieval
TL;DR
Retrieval-Augmented Generation (RAG) combines an LLM with your own data — product docs, knowledge base, legal contracts — so the AI answers from your content instead of hallucinating. This guide walks through the full RAG stack: chunking, embeddings, vector databases (pgvector, Qdrant, Weaviate), and retrieval on a DomainIndia VPS.
## What RAG solves

A raw LLM only knows what it learned at training time. It doesn't know:

- Your product manual written last month
- Your company's internal policies
- Your customer's support history
- Today's pricing

RAG fixes this in three steps:

1. **Index** — split your docs into chunks, compute embeddings, store them in a vector DB
2. **Retrieve** — on a user question, find the most relevant chunks by vector similarity
3. **Generate** — send the question + retrieved chunks to the LLM, which answers grounded in your data

End result: AI that speaks in your voice, from your source of truth.

## The RAG stack on DomainIndia
| Component | Shared hosting | VPS | Recommendation |
|---|---|---|---|
| LLM API (OpenAI/Claude) | Yes | Yes | Any plan — API is remote |
| Embedding API (OpenAI/Voyage) | Yes | Yes | Any plan |
| Vector DB (pgvector) | No (needs PG extension) | Yes | VPS |
| Vector DB (Qdrant/Weaviate) | No | Yes | VPS |
| Chunking + orchestration (LangChain, LlamaIndex) | Limited | Full | VPS for production |
For anything beyond a prototype, use a **DomainIndia VPS** — you need persistent processes and a database with vector support.

## Step 1 — Choose a vector database
| Vector DB | Setup | Best for | RAM |
|---|---|---|---|
| pgvector | PostgreSQL extension | You already use Postgres | +50 MB over Postgres |
| Qdrant | Single binary | Medium-large projects | 500 MB+ |
| Weaviate | Docker container | Larger scale, hybrid search | 1 GB+ |
| Chroma | Python lib, embedded | Prototyping, <100K docs | 100 MB |
**For most DomainIndia customers, pgvector is the winner** — you already have PostgreSQL, one extension install, no new service.

## Step 2 — Install pgvector on DomainIndia VPS
1. SSH in as root.
2. Install PostgreSQL (skip if already installed).
3. Install pgvector from source.
4. Enable the extension in your database.
5. Verify: `SELECT '[1,2,3]'::vector;` should return `[1,2,3]`.
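The steps above can be sketched as shell commands for a Debian/Ubuntu VPS. Package names, the PostgreSQL major version, and the pgvector version tag are assumptions — adjust for your distro and installed Postgres:

```shell
# 2. Install PostgreSQL and build tools (Debian/Ubuntu; dev package must match your PG version)
apt update
apt install -y postgresql postgresql-server-dev-16 git build-essential

# 3. Build and install pgvector from source (tag is an example; pick the latest release)
git clone --branch v0.7.0 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install   # copies the extension into the Postgres extension directory

# 4. Enable the extension in your database
sudo -u postgres psql -d your_db -c "CREATE EXTENSION IF NOT EXISTS vector;"

# 5. Verify
sudo -u postgres psql -d your_db -c "SELECT '[1,2,3]'::vector;"
```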
## Step 3 — Design your schema

```sql
CREATE TABLE documents (
    id          bigserial PRIMARY KEY,
    source      text NOT NULL,   -- "docs/setup.md" or URL
    chunk_index int NOT NULL,    -- 0, 1, 2 ...
    content     text NOT NULL,   -- the chunk itself
    embedding   vector(1536),    -- 1536 dims for text-embedding-3-small/ada-002; use 3072 for text-embedding-3-large
    metadata    jsonb,           -- {category, author, date, ...}
    created_at  timestamptz DEFAULT now()
);

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON documents USING gin (metadata);
```

HNSW is pgvector's approximate nearest-neighbour index — fast at scale.

## Step 4 — Chunk your documents

Naive approach: split by character count. Better: split by headings/paragraphs with overlap. Python example using LangChain:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)

with open("docs/setup.md") as f:
    chunks = splitter.split_text(f.read())

print(f"Got {len(chunks)} chunks")
```

Rule of thumb: **chunk size 500–1500 characters, overlap 10–20%**. Smaller = more precise retrieval but more chunks; bigger = more context per chunk but less focused.

## Step 5 — Compute embeddings

OpenAI has the most widely used embedding API. Cheap: ~$0.02 per 1M tokens.

```python
from openai import OpenAI
import psycopg2

client = OpenAI()

def embed(text):
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    ).data[0].embedding

conn = psycopg2.connect("dbname=your_db")
cur = conn.cursor()

for i, chunk in enumerate(chunks):
    vec = embed(chunk)
    cur.execute(
        "INSERT INTO documents (source, chunk_index, content, embedding, metadata) "
        "VALUES (%s, %s, %s, %s, %s)",
        # pgvector accepts the '[1,2,3]' text format, so pass the vector as a string
        ("docs/setup.md", i, chunk, str(vec), '{"category": "setup"}')
    )
conn.commit()
```

**Batch embeddings** — OpenAI accepts up to 2048 inputs per call.
One request per batch is far faster than one request per chunk and avoids per-call overhead:

```python
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks_batch,  # list of strings, max 2048
)
vecs = [item.embedding for item in response.data]
```
> **Warning:** Cache embeddings — don't recompute them. Re-embedding on every pipeline run wastes money. Store each chunk's embedding alongside a hash of its source text, and only re-embed when the hash changes.
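A minimal sketch of that caching check, hashing the chunk text with SHA-256 (function names are illustrative):

```python
import hashlib

def source_hash(text: str) -> str:
    """Stable fingerprint of a chunk's source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reembed(text: str, stored_hash) -> bool:
    """Re-embed only when the chunk is new or its text changed."""
    return stored_hash != source_hash(text)

# Unchanged text is skipped; edited text triggers a re-embed
h = source_hash("Install pgvector from source.")
assert not needs_reembed("Install pgvector from source.", h)
assert needs_reembed("Install pgvector from Git.", h)
```

Store the hash in a column next to `embedding` and compare before calling the embedding API.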

## Step 6 — Retrieve

On a user question, embed the question and find similar chunks:

```python
def retrieve(question, k=5):
    q_vec = embed(question)
    cur.execute(
        """
        SELECT content, metadata, 1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        # pass the vector in pgvector's '[...]' text format
        (str(q_vec), str(q_vec), k)
    )
    return cur.fetchall()

chunks = retrieve("How do I set up SSL on DirectAdmin?")
for chunk, meta, sim in chunks:
    print(f"[{sim:.3f}] {chunk[:100]}...")
```

`<=>` is pgvector's cosine distance operator. Lower = more similar.

## Step 7 — Generate the answer

Send the question + retrieved chunks to the LLM:

```python
context = "\n\n---\n\n".join(chunk for chunk, _, _ in chunks)

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": (
            "You are a helpful assistant answering questions from the provided context. "
            "If the answer isn't in the context, say 'I don't know'. Don't make things up."
        )},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
    temperature=0.1,
)
print(completion.choices[0].message.content)
```

## Hybrid search — better than pure vector

Pure vector retrieval misses exact matches (product codes, SKUs). Combine it with keyword search:

```sql
SELECT content, metadata
FROM documents
WHERE (
    embedding <=> %s::vector < 0.3                                   -- semantic match
    OR to_tsvector('english', content) @@ to_tsquery('english', %s)  -- keyword match
)
ORDER BY (
    0.7 * (1 - (embedding <=> %s::vector))
    + 0.3 * ts_rank(to_tsvector('english', content), to_tsquery('english', %s))
) DESC
LIMIT 5;
```

## Evaluating retrieval quality

Build a small eval set — 20–50 question-answer pairs from your actual docs. Measure:

- **Recall@5** — is the right chunk in the top 5? Target >80%.
- **Precision@5** — of the top 5, how many are relevant?
- **Answer accuracy** — does the LLM give correct answers?

Tools: [RAGAS](https://github.com/explodinggradients/ragas) (Python lib) measures all of these automatically.

## Common pitfalls

- Chunks too big or too small: stay in the 500–1500 character range with 10–20% overlap.
- Re-embedding unchanged documents on every run: cache embeddings keyed by a source hash.
- Relying on pure vector search: exact identifiers (SKUs, error codes) need the keyword leg of hybrid search.
- No eval set: without 20–50 test questions, you can't tell whether a chunking or model change actually helped.

## FAQ
**Q: pgvector or Qdrant?**

pgvector if you already use Postgres and have <5M chunks. Qdrant if you need >50K queries per second, need multi-tenancy, or prefer a specialised DB. For most DomainIndia customers, pgvector wins.

**Q: How much RAM for the vector DB?**

pgvector on a 4 GB VPS handles ~1M vectors comfortably. For 10M+, go to 8 GB or use Qdrant with disk-based storage.
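As a back-of-envelope check: raw vectors live on disk in Postgres, and RAM mainly determines how much of the table and HNSW index stays cached. A rough sizing sketch (a sketch, not a capacity guarantee; the helper name is illustrative):

```python
def raw_vector_bytes(n_vectors: int, dims: int, bytes_per_float: int = 4) -> int:
    """Approximate on-disk size of the raw vectors alone
    (excludes row headers and the HNSW index, which add real overhead)."""
    return n_vectors * dims * bytes_per_float

# 1M vectors at 1536 dims (text-embedding-3-small): ~6.1 GB raw on disk
print(raw_vector_bytes(1_000_000, 1536) / 1e9)

# 1M vectors at 384 dims (a small local model): ~1.5 GB raw on disk
print(raw_vector_bytes(1_000_000, 384) / 1e9)
```

Higher-dimensional embeddings multiply both disk and cache pressure, which is why dropping to a smaller model or fewer dims stretches the same VPS much further.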

**Q: Can I use open-source embeddings instead of OpenAI?**

Yes. sentence-transformers runs locally — all-MiniLM-L6-v2 is 80 MB, decent quality, zero API cost. Needs a VPS with ~2 GB RAM for inference.

**Q: How do I stop the LLM from hallucinating even with RAG?**

(a) Strict system prompt: "If not in context, say 'I don't know'." (b) Low temperature (0–0.2). (c) Return citations (chunk source) so users can verify.
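A minimal sketch of option (c): tag each retrieved chunk with its source before building the prompt, so the model can quote the tag and users can verify (helper name and tag format are illustrative):

```python
def build_context(chunks):
    """Prefix each chunk with a [source#index] tag the LLM can cite."""
    return "\n\n".join(
        f"[{source}#{idx}] {content}"
        for source, idx, content in chunks
    )

chunks = [
    ("docs/ssl.md", 0, "Enable SSL in DirectAdmin under SSL Certificates."),
    ("docs/ssl.md", 3, "Let's Encrypt renews automatically every 60 days."),
]
ctx = build_context(chunks)
print(ctx.splitlines()[0])  # → [docs/ssl.md#0] Enable SSL in DirectAdmin under SSL Certificates.
```

Then add one line to the system prompt asking the model to cite the bracketed tags it used.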

**Q: What about Claude instead of OpenAI?**

Works identically. Use Voyage AI embeddings (Anthropic's partner) or OpenAI embeddings with Claude for generation — they interoperate.

RAG needs a VPS for the vector DB and embedding pipeline. Choose a DomainIndia VPS plan to get started.
