Fine-Tuning LLMs with LoRA on DomainIndia VPS

By Domain India Team · DomainIndia Engineering
Published 24 Apr 2026 · Updated 23 Jun 2026

TL;DR
LoRA (Low-Rank Adaptation) makes it practical to fine-tune a 7B-13B model on consumer hardware or a beefy DomainIndia VPS. Instead of retraining billions of parameters, you train a small adapter: training takes minutes to hours instead of days, and the output is roughly 100 MB instead of 14 GB. This guide walks through dataset prep, environment setup, training, inference, merging, and evaluation.

Why fine-tune instead of RAG

Before fine-tuning, ask: does RAG solve the problem? (See our RAG guide.)

Need                          | RAG         | Fine-tuning
Inject recent facts           | Best fit    | Wrong tool
Teach specific writing style  | Possible    | Best fit
Handle domain jargon          | OK          | Best fit
Follow strict output format   | Hit or miss | Best fit
Reduce hallucination          | Good        | Marginal
Lower per-request cost        | -           | Yes (smaller model works)

Fine-tune when you need the model to behave consistently in a certain way (tone, format, refusals). Use RAG when you need it to know new things.

LoRA in one paragraph

LoRA freezes the full model and trains a pair of small low-rank matrices (rank 8-64) alongside each targeted weight matrix, typically the attention projections. Their product acts as a "patch" added to the frozen weights. The output is a ~50-200 MB adapter file that is applied on top of the base model. You can keep 100 adapters on disk and swap them per user or per task without reloading the base.
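
To make the size difference concrete, here is a minimal sketch of the parameter arithmetic for one adapted layer. The 4096 x 4096 layer size and rank 16 are illustrative assumptions, not figures for any specific model.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix that the adapter patches."""
    return d_in * d_out


# Hypothetical 4096 x 4096 attention projection with a rank-16 adapter.
d, r = 4096, 16
adapter = lora_param_count(d, d, r)
frozen = full_param_count(d, d)
print(f"adapter: {adapter:,} params, frozen: {frozen:,} params")
print(f"adapter is {100 * adapter / frozen:.2f}% of the layer")
```

Because the adapter cost grows with the rank rather than with the full layer size, doubling the rank doubles the adapter but leaves it a small fraction of the base weights.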

Hardware requirements

Model        | Full fine-tune | LoRA (4-bit QLoRA) | VPS needed
Llama 3.2 3B | 24 GB VRAM     | 6 GB VRAM          | GPU VPS (rented GPU hours)
Llama 3 8B   | 60 GB VRAM     | 10 GB VRAM         | GPU VPS
Mistral 7B   | 50 GB VRAM     | 8 GB VRAM          | GPU VPS

Pure CPU fine-tuning is possible but slow (days for a 7B model). For learning, rent GPU time by the hour from Vast.ai or RunPod (~$0.30/hr for an RTX 3090). For ongoing work, use a VPS with a dedicated GPU.
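
As a rough sanity check on VRAM figures like these, the memory taken by the weights alone can be estimated from the parameter count and the bits per parameter. This is a back-of-the-envelope sketch: it ignores gradients, optimizer state, activations, and quantization overhead, which is why real requirements are higher than the numbers it prints. The rounded parameter counts are assumptions.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Rounded parameter counts, for illustration only.
for name, n_params in [("3B", 3e9), ("7B", 7e9), ("8B", 8e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    q4 = weight_memory_gb(n_params, 4)
    print(f"{name}: {fp16:.1f} GB in 16-bit, {q4:.1f} GB in 4-bit")
```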

Step 1 — Dataset preparation

Fine-tuning needs 100-10,000 high-quality (input, output) examples. Format as JSONL:

training.jsonl:

jsonl
{"instruction":"Write a support reply about domain transfer","input":"My .in domain is locked","output":"Hi, .in domains require you to first unlock at your current registrar and share the auth code with us. To unlock..."}
{"instruction":"Write a support reply about SSL","input":"SSL not working on my site","output":"Hi, Please verify the SSL was installed for the correct domain..."}

Quality > quantity. 500 excellent examples beat 10,000 mediocre ones.
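
Before training, it helps to check that every line parses and carries the three fields used above. The following is a minimal sketch of such a check; the function name and the rule that only "input" may be empty are assumptions for illustration, not part of any library.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; empty means no problems."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str):
                problems.append((number, f"missing or non-string field: {key}"))
            elif key != "input" and not value.strip():
                problems.append((number, f"empty field: {key}"))
    return problems


sample = [
    '{"instruction": "Write a support reply about SSL", "input": "SSL not working", "output": "Hi, ..."}',
    '{"instruction": "Missing output", "input": ""}',
    "not json",
]
for number, problem in validate_jsonl_lines(sample):
    print(f"line {number}: {problem}")
```

To check a real file, pass the open file object instead of the sample list, for example `validate_jsonl_lines(open("training.jsonl", encoding="utf-8"))`.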

Good sources for data:

  • Past support tickets (anonymised)
  • KB articles (converted into Q&A-style pairs)
  • Your brand voice guide examples
  • Customer-approved responses

Step 2 — Setup environment

On a GPU VPS (AlmaLinux 9 or Ubuntu 22.04 with NVIDIA driver + CUDA 12):

bash
# Python + deps
python3.12 -m venv venv
source venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install transformers datasets peft accelerate bitsandbytes trl

# Hugging Face login (for gated models like Llama)
huggingface-cli login
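
After installing, a quick check that the packages are importable and that PyTorch sees the GPU can save a failed training run. This is a small sketch; the helper name `missing_packages` is made up for illustration.

```python
import importlib.util

REQUIRED = ["torch", "transformers", "datasets", "peft", "accelerate", "bitsandbytes", "trl"]


def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    import torch

    # False here usually points to a driver/CUDA mismatch or a CPU-only torch wheel.
    print("All packages found. CUDA available:", torch.cuda.is_available())
```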

Step 3 — Train (QLoRA)

train.py:

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"

# 4-bit quantization config (fits in 6 GB VRAM)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output (approximate): trainable params ~9M || all params ~3.2B || trainable% ~0.3

# Load dataset
def format_prompt(example):
    return f"""### Instruction: {example['instruction']}
### Input: {example['input']}
### Response: {example['output']}"""

dataset = load_dataset("json", data_files="training.jsonl", split="train")
dataset = dataset.map(lambda e: {"text": format_prompt(e)})

# Train
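trainer = SFTTrainer(
    model=model,  # already wrapped by get_peft_model(), so no peft_config here
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./adapter",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch size: 4 x 4 = 16
        learning_rate=2e-4,
        fp16=False,
        bf16=True,  # needs an Ampere-or-newer GPU (e.g. RTX 3090)
        logging_steps=10,
        save_strategy="epoch",
        # Dataset options belong in SFTConfig, not in the SFTTrainer call
        dataset_text_field="text",
        max_seq_length=2048,  # named max_length in newer TRL releases
    ),
)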

trainer.train()
trainer.save_model("./adapter-final")

Run:

bash
python train.py
# Watch loss decrease over epochs
# Output: ./adapter-final/adapter_model.safetensors (adapter_model.bin on older PEFT versions; tens of MB for this configuration)
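
To know roughly how many optimizer steps (and therefore how many loss log lines) to expect, you can work it out from the batch settings in train.py. This is an approximation under the assumption of a single GPU and a hypothetical 1,000-example dataset; the trainer's own rounding may differ by a step.

```python
import math


def optimizer_steps(n_examples, per_device_batch, grad_accum, epochs, n_gpus=1):
    """Return (effective_batch_size, approximate_total_optimizer_steps)."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Settings from train.py, with a hypothetical dataset of 1,000 examples.
effective_batch, total_steps = optimizer_steps(
    n_examples=1000, per_device_batch=4, grad_accum=4, epochs=3
)
print(f"effective batch size: {effective_batch}")
print(f"approximate optimizer steps: {total_steps}")
```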

Step 4 — Inference with the adapter

python
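import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"

# Same 4-bit config as in training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
# Attach the trained adapter on top of the frozen base model
model = PeftModel.from_pretrained(model, "./adapter-final")

def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=300)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Use the same prompt layout as format_prompt() in train.py
print(generate("""### Instruction: Reply about VPS
### Input: How to install Docker?
### Response:"""))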

Step 5 — Merge adapter for production

To simplify serving, merge the LoRA adapter into the base weights:

python
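import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load base + adapter unquantized (bf16) so the merged weights are not 4-bit
model = AutoPeftModelForCausalLM.from_pretrained("./adapter-final", torch_dtype=torch.bfloat16)
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")

# Save the tokenizer alongside the weights so the folder is self-contained
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer.save_pretrained("./merged-model")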

Now ./merged-model is a standalone Llama model tuned to your domain. Upload to Hugging Face or serve via Ollama / vLLM (see Self-hosting LLMs guide).
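
Conceptually, merging folds the adapter into the weights as W' = W + (alpha / r) * B * A, so the merged model gives the same output as base plus adapter without the extra matrix multiplications. The toy example below illustrates this with plain Python lists; the matrices and values are made up, and the real merge operates on the model's tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge(W, A, B, alpha, r):
    """Fold a LoRA pair into the frozen weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]


# Toy 3x3 weight with a rank-1 adapter (illustrative values).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
A = [[1.0, 2.0, 0.0]]      # r x d_in
B = [[0.5], [0.0], [1.0]]  # d_out x r
alpha, r = 2, 1
x = [1.0, 1.0, 1.0]

# Adapter kept separate: base output plus the scaled low-rank correction.
scale = alpha / r
unmerged = [b + scale * l for b, l in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Adapter merged into the weights: a single matrix multiply.
merged = matvec(merge(W, A, B, alpha, r), x)

print(unmerged, merged)
```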

Step 6 — Evaluate

Hold out 10% of your dataset as an eval set. Measure:

  • Exact match — output exactly matches expected
  • BLEU / ROUGE — similarity scores
  • Human eval — does a reviewer prefer fine-tuned output?
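
One simple way to create the held-out set is a seeded shuffle followed by a slice, so the split is reproducible between runs. This is a sketch; the function name and the seed are arbitrary choices.

```python
import random


def train_eval_split(rows, eval_fraction=0.1, seed=42):
    """Shuffle deterministically and return (train_rows, eval_rows)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_eval = max(1, round(len(rows) * eval_fraction))
    return rows[n_eval:], rows[:n_eval]


# Placeholder rows standing in for the parsed lines of training.jsonl.
examples = [{"id": i} for i in range(50)]
train_rows, eval_rows = train_eval_split(examples)
print(len(train_rows), "train rows,", len(eval_rows), "eval rows")
```

Do the split before training and train only on the first part; evaluating on rows the model has already seen inflates every metric.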

Simple script:

python
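from rouge_score import rouge_scorer  # pip install rouge-score

# eval_set: held-out rows as dicts with 'prompt' and 'expected' keys;
# generate() is the helper defined in Step 4.
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = []
for row in eval_set:
    predicted = generate(row['prompt'])
    s = scorer.score(row['expected'], predicted)  # score(target, prediction)
    scores.append(s['rougeL'].fmeasure)
print(f"Average ROUGE-L: {sum(scores)/len(scores):.3f}")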
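
The exact-match metric from the list above needs no extra library. The sketch below normalizes case and whitespace before comparing, which is an assumption about what should count as a match; tighten or loosen it to suit your output format.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def exact_match_rate(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Illustrative strings, not real model output.
preds = ["Hi,  please unlock the domain.", "Restart the server."]
refs = ["hi, please unlock the domain.", "Reinstall the SSL certificate."]
print(f"Exact match: {exact_match_rate(preds, refs):.2f}")
```

Exact match is most useful for strict-format outputs such as JSON or fixed templates; for free-form replies, ROUGE or human review tells you more.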

Common pitfalls

FAQ

Q Fine-tuning cost?

For a 7B model with 1,000 examples: roughly $2-5 on a rented GPU. Cost scales roughly linearly with dataset size and number of epochs. Open-source models have no per-token licensing fees.

Q Can I fine-tune GPT-4 / Claude?

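OpenAI offers fine-tuning for selected models, billed per token during training, which gets expensive. Claude fine-tuning is not offered through Anthropic's public API at the time of writing. For cost-effective customisation, open-source LLMs plus LoRA win.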

Q Should I fine-tune for RAG?

Generally no. RAG + a good general model outperforms fine-tuned-without-RAG for factual queries. Fine-tune for style/format; use RAG for knowledge.

Q How often to re-train?

Retrain when the base model is updated (for example, a new Llama release) or when your data drifts. Every 3-6 months is typical.

Q What about DPO / RLHF?

These are advanced alignment techniques. Start with supervised fine-tuning using LoRA and see whether it meets your needs; move to DPO only if you need preference-based tuning.

Production LLM fine-tuning needs a GPU: rent one by the hour or order a GPU VPS.
