# Fine-Tuning LLMs with LoRA on DomainIndia VPS
**TL;DR:** LoRA (Low-Rank Adaptation) makes it practical to fine-tune a 7B-13B model on consumer hardware or a beefy DomainIndia VPS. Instead of retraining billions of parameters, you train a tiny adapter: minutes to hours instead of days, and a ~100 MB output instead of a 14 GB checkpoint. This guide walks through dataset prep, training, serving, and deployment.
## Why fine-tune instead of RAG
Before fine-tuning, ask: **does RAG solve the problem?** (See our [RAG guide](https://domainindia.com/support/kb/building-rag-system-vector-db-embeddings).)
| Need | RAG | Fine-tuning |
|------|-----|-------------|
| Inject recent facts | Best fit | Wrong tool |
| Teach a specific writing style | Possible | Best fit |
| Handle domain jargon | OK | Best fit |
| Follow a strict output format | Hit or miss | Best fit |
| Reduce hallucination | Good | Marginal |
| Lower per-request cost | — | Yes (a smaller model works) |
Fine-tune when you need the model to consistently behave a certain way (tone, format, refusals); use RAG when you need it to know new things.
## LoRA in one paragraph
LoRA freezes the full model and trains two small matrices (rank 8-64) that "patch" each attention layer. Output: a ~50-200 MB adapter file that applies over the base model. You can keep 100 adapters on disk and swap them per-user or per-task without reloading the base.
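In code, the idea is just a low-rank additive update. A toy NumPy sketch with made-up dimensions (a real Llama layer is much larger; note that B starts at zero, so training begins exactly at the base model's behaviour):

```python
import numpy as np

d, r = 1024, 16                     # hidden size (toy value), LoRA rank
alpha = 32                          # LoRA scaling factor

W = np.random.randn(d, d)           # frozen base weight: never updated
A = np.random.randn(r, d) * 0.01    # trainable, r x d (small random init)
B = np.zeros((d, r))                # trainable, d x r (zero init)

# Effective weight during the forward pass: W + (alpha / r) * B @ A
W_eff = W + (alpha / r) * (B @ A)

# The adapter file stores only A and B, never W
full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.4%}")  # → 3.1250%
```

The adapter holds 2·r·d parameters per patched matrix versus d² for the full weight, which is why adapter files stay in the tens-to-hundreds of megabytes while checkpoints run to gigabytes.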
## Hardware requirements
| Model | Full fine-tune | LoRA (4-bit QLoRA) | VPS needed |
|-------|----------------|--------------------|------------|
| Llama 3.2 3B | 24 GB VRAM | 6 GB VRAM | GPU VPS (rented GPU hours) |
| Llama 3 8B | 60 GB VRAM | 10 GB VRAM | GPU VPS |
| Mistral 7B | 50 GB VRAM | 8 GB VRAM | GPU VPS |
Pure CPU fine-tuning is possible but slow (days for a 7B model). For learning, rent GPU hours from [Vast.ai](https://vast.ai) or [RunPod](https://runpod.io) (around $0.30/hr for an RTX 3090). For ongoing work, use a VPS with a dedicated GPU.
## Step 1 — Dataset preparation
Fine-tuning needs 100-10,000 high-quality (input, output) examples. Format as JSONL:
`training.jsonl`:
```jsonl
{"instruction":"Write a support reply about domain transfer","input":"My .in domain is locked","output":"Hi, .in domains require you to first unlock at your current registrar and share the auth code with us. To unlock..."}
{"instruction":"Write a support reply about SSL","input":"SSL not working on my site","output":"Hi, Please verify the SSL was installed for the correct domain..."}
```
Quality > quantity. 500 excellent examples beat 10,000 mediocre ones.
Good sources for data:
- Past support tickets (anonymised)
- KB articles (convert FAQ question-answer pairs)
- Your brand voice guide examples
- Customer-approved responses
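Before training, it's worth a sanity pass over the JSONL. A minimal stdlib-only checker for the three fields used in the format above (the file path and the specific checks are illustrative choices):

```python
import json

REQUIRED = {"instruction", "input", "output"}

def validate_jsonl(path):
    """Count valid rows and collect per-line problems in a JSONL file."""
    ok, errors = 0, []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {i}: invalid JSON ({e.msg})")
                continue
            if not isinstance(row, dict):
                errors.append(f"line {i}: not a JSON object")
                continue
            missing = REQUIRED - row.keys()
            if missing:
                errors.append(f"line {i}: missing keys {sorted(missing)}")
            elif not str(row["output"]).strip():
                errors.append(f"line {i}: empty output")
            else:
                ok += 1
    return ok, errors
```

Run it as `ok, errs = validate_jsonl("training.jsonl")` and fix every reported line before spending GPU hours.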
## Step 2 — Setup environment
On a GPU VPS (AlmaLinux 9 or Ubuntu 22.04 with NVIDIA driver + CUDA 12):
```bash
# Python + deps
python3.12 -m venv venv
source venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install transformers datasets peft accelerate bitsandbytes trl
# Hugging Face login (for gated models like Llama)
huggingface-cli login
```
## Step 3 — Train (QLoRA)
`train.py`:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"

# 4-bit quantization config (fits in 6 GB VRAM)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# e.g. trainable params: ~12M || all params: ~3B || trainable%: ~0.4%

# Load dataset and render each row into a single prompt string
def format_prompt(example):
    return f"""### Instruction: {example['instruction']}
### Input: {example['input']}
### Response: {example['output']}"""

dataset = load_dataset("json", data_files="training.jsonl", split="train")
dataset = dataset.map(lambda e: {"text": format_prompt(e)})

# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./adapter",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=False,
        bf16=True,
        logging_steps=10,
        save_strategy="epoch",
        dataset_text_field="text",
        max_seq_length=2048,  # renamed to max_length in recent trl releases
    ),
    peft_config=lora_config,
)
trainer.train()
trainer.save_model("./adapter-final")
```
Run:
```bash
python train.py
# Watch loss decrease over epochs
# Output: ./adapter-final/adapter_model.safetensors (adapter_model.bin on older peft), ~100 MB
```
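With the settings above, the effective batch size is `per_device_train_batch_size × gradient_accumulation_steps` = 16, so the total optimiser step count is easy to estimate before you start. A quick back-of-envelope (the 1,000-example dataset size is a hypothetical):

```python
import math

examples = 1000   # hypothetical dataset size
batch = 4         # per_device_train_batch_size
accum = 4         # gradient_accumulation_steps
epochs = 3

effective_batch = batch * accum
steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)  # → 16 63 189
```

Multiply `total_steps` by the seconds-per-step you observe in the first few logging intervals to estimate wall-clock training time.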
## Step 4 — Inference with the adapter
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "./adapter-final")
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=300)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Use the same prompt template the adapter was trained on
print(generate("""### Instruction: Reply about VPS
### Input: How to install Docker?
### Response:"""))
```
## Step 5 — Merge adapter for production
To simplify serving, merge the LoRA adapter into the base weights:
```python
import torch
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

# Load base + adapter in bf16 (not 4-bit) so the merge produces full-precision weights
model = AutoPeftModelForCausalLM.from_pretrained(
    "./adapter-final", torch_dtype=torch.bfloat16
)
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer.save_pretrained("./merged-model")
```
Now `./merged-model` is a standalone Llama model tuned to your domain. Upload to Hugging Face or serve via Ollama / vLLM (see [Self-hosting LLMs guide](https://domainindia.com/support/kb/self-hosting-llama-mistral-ollama-vllm-vps)).
## Step 6 — Evaluate
Hold out 10% of your dataset as eval set. Measure:
- **Exact match** — output exactly matches expected
- **BLEU / ROUGE** — similarity scores
- **Human eval** — does a reviewer prefer fine-tuned output?
Simple script:
```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

# eval_set: held-out rows with 'prompt' and 'expected' fields;
# generate() is the inference helper from Step 4
scores = []
for row in eval_set:
    predicted = generate(row['prompt'])
    s = scorer.score(row['expected'], predicted)
    scores.append(s['rougeL'].fmeasure)

print(f"Average ROUGE-L: {sum(scores)/len(scores):.3f}")
```
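The exact-match metric from the list above is even simpler. A minimal sketch that normalises case and whitespace before comparing (one reasonable normalisation choice, not the only one):

```python
def normalize(s):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(s.lower().split())

def exact_match(expected, predicted):
    return normalize(expected) == normalize(predicted)

def exact_match_rate(pairs):
    """pairs: iterable of (expected, predicted) strings."""
    pairs = list(pairs)
    return sum(exact_match(e, p) for e, p in pairs) / len(pairs)
```

Exact match is strict, so expect low absolute scores on free-form replies; it is most useful when fine-tuning for rigid output formats.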
## Common pitfalls
- **Prompt format mismatch:** the inference prompt must use the exact template seen during training (`### Instruction / ### Input / ### Response`), or output quality collapses.
- **Overfitting:** with only a few hundred examples, 3 epochs can already be too many; watch eval loss and stop when it plateaus.
- **Catastrophic forgetting:** too high a learning rate or too many epochs erodes the model's general ability; 2e-4 with LoRA is a sane starting point.
- **Eval leakage:** keep held-out examples (and near-duplicates) out of the training set, or your scores will be inflated.
## FAQ

**Q: What does fine-tuning cost?**
For a 7B model with 1,000 examples: roughly $2-5 on a rented GPU, scaling linearly with dataset size. Open-source models carry no per-token licensing.

**Q: Can I fine-tune GPT-4 / Claude?**
OpenAI offers fine-tuning for some of its models (priced per training token). Anthropic does not currently offer public fine-tuning for Claude. For customisation at low cost, open-source LLMs + LoRA win.

**Q: Should I fine-tune for RAG?**
Generally no. RAG plus a good general model outperforms fine-tuning without RAG for factual queries. Fine-tune for style and format; use RAG for knowledge.

**Q: How often should I re-train?**
When the base model updates (a new Llama release) or your data drifts. Every 3-6 months is typical.

**Q: What about DPO / RLHF?**
These are advanced alignment techniques. Start with supervised fine-tuning via LoRA to see if it meets your need; move to DPO if you need preference-based tuning.
Production LLM fine-tuning needs a GPU: rent one by the hour or order a GPU VPS.