# Fine-Tuning LLMs with LoRA on DomainIndia VPS
**TL;DR:** LoRA (Low-Rank Adaptation) makes it practical to fine-tune a 7B-13B model on consumer hardware or a beefy DomainIndia VPS. Instead of retraining billions of parameters, you train a tiny adapter: minutes to hours instead of days, and a ~100 MB output instead of a 14 GB checkpoint. This guide walks through dataset prep, training, serving, and deployment.
## Why fine-tune instead of RAG
Before fine-tuning, ask: **does RAG solve the problem?** (See our [RAG guide](https://domainindia.com/support/kb/building-rag-system-vector-db-embeddings).)
| Need | RAG | Fine-tuning |
|------|-----|-------------|
| Inject recent facts | Best fit | Wrong tool |
| Teach a specific writing style | Possible | Best fit |
| Handle domain jargon | OK | Best fit |
| Follow a strict output format | Hit or miss | Best fit |
| Reduce hallucination | Good | Marginal |
| Lower per-request cost | — | Yes (a smaller model works) |
Fine-tune when you need the model to consistently behave a certain way (tone, format, refusals); use RAG when you need it to know new things.
## LoRA in one paragraph
LoRA freezes the full model and trains two small matrices (rank 8-64) that "patch" each attention layer. Output: a ~50-200 MB adapter file that applies over the base model. You can keep 100 adapters on disk and swap them per-user or per-task without reloading the base.
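In code, the idea is just a low-rank additive update. A toy NumPy sketch with made-up dimensions (a real Llama layer is much larger; note that B starts at zero, so training begins exactly at the base model's behaviour):

```python
import numpy as np

d, r = 1024, 16                     # hidden size (toy value), LoRA rank
alpha = 32                          # LoRA scaling factor

W = np.random.randn(d, d)           # frozen base weight: never updated
A = np.random.randn(r, d) * 0.01    # trainable, r x d (small random init)
B = np.zeros((d, r))                # trainable, d x r (zero init)

# Effective weight during the forward pass: W + (alpha / r) * B @ A
W_eff = W + (alpha / r) * (B @ A)

# The adapter file stores only A and B, never W
full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.4%}")  # → 3.1250%
```

The adapter holds 2·r·d parameters per patched matrix versus d² for the full weight, which is why adapter files stay in the tens-to-hundreds of megabytes while checkpoints run to gigabytes.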
## Hardware requirements
| Model | Full fine-tune | LoRA (4-bit QLoRA) | VPS needed |
|-------|----------------|--------------------|------------|
| Llama 3.2 3B | 24 GB VRAM | 6 GB VRAM | GPU VPS (rented GPU hours) |
| Llama 3 8B | 60 GB VRAM | 10 GB VRAM | GPU VPS |
| Mistral 7B | 50 GB VRAM | 8 GB VRAM | GPU VPS |
Pure CPU fine-tuning is possible but slow (days for a 7B model). For learning, rent GPU hours from [Vast.ai](https://vast.ai) or [RunPod](https://runpod.io) (around $0.30/hr for an RTX 3090). For ongoing work, use a VPS with a dedicated GPU.
## Step 1 — Dataset preparation
Fine-tuning needs 100-10,000 high-quality (input, output) examples. Format as JSONL:
`training.jsonl`:
```jsonl
{"instruction":"Write a support reply about domain transfer","input":"My .in domain is locked","output":"Hi, .in domains require you to first unlock at your current registrar and share the auth code with us. To unlock..."}
{"instruction":"Write a support reply about SSL","input":"SSL not working on my site","output":"Hi, Please verify the SSL was installed for the correct domain..."}
```
Quality > quantity. 500 excellent examples beat 10,000 mediocre ones.
Good sources for data:
- Past support tickets (anonymised)
- KB articles (convert FAQ question-answer pairs)
- Your brand voice guide examples
- Customer-approved responses
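Before training, it's worth a sanity pass over the JSONL. A minimal stdlib-only checker for the three fields used in the format above (the file path and the specific checks are illustrative choices):

```python
import json

REQUIRED = {"instruction", "input", "output"}

def validate_jsonl(path):
    """Count valid rows and collect per-line problems in a JSONL file."""
    ok, errors = 0, []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {i}: invalid JSON ({e.msg})")
                continue
            if not isinstance(row, dict):
                errors.append(f"line {i}: not a JSON object")
                continue
            missing = REQUIRED - row.keys()
            if missing:
                errors.append(f"line {i}: missing keys {sorted(missing)}")
            elif not str(row["output"]).strip():
                errors.append(f"line {i}: empty output")
            else:
                ok += 1
    return ok, errors
```

Run it as `ok, errs = validate_jsonl("training.jsonl")` and fix every reported line before spending GPU hours.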
## Step 2 — Setup environment
On a GPU VPS (AlmaLinux 9 or Ubuntu 22.04 with NVIDIA driver + CUDA 12):
```bash
# Python + deps
python3.12 -m venv venv
source venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install transformers datasets peft accelerate bitsandbytes trl
# Hugging Face login (for gated models like Llama)
huggingface-cli login
```
## Step 3 — Train (QLoRA)
`train.py`:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"

# 4-bit quantization config (fits in 6 GB VRAM)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# e.g. trainable params: ~12M || all params: ~3B || trainable%: ~0.4%

# Load dataset and render each row into a single prompt string
def format_prompt(example):
    return f"""### Instruction: {example['instruction']}
### Input: {example['input']}
### Response: {example['output']}"""

dataset = load_dataset("json", data_files="training.jsonl", split="train")
dataset = dataset.map(lambda e: {"text": format_prompt(e)})

# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./adapter",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=False,
        bf16=True,
        logging_steps=10,
        save_strategy="epoch",
        dataset_text_field="text",
        max_seq_length=2048,  # renamed to max_length in recent trl releases
    ),
    peft_config=lora_config,
)
trainer.train()
trainer.save_model("./adapter-final")
```
Run:
```bash
python train.py
# Watch loss decrease over epochs
# Output: ./adapter-final/adapter_model.safetensors (adapter_model.bin on older peft), ~100 MB
```
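With the settings above, the effective batch size is `per_device_train_batch_size × gradient_accumulation_steps` = 16, so the total optimiser step count is easy to estimate before you start. A quick back-of-envelope (the 1,000-example dataset size is a hypothetical):

```python
import math

examples = 1000   # hypothetical dataset size
batch = 4         # per_device_train_batch_size
accum = 4         # gradient_accumulation_steps
epochs = 3

effective_batch = batch * accum
steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)  # → 16 63 189
```

Multiply `total_steps` by the seconds-per-step you observe in the first few logging intervals to estimate wall-clock training time.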
## Step 4 — Inference with the adapter
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "./adapter-final")
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=300)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Use the same prompt template the adapter was trained on
print(generate("""### Instruction: Reply about VPS
### Input: How to install Docker?
### Response:"""))
```
## Step 5 — Merge adapter for production
To simplify serving, merge the LoRA adapter into the base weights:
```python
import torch
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

# Load base + adapter in bf16 (not 4-bit) so the merge produces full-precision weights
model = AutoPeftModelForCausalLM.from_pretrained(
    "./adapter-final", torch_dtype=torch.bfloat16
)
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer.save_pretrained("./merged-model")
```
Now `./merged-model` is a standalone Llama model tuned to your domain. Upload to Hugging Face or serve via Ollama / vLLM (see [Self-hosting LLMs guide](https://domainindia.com/support/kb/self-hosting-llama-mistral-ollama-vllm-vps)).
## Step 6 — Evaluate
Hold out 10% of your dataset as eval set. Measure:
- **Exact match** — output exactly matches expected
- **BLEU / ROUGE** — similarity scores
- **Human eval** — does a reviewer prefer fine-tuned output?
Simple script:
```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

# eval_set: held-out rows with 'prompt' and 'expected' fields;
# generate() is the inference helper from Step 4
scores = []
for row in eval_set:
    predicted = generate(row['prompt'])
    s = scorer.score(row['expected'], predicted)
    scores.append(s['rougeL'].fmeasure)

print(f"Average ROUGE-L: {sum(scores)/len(scores):.3f}")
```
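The exact-match metric from the list above is even simpler. A minimal sketch that normalises case and whitespace before comparing (one reasonable normalisation choice, not the only one):

```python
def normalize(s):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(s.lower().split())

def exact_match(expected, predicted):
    return normalize(expected) == normalize(predicted)

def exact_match_rate(pairs):
    """pairs: iterable of (expected, predicted) strings."""
    pairs = list(pairs)
    return sum(exact_match(e, p) for e, p in pairs) / len(pairs)
```

Exact match is strict, so expect low absolute scores on free-form replies; it is most useful when fine-tuning for rigid output formats.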
## Common pitfalls
- **Prompt format mismatch:** the inference prompt must use the exact template seen during training (`### Instruction / ### Input / ### Response`), or output quality collapses.
- **Overfitting:** with only a few hundred examples, 3 epochs can already be too many; watch eval loss and stop when it plateaus.
- **Catastrophic forgetting:** too high a learning rate or too many epochs erodes the model's general ability; 2e-4 with LoRA is a sane starting point.
- **Eval leakage:** keep held-out examples (and near-duplicates) out of the training set, or your scores will be inflated.
## FAQ

**Q: What does fine-tuning cost?**
For a 7B model with 1,000 examples: roughly $2-5 on a rented GPU, scaling linearly with dataset size. Open-source models carry no per-token licensing.

**Q: Can I fine-tune GPT-4 / Claude?**
OpenAI offers fine-tuning for some of its models (priced per training token). Anthropic does not currently offer public fine-tuning for Claude. For customisation at low cost, open-source LLMs + LoRA win.

**Q: Should I fine-tune for RAG?**
Generally no. RAG plus a good general model outperforms fine-tuning without RAG for factual queries. Fine-tune for style and format; use RAG for knowledge.

**Q: How often should I re-train?**
When the base model updates (a new Llama release) or your data drifts. Every 3-6 months is typical.

**Q: What about DPO / RLHF?**
These are advanced alignment techniques. Start with supervised fine-tuning via LoRA to see if it meets your need; move to DPO if you need preference-based tuning.
Production LLM fine-tuning needs a GPU: rent one by the hour or order a GPU VPS.