
# Fine-Tuning LLMs with LoRA on DomainIndia VPS

By the DomainIndia Team · 24 Apr 2026 · 5 min read
**TL;DR:** LoRA (Low-Rank Adaptation) makes it practical to fine-tune a 7B–13B model on consumer hardware or a beefy DomainIndia VPS. Instead of retraining billions of parameters, you train a tiny adapter: minutes to hours instead of days, and a ~100 MB output instead of 14 GB. This guide walks through dataset prep, training, serving, and deployment.
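To see why the adapter is so small, here is a minimal numpy sketch of LoRA's core idea (the hidden size, rank, scaling, and layer counts below are illustrative assumptions, not Llama's exact configuration): the frozen weight `W` is patched with a trainable low-rank product `B @ A`, scaled by `alpha / r`, and only `A` and `B` are trained.

```python
import numpy as np

d, r, alpha = 4096, 16, 32              # hidden size, LoRA rank, scaling (illustrative)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen base weight (never trained)
A = rng.standard_normal((r, d)) * 0.01  # trained "down" projection
B = np.zeros((d, r))                    # trained "up" projection, zero-initialized

# Effective weight after LoRA: W + (alpha/r) * B @ A.
# Because B starts at zero, training begins as an exact no-op on the base model.
W_eff = W + (alpha / r) * (B @ A)

# Trainable parameters per patched matrix vs. full fine-tuning
lora_params = A.size + B.size           # 2*d*r = 131,072
full_params = W.size                    # d*d   = 16,777,216 (~0.8% trainable)

# Rough adapter file size, assuming 32 layers x 4 patched attention
# matrices stored at 2 bytes/param:
adapter_mb = 32 * 4 * lora_params * 2 / 2**20
print(round(adapter_mb))                # ~32 MB at r=16; higher ranks / more modules reach 100+ MB
```

This is only the arithmetic of the technique; in practice the `peft` library (used below) builds and trains these matrices for you.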
## Why fine-tune instead of RAG

Before fine-tuning, ask: **does RAG solve the problem?** (See our [RAG guide](https://domainindia.com/support/kb/building-rag-system-vector-db-embeddings).)
| Need | RAG | Fine-tuning |
| --- | --- | --- |
| Inject recent facts | Best fit | Wrong tool |
| Teach specific writing style | Possible | Best fit |
| Handle domain jargon | OK | Best fit |
| Follow strict output format | Hit or miss | Best fit |
| Reduce hallucination | Good | Marginal |
| Lower per-request cost | | Yes (smaller model works) |
Fine-tune when you need the model to consistently behave a certain way (tone, format, refusals). Use RAG when you need it to know new things.

## LoRA in one paragraph

LoRA freezes the full model and trains two small matrices (rank 8-64) that "patch" each attention layer. Output: a ~50-200 MB adapter file that applies over the base model. You can keep 100 adapters on disk and swap them per-user or per-task without reloading the base.

## Hardware requirements
| Model | Full fine-tune | LoRA (4-bit QLoRA) | VPS needed |
| --- | --- | --- | --- |
| Llama 3.2 3B | 24 GB VRAM | 6 GB VRAM | GPU VPS (rented GPU hours) |
| Llama 3 8B | 60 GB VRAM | 10 GB VRAM | GPU VPS |
| Mistral 7B | 50 GB VRAM | 8 GB VRAM | GPU VPS |
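The QLoRA column can be sanity-checked with a back-of-envelope estimate. This is a rough heuristic, not a measured formula: 4-bit NF4 weights cost about 0.5 bytes per parameter, and the overhead term below is our lumped assumption for LoRA gradients/optimizer state, activations, and CUDA workspace.

```python
def qlora_vram_estimate_gb(n_params_billion, overhead_gb=3.5):
    """Back-of-envelope VRAM estimate for QLoRA training.

    4-bit weights cost ~0.5 bytes/parameter. The overhead term lumps
    together the small LoRA adapter's gradients and optimizer state,
    activations, and CUDA workspace; 3.5 GB is an assumed figure,
    not a measured constant.
    """
    weights_gb = n_params_billion * 1e9 * 0.5 / 1e9
    return weights_gb + overhead_gb

print(round(qlora_vram_estimate_gb(3), 1))  # ~5.0 GB -> table budgets 6 GB for Llama 3.2 3B
print(round(qlora_vram_estimate_gb(8), 1))  # ~7.5 GB -> table budgets 10 GB for Llama 3 8B
```

Actual usage also scales with sequence length and batch size, so leave headroom above the estimate.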
Pure CPU fine-tuning is possible but slow (days for a 7B model). For learning, rent a GPU by the hour from [Vast.ai](https://vast.ai) or [RunPod](https://runpod.io) (~$0.30/hr for an RTX 3090). For ongoing work, use a VPS with a dedicated GPU.

## Step 1 — Dataset preparation

Fine-tuning needs 100-10,000 high-quality (input, output) examples. Format them as JSONL.

`training.jsonl`:

```jsonl
{"instruction":"Write a support reply about domain transfer","input":"My .in domain is locked","output":"Hi, .in domains require you to first unlock at your current registrar and share the auth code with us. To unlock..."}
{"instruction":"Write a support reply about SSL","input":"SSL not working on my site","output":"Hi, Please verify the SSL was installed for the correct domain..."}
```

Quality > quantity: 500 excellent examples beat 10,000 mediocre ones. Good sources for data:

- Past support tickets (anonymised)
- KB articles (convert Q&A-style FAQ pairs)
- Your brand voice guide examples
- Customer-approved responses

## Step 2 — Set up the environment

On a GPU VPS (AlmaLinux 9 or Ubuntu 22.04 with the NVIDIA driver + CUDA 12):

```bash
# Python + deps
python3.12 -m venv venv
source venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install transformers datasets peft accelerate bitsandbytes trl

# Hugging Face login (for gated models like Llama)
huggingface-cli login
```

## Step 3 — Train (QLoRA)

`train.py`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"

# 4-bit quantization config (fits in 6 GB VRAM)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 12M || all params: 3B || trainable%: 0.4%

# Load dataset
def format_prompt(example):
    return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""

dataset = load_dataset("json", data_files="training.jsonl", split="train")
dataset = dataset.map(lambda e: {"text": format_prompt(e)})

# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./adapter",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=False,
        bf16=True,
        logging_steps=10,
        save_strategy="epoch",
    ),
    peft_config=lora_config,
    dataset_text_field="text",
    max_seq_length=2048,
)

trainer.train()
trainer.save_model("./adapter-final")
```

Run:

```bash
python train.py
# Watch loss decrease over epochs
# Output: ./adapter-final/adapter_model.bin (~100 MB)
```

## Step 4 — Inference with the adapter

```python
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "./adapter-final")

def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=300)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate("""### Instruction:
Reply about VPS

### Input:
How to install Docker?

### Response:
"""))
```

## Step 5 — Merge the adapter for production

To simplify serving, merge the LoRA adapter into the base weights:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("./adapter-final")
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
```

Now `./merged-model` is a standalone Llama model tuned to your domain. Upload it to Hugging Face or serve it via Ollama / vLLM (see our [self-hosting LLMs guide](https://domainindia.com/support/kb/self-hosting-llama-mistral-ollama-vllm-vps)).

## Step 6 — Evaluate

Hold out 10% of your dataset as an eval set. Measure:

- **Exact match** — output exactly matches the expected text
- **BLEU / ROUGE** — similarity scores
- **Human eval** — does a reviewer prefer the fine-tuned output?

Simple script:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

scores = []
for row in eval_set:  # eval_set: held-out list of {"prompt": ..., "expected": ...} dicts
    predicted = generate(row['prompt'])
    s = scorer.score(row['expected'], predicted)
    scores.append(s['rougeL'].fmeasure)

print(f"Average ROUGE-L: {sum(scores)/len(scores):.3f}")
```

## FAQ
**Q: How much does fine-tuning cost?**

For a 7B model with 1,000 examples: roughly $2-5 on a rented GPU. Cost scales roughly linearly with dataset size, and open-source models have no per-token licensing fees.

**Q: Can I fine-tune GPT-4 / Claude?**

OpenAI offers fine-tuning for some of its models (billed per token during training). Claude fine-tuning is not publicly offered yet. For cost-effective customisation, open-source LLMs + LoRA win.

**Q: Should I fine-tune for RAG?**

Generally no. For factual queries, RAG with a good general model outperforms a fine-tuned model without RAG. Fine-tune for style and format; use RAG for knowledge.

**Q: How often should I re-train?**

When the base model updates (a new Llama release) or your data drifts. Every 3-6 months is typical.

**Q: What about DPO / RLHF?**

These are advanced alignment techniques. Use LoRA first to see if supervised fine-tuning meets your needs; move to DPO if you need preference-based tuning.

Production LLM fine-tuning needs a GPU — rent one by the hour or order a GPU VPS.
