Production Observability on DomainIndia VPS: Prometheus, Grafana, and Loki
The three pillars of observability
| Pillar | Tool | Answers |
|---|---|---|
| Metrics | Prometheus | "How fast? How many? How often?" |
| Logs | Loki | "What happened? What did it say?" |
| Traces | Jaeger / Tempo | "Where did this slow request spend its time?" |
Start with metrics + logs. Add traces when you have a microservices architecture where single-request paths span multiple services.
What to monitor
The Four Golden Signals (Google SRE):
- Latency — how long requests take (p50, p95, p99)
- Traffic — requests per second
- Errors — failure rate
- Saturation — resource utilisation (CPU, RAM, disk, queue depth)
Plus infrastructure:
- CPU / RAM / disk / network per VPS
- Database connections, slow query count
- Cache hit rate
- Queue size (Sidekiq, BullMQ)
- External API latency
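As a rough illustration of what the latency, traffic, and error signals mean numerically, the sketch below derives them from a batch of request records. The `Request` record, the `golden_signals` helper, and the choice to count only 5xx responses as errors are assumptions made for this example; in production Prometheus computes these from counters and histograms.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time the request took, in seconds
    status: int        # HTTP status code returned


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_s):
    """Summarise latency, traffic and errors for the requests seen in one time window."""
    durations = [r.duration_s for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: only 5xx counts
    return {
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
        "rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
    }
```

Saturation is left out because it comes from host-level metrics (CPU, RAM, disk, queue depth) rather than from request records.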
Option A — Self-host stack (VPS)
For a DomainIndia VPS setup, install Prometheus, Grafana, and Loki either on the same VPS as your application or on a dedicated monitoring VPS.
Step 1 — Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar xzf prometheus-2.52.0.linux-amd64.tar.gz
sudo mv prometheus-2.52.0.linux-amd64 /opt/prometheus
sudo useradd -r prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
/opt/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100', 'vps2.internal:9100', 'vps3.internal:9100']
  - job_name: 'app'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app-vps:8080']
  - job_name: 'postgres'
    static_configs:
      - targets: ['db-vps:9187']
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  - 'alerts.yml'
systemd service /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus
After=network.target
[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --web.listen-address=127.0.0.1:9090 \
  --storage.tsdb.retention.time=30d
Restart=on-failure
[Install]
WantedBy=multi-user.target
Step 2 — Node Exporter (per VPS)
On every VPS you want to monitor:
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xzf node_exporter-*.tar.gz
sudo mv node_exporter-*/node_exporter /usr/local/bin/
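sudo useradd -r node_exporter
systemd:
[Unit]
Description=Node Exporter
[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
Restart=on-failure
[Install]
WantedBy=multi-user.target
Node Exporter exposes CPU, RAM, disk, network, and filesystem metrics (plus systemd unit status if you enable the systemd collector with --collector.systemd): 100+ metrics out of the box.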
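To make the exporter's output less abstract, here is a minimal sketch of reading the text format served at `/metrics`. The `parse_metrics` helper and the sample values are illustrative assumptions, and the parser is deliberately simplified: it ignores timestamps and is not a full implementation of the Prometheus exposition format.

```python
import re

# One sample line: metric name, optional {label="value",...} block, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical excerpt of what a scrape of localhost:9100/metrics might return.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1000.5
node_cpu_seconds_total{cpu="0",mode="user"} 200.25
node_load1 0.42
"""


def parse_metrics(text):
    """Yield (name, labels, value) per sample line, skipping # HELP / # TYPE comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a line this simplified parser understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Each distinct combination of metric name and label values is one time series, which is the unit Prometheus stores and Grafana Cloud bills against.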
Step 3 — Install Grafana
# AlmaLinux
sudo dnf install -y https://dl.grafana.com/oss/release/grafana-10.4.0-1.x86_64.rpm
# Ubuntu
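# apt-key is deprecated; store the signing key in a keyring file instead
sudo mkdir -p /etc/apt/keyrings
wget -qO - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update && sudo apt install -y grafana
# Either distribution: start Grafana now and on every boot
sudo systemctl enable --now grafana-server
Access http://your-vps-ip:3000 (default login admin/admin; change the password on first login).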
Add Prometheus datasource: http://localhost:9090.
Import dashboards: #1860 (Node Exporter Full), #9628 (PostgreSQL), #7587 (nginx).
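If you would rather script the datasource step than click through the UI, Grafana's HTTP API accepts a POST to `/api/datasources`. The sketch below only builds the request; the `build_datasource_request` helper and the credentials are placeholders invented for this example, and it assumes basic auth is enabled on your Grafana instance.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password, prometheus_url):
    """Build (but do not send) a POST to Grafana's /api/datasources endpoint."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend fetches the data, not the browser
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder credentials; replace with the admin password you set on first login.
req = build_datasource_request(
    "http://localhost:3000", "admin", "change-me", "http://localhost:9090"
)
print(req.get_method(), req.full_url)
# To actually create the datasource: urllib.request.urlopen(req)
```

Because Prometheus listens on 127.0.0.1 in the unit file above, this only works when Grafana runs on the same VPS as Prometheus.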
Step 4 — Install Loki + Promtail
Loki (log aggregator):
# Loki server
wget https://github.com/grafana/loki/releases/download/v3.0.0/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki
# Minimal config: /etc/loki/config.yml
Promtail (log shipper, runs on each VPS):
# /etc/promtail/config.yml
server:
  http_listen_port: 9080
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: http://loki.yourcompany.com:3100/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          host: vps1
          __path__: /var/log/*log
  - job_name: app
    static_configs:
      - targets: [localhost]
        labels:
          job: app
          host: vps1
          __path__: /home/app/logs/*.log
systemd for Promtail:
[Service]
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yml
Restart=on-failure
Add Loki datasource in Grafana (http://localhost:3100). Query logs with LogQL:
{job="app"} |= "error"
Option B — Grafana Cloud Free Tier
If VPS resources are tight, Grafana Cloud has a free plan:
- 10K metrics series
- 50 GB logs
- 14-day retention
- Unlimited dashboards
Install Grafana Agent (or its successor, Grafana Alloy) on your VPS and point it at Grafana Cloud to get observability without the self-hosting burden.
Step 5 — Instrumenting your app
Node.js:
import express from 'express';
import client from 'prom-client';

// Dedicated registry, plus the default Node.js process metrics (heap, event loop, GC)
const register = new client.Registry();
client.collectDefaultMetrics({ register });

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [register],
});

const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

const app = express();

// Record one counter increment and one duration observation per finished response
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const labels = { method: req.method, route: req.route?.path || req.path, status: res.statusCode };
    httpRequestsTotal.inc(labels);
    httpDuration.observe(labels, duration);
  });
  next();
});

// Endpoint scraped by the 'app' job in prometheus.yml
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});

// Port matches the app-vps:8080 scrape target
app.listen(8080);
Python (FastAPI):
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
# Adds request metrics and serves them at /metrics
Instrumentator().instrument(app).expose(app)
PHP:
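composer require promphp/prometheus_client_php
use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\Redis;
// Redis keeps counters shared across PHP worker processes
$adapter = new Redis(['host' => 'localhost']);
$registry = new CollectorRegistry($adapter);
$counter = $registry->getOrRegisterCounter('app', 'requests_total', 'Total requests', ['route']);
$counter->inc(['/api/users']);
// At /metrics endpoint: render the samples in the Prometheus text format
$renderer = new RenderTextFormat();
header('Content-Type: ' . RenderTextFormat::MIME_TYPE);
echo $renderer->render($registry->getMetricFamilySamples());
Step 6 — Alertmanager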
Alert when things break. /opt/prometheus/alerts.yml:
groups:
  - name: infrastructure
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "CPU > 80% on {{ $labels.instance }}"
      - alert: DiskFull
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Disk < 10% on {{ $labels.instance }}"
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses per route, not an absolute requests-per-second rate
        expr: sum by(route) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(route) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Error rate > 5% on {{ $labels.route }}"
      - alert: PodDown
        expr: up == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is DOWN"
Alertmanager config /etc/alertmanager/alertmanager.yml:
route:
  receiver: default
  group_by: ['alertname', 'instance']
  group_wait: 30s
receivers:
  - name: default
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.gmail.com:587'
        auth_username: '...'
        auth_password: '...'
    # Slack incoming webhooks need slack_configs; a generic webhook_configs payload is not in Slack's format
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
Step 7 — Uptime monitoring
Bonus: add an external uptime check. One self-hosted option is Uptime Kuma:
docker run -d --restart=always -p 3001:3001 -v uptime-kuma:/app/data --name uptime-kuma louislam/uptime-kuma:1
Configure HTTP checks, keyword search, SSL certificate expiry, and DNS checks, with alerts via Slack, email, Telegram, or SMS.
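For a sense of what an HTTP keyword check and a certificate-expiry check do under the hood, here is a minimal sketch using only the Python standard library. The function names are invented for this example and this is not how Uptime Kuma itself is implemented.

```python
import ssl
import urllib.error
import urllib.request
from datetime import datetime, timezone


def check_response(status, body, keyword):
    """Return (ok, reason) for one HTTP probe result."""
    if not 200 <= status < 400:
        return False, f"unexpected status {status}"
    if keyword not in body:
        return False, f"keyword {keyword!r} missing"
    return True, "ok"


def probe(url, keyword, timeout=10):
    """Fetch url and verify both the status code and the presence of keyword."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return check_response(resp.status, body, keyword)
    except urllib.error.HTTPError as exc:  # 4xx/5xx responses raise in urllib
        return check_response(exc.code, "", keyword)
    except OSError as exc:  # DNS failure, refused connection, timeout
        return False, f"connection failed: {exc}"


def cert_days_left(not_after, now):
    """Whole days until expiry, given the notAfter string from ssl getpeercert()."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days
```

A check like this is only a true external check when it runs from a different machine than the one it watches; otherwise a host outage takes the monitor down with it.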
Common pitfalls
FAQ
Small deployment (1-3 VPS monitored, 30d retention): 2 GB VPS is enough. Larger (10+ VPS, long retention): 4+ GB dedicated monitoring VPS.
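Disk is worth estimating alongside RAM. The Prometheus documentation gives a rough formula: retention time in seconds, times ingested samples per second, times bytes per sample (about 1 to 2 bytes on average). The sketch below applies it; the `tsdb_disk_bytes` helper and the figure of roughly 1,000 series per VPS are assumptions for illustration, so check your own series count before relying on the result.

```python
def tsdb_disk_bytes(retention_days, active_series, scrape_interval_s, bytes_per_sample=2):
    """Rough Prometheus TSDB size: retention seconds * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    return retention_days * 86400 * samples_per_second * bytes_per_sample


# Assumed workload: 3 VPS at ~1,000 series each, scraped every 15s, kept for 30 days.
estimate = tsdb_disk_bytes(retention_days=30, active_series=3000, scrape_interval_s=15)
print(f"{estimate / 1e9:.2f} GB")
```

Under those assumptions the 30-day metrics store comes to about 1 GB, which fits comfortably on a small VPS; Loki's log storage is separate and depends on log volume.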
Self-host if you have the ops capacity and want predictable costs. Choose a managed service (Datadog, New Relic, Grafana Cloud) if you want zero ops and are willing to pay per host or per metric.
OpenTelemetry is the standard. Tempo (traces) + Loki (logs) + Prometheus (metrics) integrate natively in Grafana. Instrument your app once, query everything.
Cloudflare dashboard has its own analytics. For combined view, use Cloudflare's Logpush to send events to your Loki.
At small scale: just Uptime Kuma + Node Exporter + basic dashboards is enough. Scale the stack as you grow.