Production Observability — Prometheus + Grafana + Loki on VPS

# Production Observability on DomainIndia VPS: Prometheus, Grafana, and Loki

TL;DR

You can't fix what you can't see. This guide sets up a full observability stack on a DomainIndia VPS: Prometheus for metrics, Grafana for dashboards, and Loki for logs. It also covers alerting with Alertmanager and uptime monitoring, so you get paged before customers complain.

## The three pillars of observability

| Pillar | Tool | Answers |
|---|---|---|
| Metrics | Prometheus | "How fast? How many? How often?" |
| Logs | Loki | "What happened? What did it say?" |
| Traces | Jaeger / Tempo | "Where did this slow request spend its time?" |

Start with metrics + logs. Add traces once you run a microservices architecture in which a single request crosses multiple services.

## What to monitor

The **Four Golden Signals** (Google SRE):

1. **Latency**: how long requests take (p50, p95, p99)
2. **Traffic**: requests per second
3. **Errors**: failure rate
4. **Saturation**: resource utilisation (CPU, RAM, disk, queue depth)

Plus infrastructure:

- CPU / RAM / disk / network per VPS
- Database connections, slow query count
- Cache hit rate
- Queue size (Sidekiq, BullMQ)
- External API latency

## Option A: Self-hosted stack (VPS)

On a DomainIndia VPS, install Prometheus, Grafana, and Loki either on the same VPS as your application or on a dedicated monitoring VPS.

### Step 1: Install Prometheus

```bash
wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar xzf prometheus-2.52.0.linux-amd64.tar.gz
sudo mv prometheus-2.52.0.linux-amd64 /opt/prometheus
sudo useradd -r prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
```

`/opt/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100', 'vps2.internal:9100', 'vps3.internal:9100']

  - job_name: 'app'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app-vps:8080']

  - job_name: 'postgres'
    static_configs:
      - targets: ['db-vps:9187']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - 'alerts.yml'
```

systemd service `/etc/systemd/system/prometheus.service`:

```ini
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --web.listen-address=127.0.0.1:9090 \
  --storage.tsdb.retention.time=30d
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

### Step 2: Node Exporter (per VPS)

On every VPS you want to monitor:

```bash
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xzf node_exporter-*.tar.gz
sudo mv node_exporter-*/node_exporter /usr/local/bin/
sudo useradd -r node_exporter
```

systemd unit (the `[Install]` section is needed for `systemctl enable` to work):

```ini
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

This exposes CPU, RAM, disk, network, and filesystem metrics, 100+ of them out of the box. systemd unit state is also available once you start the exporter with `--collector.systemd`, which is disabled by default.

### Step 3: Install Grafana

```bash
# AlmaLinux
sudo dnf install -y https://dl.grafana.com/oss/release/grafana-10.4.0-1.x86_64.rpm

# Ubuntu (apt-key is deprecated, so the key goes into a dedicated keyring)
sudo mkdir -p /etc/apt/keyrings
wget -qO - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update && sudo apt install -y grafana

# Both distributions
sudo systemctl enable --now grafana-server
```

Open `http://your-vps-ip:3000` and log in with the default `admin` / `admin` credentials; change the password on first login.

Add Prometheus as a data source with the URL `http://localhost:9090`, then import dashboards by ID: 1860 (Node Exporter Full), 9628 (PostgreSQL), 7587 (nginx).
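Imported dashboards stay empty if Prometheus is not actually scraping its targets, so it can be worth checking the built-in `up` metric through the Prometheus HTTP query API first. The following is a minimal sketch, assuming Prometheus is reachable at `http://localhost:9090` (the same address used for the data source). The helper names `down_targets` and `fetch_up` and the sample response are hypothetical, written for illustration only.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: same address as the Grafana data source

# Hand-written sample in the shape an instant query for `up` returns.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "prometheus", "instance": "localhost:9090"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "node", "instance": "localhost:9100"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "node", "instance": "vps2.internal:9100"},
             "value": [1700000000.0, "0"]},
        ],
    },
}


def down_targets(response: dict) -> list:
    """Return 'job/instance' for every series in the result whose value is 0."""
    if response.get("status") != "success":
        raise ValueError("query failed: %s" % response.get("error", "unknown error"))
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # instant vectors carry [unix_ts, "value"]
        if float(value) == 0:
            labels = series["metric"]
            down.append("%s/%s" % (labels.get("job", "?"), labels.get("instance", "?")))
    return sorted(down)


def fetch_up(base_url: str = PROMETHEUS_URL) -> dict:
    """Run the instant query `up` against a live Prometheus and decode the JSON reply."""
    query = urllib.parse.urlencode({"query": "up"})
    with urllib.request.urlopen("%s/api/v1/query?%s" % (base_url, query), timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Swap SAMPLE_RESPONSE for fetch_up() when running on the monitoring VPS.
    print(down_targets(SAMPLE_RESPONSE))  # ['node/vps2.internal:9100']
```

An empty list means every configured target answered its last scrape; anything listed is worth fixing before you spend time on dashboards.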
### Step 4: Install Loki + Promtail

Loki (log aggregator):

```bash
# Loki server
wget https://github.com/grafana/loki/releases/download/v3.0.0/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki
# Minimal config: /etc/loki/config.yml
```

Promtail (log shipper, runs on each VPS):

```yaml
# /etc/promtail/config.yml
server:
  http_listen_port: 9080

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://loki.yourcompany.com:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          host: vps1
          __path__: /var/log/*log

  - job_name: app
    static_configs:
      - targets: [localhost]
        labels:
          job: app
          host: vps1
          __path__: /home/app/logs/*.log
```

systemd unit for Promtail:

```ini
[Unit]
Description=Promtail
After=network.target

[Service]
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Add Loki as a data source in Grafana (`http://localhost:3100`). Query logs with LogQL:

```logql
{job="app"} |= "error"
```

## Option B: Grafana Cloud Free Tier

If VPS resources are tight, Grafana Cloud has a free plan:

- 10K metrics series
- 50 GB logs
- 14-day retention
- Unlimited dashboards

Install `grafana-agent` (or its successor, Grafana Alloy) on your VPS and point it at Grafana Cloud. You get the same observability without the self-hosting burden.
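Whichever option you pick, logs reach Loki through the same `/loki/api/v1/push` endpoint that the Promtail config in Step 4 targets, and seeing the request body makes the label model easier to reason about. The sketch below builds that JSON body by hand and is useful mainly for smoke-testing a fresh Loki install. It assumes an unauthenticated Loki at `http://localhost:3100`; the function names `build_push_payload` and `push` are hypothetical.

```python
import json
import time
import urllib.request
from typing import Dict, List, Optional

LOKI_PUSH_URL = "http://localhost:3100/loki/api/v1/push"  # assumption: local, unauthenticated Loki


def build_push_payload(labels: Dict[str, str], lines: List[str],
                       now_ns: Optional[int] = None) -> dict:
    """Build a push body: one stream, identified by its labels, holding the given lines."""
    base = now_ns if now_ns is not None else time.time_ns()
    # Each entry is [timestamp in nanoseconds as a string, log line].
    # One nanosecond is added per line so the entries keep their order.
    values = [[str(base + offset), line] for offset, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


def push(payload: dict, url: str = LOKI_PUSH_URL) -> int:
    """POST the payload to Loki and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


if __name__ == "__main__":
    body = build_push_payload({"job": "app", "host": "vps1"}, ["smoke test: error path"])
    print(json.dumps(body, indent=2))
    # push(body)  # uncomment on a machine that can reach Loki
```

After a successful push, the LogQL query `{job="app"} |= "error"` in Grafana should return the test line. Keep label sets small and stable (job, host); putting per-request values into labels creates a new stream for each value.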
## Step 5: Instrumenting your app

**Node.js:**

```javascript
import express from 'express';
import client from 'prom-client';

const register = new client.Registry();
client.collectDefaultMetrics({ register });

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [register],
});

const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

const app = express();

// Record a count and a duration for every finished request
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Prefer the route pattern; raw paths can create one series per URL
    const labels = { method: req.method, route: req.route?.path || req.path, status: res.statusCode };
    httpRequestsTotal.inc(labels);
    httpDuration.observe(labels, duration);
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});

// Port matches the 'app' scrape target in prometheus.yml
app.listen(8080);
```

**Python (FastAPI):**

```python
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)
```

**PHP:**

```bash
composer require promphp/prometheus_client_php
```

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\Redis;

$adapter = new Redis(['host' => 'localhost']);
$registry = new CollectorRegistry($adapter);

$counter = $registry->getOrRegisterCounter('app', 'requests_total', 'Total requests', ['route']);
$counter->inc(['/api/users']);

// At the /metrics endpoint: render the samples in the Prometheus text format
$renderer = new RenderTextFormat();
header('Content-Type: ' . RenderTextFormat::MIME_TYPE);
echo $renderer->render($registry->getMetricFamilySamples());
```

## Step 6: Alertmanager

Alert when things break.
`/opt/prometheus/alerts.yml`:

```yaml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "CPU > 80% on {{ $labels.instance }}"

      - alert: DiskFull
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Disk < 10% on {{ $labels.instance }}"

      - alert: HighErrorRate
        # Share of 5xx responses among all responses, per route.
        # A bare rate(...) > 0.05 would mean 0.05 errors per second, not 5%.
        expr: |
          sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (route) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Error rate > 5% on {{ $labels.route }}"

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is DOWN"
```

Alertmanager config `/etc/alertmanager/alertmanager.yml`:

```yaml
route:
  receiver: default
  group_by: ['alertname', 'instance']
  group_wait: 30s

receivers:
  - name: default
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.gmail.com:587'
        auth_username: '...'
        auth_password: '...'
    # Slack incoming webhooks need slack_configs; a generic webhook_configs
    # entry posts Alertmanager's own JSON format, which Slack does not accept.
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
```

## Step 7: Uptime monitoring

Bonus: an external uptime check. The self-hosted option is **Uptime Kuma**. Run it on a different server from the ones it checks, otherwise an outage takes the monitor down with it.

```bash
docker run -d --restart=always -p 3001:3001 -v uptime-kuma:/app/data --name uptime-kuma louislam/uptime-kuma:1
```

Configure HTTP checks, keyword search, SSL certificate expiry, and DNS checks. Alerts go out via Slack, email, Telegram, or SMS.

## Common pitfalls

## FAQ

Q: How much RAM does this stack need?

A small deployment (1-3 monitored VPSs, 30-day retention) fits on a 2 GB VPS. Larger setups (10+ VPSs, long retention) need a dedicated monitoring VPS with 4 GB or more.
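Retention affects disk as well as RAM. The Prometheus documentation gives a rough sizing rule: retention time in seconds × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample after compression. The sketch below applies that rule to the 15s scrape interval and 30d retention used in this guide. The figure of 3,000 active series is a placeholder assumption, not a measurement, and `estimate_disk_bytes` is a hypothetical helper name.

```python
def estimate_disk_bytes(active_series: int, scrape_interval_s: float,
                        retention_days: float, bytes_per_sample: float = 2.0) -> int:
    """Rough Prometheus TSDB size: retention seconds x samples per second x bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return round(retention_seconds * samples_per_second * bytes_per_sample)


if __name__ == "__main__":
    # Placeholder assumption: about 3,000 active series across the fleet.
    size = estimate_disk_bytes(active_series=3000, scrape_interval_s=15, retention_days=30)
    print("%.2f GB" % (size / 1e9))  # 1.04 GB
```

To replace the placeholder with a real number, check `prometheus_tsdb_head_series` on a running server.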

Q: Datadog/New Relic or self-host?

Self-host if you have the ops capacity and need predictable costs. Go managed (Datadog, New Relic, Grafana Cloud) if you want zero ops work and are willing to pay per host or per metric.

Q: What about APM (application performance monitoring)?

OpenTelemetry is the standard. Tempo (traces) + Loki (logs) + Prometheus (metrics) integrate natively in Grafana. Instrument your app once, query everything.

Q: How do I monitor Cloudflare/edge?

The Cloudflare dashboard has its own analytics. For a combined view, use Cloudflare Logpush to send events to your Loki instance.

Q: Do I need this for a small website?

Not all of it. At small scale, Uptime Kuma, Node Exporter, and a few basic dashboards are enough. Scale the stack as you grow.
