By Domain India Team · DomainIndia Engineering
6 min read · 24 Apr 2026
# Production Observability on DomainIndia VPS: Prometheus, Grafana, and Loki
TL;DR
You can't fix what you can't see. This guide sets up a full observability stack on a DomainIndia VPS: Prometheus for metrics, Grafana for dashboards, and Loki for logs. It also covers alerting with Alertmanager and uptime monitoring that pages you before customers complain.
## The three pillars of observability
| Pillar | Tool | Answers |
|---|---|---|
| Metrics | Prometheus | "How fast? How many? How often?" |
| Logs | Loki | "What happened? What did it say?" |
| Traces | Jaeger / Tempo | "Where did this slow request spend its time?" |

Start with metrics and logs. Add traces once you run a microservices architecture where a single request spans multiple services.

## What to monitor

The **Four Golden Signals** (Google SRE):

1. **Latency**: how long requests take (p50, p95, p99)
2. **Traffic**: requests per second
3. **Errors**: failure rate
4. **Saturation**: resource utilisation (CPU, RAM, disk, queue depth)

Plus infrastructure:

- CPU / RAM / disk / network per VPS
- Database connections, slow query count
- Cache hit rate
- Queue size (Sidekiq, BullMQ)
- External API latency

## Option A: Self-hosted stack (VPS)

For a DomainIndia VPS setup, install Prometheus, Grafana, and Loki either on the same VPS as your application or on a dedicated monitoring VPS.

### Step 1: Install Prometheus

```bash
# Download and unpack the release, then move it to /opt/prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar xzf prometheus-2.52.0.linux-amd64.tar.gz
sudo mv prometheus-2.52.0.linux-amd64 /opt/prometheus

# Run Prometheus as a dedicated system user
sudo useradd -r prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
```

`/opt/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100', 'vps2.internal:9100', 'vps3.internal:9100']

  - job_name: 'app'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app-vps:8080']

  - job_name: 'postgres'
    static_configs:
      - targets: ['db-vps:9187']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - 'alerts.yml'
```

systemd service `/etc/systemd/system/prometheus.service`:

```ini
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --web.listen-address=127.0.0.1:9090 \
  --storage.tsdb.retention.time=30d
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Reload systemd and start the service with `sudo systemctl daemon-reload && sudo systemctl enable --now prometheus`.

### Step 2: Node Exporter (per VPS)

On every VPS you want to monitor:

```bash
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xzf node_exporter-*.tar.gz
sudo mv node_exporter-*/node_exporter /usr/local/bin/
sudo useradd -r node_exporter
```

systemd unit (for example `/etc/systemd/system/node_exporter.service`):

```ini
[Unit]
Description=Node Exporter

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Node Exporter exposes CPU, RAM, disk, network, and filesystem metrics out of the box, more than 100 in total. systemd unit status is also available if you add the `--collector.systemd` flag. Because the exporter listens on all interfaces, restrict port 9100 in your firewall so that only the Prometheus server can reach it.

### Step 3: Install Grafana

```bash
# AlmaLinux
sudo dnf install -y https://dl.grafana.com/oss/release/grafana-10.4.0-1.x86_64.rpm

# Ubuntu (apt-key is deprecated, so store the key in a keyring file instead)
sudo mkdir -p /etc/apt/keyrings
wget -qO - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update && sudo apt install -y grafana

# Both: start Grafana now and on every boot
sudo systemctl enable --now grafana-server
```

Open `http://your-vps-ip:3000` and log in with the default credentials (admin/admin); Grafana asks you to change the password on first login.

Add a Prometheus datasource with the URL `http://localhost:9090`.

Import dashboards: #1860 (Node Exporter Full), #9628 (PostgreSQL), #7587 (nginx).

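If you would rather script the datasource step than click through the UI, Grafana's HTTP API accepts a `POST /api/datasources` request. The following is a minimal sketch, assuming Grafana is reachable at `http://localhost:3000` and that basic auth with the admin account is enabled; the helper names `datasource_payload` and `build_request` and the `CHANGE_ME` password are illustrative, not part of any library.

```python
import base64
import json
import urllib.request


def datasource_payload(name: str, url: str, is_default: bool = True) -> dict:
    """Build the JSON body for Grafana's POST /api/datasources endpoint."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",  # Grafana's backend queries Prometheus, not the browser
        "isDefault": is_default,
    }


def build_request(grafana_url: str, user: str, password: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the authenticated API request."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    payload = datasource_payload("Prometheus", "http://localhost:9090")
    # CHANGE_ME is a placeholder: use the password you set on first login.
    request = build_request("http://localhost:3000", "admin", "CHANGE_ME", payload)
    print(request.get_method(), request.full_url)
    # Uncomment to actually create the datasource:
    # urllib.request.urlopen(request)
```

The request is only built, not sent, so the script is safe to run as a dry check before pointing it at a live Grafana instance.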
### Step 4: Install Loki + Promtail

Loki (log aggregator):

```bash
# Loki server
wget https://github.com/grafana/loki/releases/download/v3.0.0/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki
# Minimal config: /etc/loki/config.yml
```

Promtail (log shipper, runs on each VPS):

```yaml
# /etc/promtail/config.yml
server:
  http_listen_port: 9080

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://loki.yourcompany.com:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          host: vps1
          __path__: /var/log/*log

  - job_name: app
    static_configs:
      - targets: [localhost]
        labels:
          job: app
          host: vps1
          __path__: /home/app/logs/*.log
```

systemd for Promtail:

```ini
[Service]
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Add a Loki datasource in Grafana (`http://localhost:3100`). Query logs with LogQL:

```logql
{job="app"} |= "error"
```

## Option B: Grafana Cloud Free Tier

If VPS resources are tight, Grafana Cloud has a free plan:

- 10K metrics series
- 50 GB logs
- 14-day retention
- Unlimited dashboards

Install Grafana Alloy (the successor to the now-deprecated `grafana-agent`) on your VPS and point it at Grafana Cloud to get observability without the self-hosting burden.

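The 10K series limit is easier to judge with a rough estimate of how many series your setup produces. The sketch below is back-of-the-envelope arithmetic, not a measurement: the figure of 1,000 series per Node Exporter host and the 240 label combinations (for example 20 routes × 3 methods × 4 status codes) are assumptions to replace with your own numbers, and it ignores runtime default metrics. The histogram count follows from how Prometheus histograms work: one series per configured bucket, one for the implicit `+Inf` bucket, plus `_sum` and `_count`.

```python
FREE_TIER_SERIES = 10_000  # free-plan limit quoted in the list above


def histogram_series(label_combinations: int, buckets: int) -> int:
    """Series produced by one Prometheus histogram.

    Per label combination: one series per finite bucket, one for the
    implicit +Inf bucket, plus the _sum and _count series.
    """
    return label_combinations * (buckets + 1 + 2)


def estimate_series(hosts: int, series_per_host: int,
                    label_combinations: int, buckets: int) -> int:
    """Rough upper bound: host metrics + one counter + one histogram."""
    infra = hosts * series_per_host          # Node Exporter series (assumed figure)
    counter = label_combinations             # e.g. http_requests_total
    histogram = histogram_series(label_combinations, buckets)
    return infra + counter + histogram


if __name__ == "__main__":
    # Assumed inputs: 3 hosts, ~1000 series each, 240 label combinations,
    # and an 11-bucket histogram like the one in the Node.js example in Step 5.
    total = estimate_series(hosts=3, series_per_host=1000,
                            label_combinations=240, buckets=11)
    print(total, total <= FREE_TIER_SERIES)
```

Label cardinality dominates the result: doubling the number of distinct `route` values doubles the application's share of the series count, which is a common way to exceed a series quota.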
## Step 5: Instrumenting your app

**Node.js:**

```javascript
import express from 'express';
import client from 'prom-client';

const register = new client.Registry();
client.collectDefaultMetrics({ register });

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [register],
});

const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

const app = express();

// Record count and duration for every request once the response has been sent.
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Prefer the route pattern (e.g. /users/:id) to keep label cardinality low.
    const labels = { method: req.method, route: req.route?.path || req.path, status: res.statusCode };
    httpRequestsTotal.inc(labels);
    httpDuration.observe(labels, duration);
  });
  next();
});

// Endpoint scraped by the 'app' job in prometheus.yml.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});

// Port matches the 'app-vps:8080' scrape target.
app.listen(8080);
```

**Python (FastAPI):**

```python
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)  # adds a /metrics endpoint
```

**PHP:**

```bash
composer require promphp/prometheus_client_php
```

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\Redis;

// Redis keeps counters shared across PHP worker processes.
$adapter = new Redis(['host' => 'localhost']);
$registry = new CollectorRegistry($adapter);

$counter = $registry->getOrRegisterCounter('app', 'requests_total', 'Total requests', ['route']);
$counter->inc(['/api/users']);

// At the /metrics endpoint: render the samples in the Prometheus text format.
$renderer = new RenderTextFormat();
header('Content-Type: ' . RenderTextFormat::MIME_TYPE);
echo $renderer->render($registry->getMetricFamilySamples());
```

## Step 6: Alertmanager

Alert when things break.

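Alert rules pair an expression with a `for:` duration, which keeps short spikes from paging anyone. As a simplified model of that behaviour (not Prometheus's actual code), the sketch below walks a series of evaluation results: an alert is `pending` from the first evaluation where the condition holds, becomes `firing` once it has held continuously for the `for:` duration, and resets to `inactive` as soon as one evaluation fails. The function name `alert_states` and the sample values are illustrative.

```python
def alert_states(values, threshold, for_seconds, interval_seconds):
    """Return the alert state at each evaluation of `value > threshold`.

    values: one sample per evaluation, spaced interval_seconds apart.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, value in enumerate(values):
        now = i * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # Any failing evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    cpu = [50, 85, 90, 70, 85, 88, 92, 95]  # made-up CPU percentages
    for state in alert_states(cpu, threshold=80, for_seconds=30, interval_seconds=15):
        print(state)
```

In this run the first spike (85, 90) never fires because the dip to 70 resets the timer; only the second, sustained stretch reaches `firing`. A longer `for:` trades slower detection for fewer false alarms.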
`/opt/prometheus/alerts.yml`:

```yaml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "CPU > 80% on {{ $labels.instance }}"

      - alert: DiskFull
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Less than 10% disk space free on {{ $labels.instance }}"

      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, per route.
        # A bare rate() > 0.05 would mean 0.05 errors per second, not 5%.
        expr: |
          sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (route) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Error rate > 5% on {{ $labels.route }}"

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is DOWN"
```

Alertmanager config `/etc/alertmanager/alertmanager.yml`:

```yaml
route:
  receiver: default
  group_by: ['alertname', 'instance']
  group_wait: 30s

receivers:
  - name: default
    email_configs:
      - to: 'oncall@example.com'          # placeholder: replace with your address
        from: 'alertmanager@example.com'  # placeholder: replace with your address
        smarthost: 'smtp.gmail.com:587'
        auth_username: '...'
        auth_password: '...'
    # Slack incoming webhooks need slack_configs; a generic webhook_configs
    # entry sends Alertmanager's own JSON format, which Slack rejects.
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
```

## Step 7: Uptime monitoring

As a bonus, add an external uptime check. A self-hosted option is **Uptime Kuma**:

```bash
docker run -d --restart=always -p 3001:3001 -v uptime-kuma:/app/data --name uptime-kuma louislam/uptime-kuma:1
```

Run it on a different server from the ones it checks; otherwise an outage takes the monitor down too. Configure HTTP checks, keyword matching, SSL certificate expiry, and DNS monitors, with alerts sent via Slack, email, Telegram, or SMS.

## Common pitfalls

## FAQ

**Q: How much RAM does this stack need?**

For a small deployment (1-3 monitored VPS, 30-day retention), a 2 GB VPS is enough. For a larger one (10+ VPS, long retention), use a dedicated monitoring VPS with 4 GB or more.

**Q: Datadog/New Relic or self-host?**

Self-host if you have the ops capacity and predictable cost matters. Choose a managed service (Datadog, New Relic, Grafana Cloud) if you want zero ops and are willing to pay per host or per metric.

**Q: What about APM (application performance monitoring)?**

OpenTelemetry is the standard. Tempo (traces), Loki (logs), and Prometheus (metrics) integrate natively in Grafana, so you can instrument your app once and query everything in one place.

**Q: How do I monitor Cloudflare/edge?**

The Cloudflare dashboard has its own analytics. For a combined view, export events with Cloudflare Logpush and forward them into your Loki instance.

**Q: Do I need this for a small website?**

At small scale, Uptime Kuma, Node Exporter, and a few basic dashboards are enough. Scale the stack as you grow.
