Production Observability on DomainIndia VPS: Prometheus, Grafana, and Loki

By Domain India Team · DomainIndia Engineering
6 min read · Published 24 Apr 2026 · Updated 23 Jun 2026 · 210 views

TL;DR
You can't fix what you can't see. This guide sets up a full observability stack on a DomainIndia VPS: Prometheus for metrics, Grafana for dashboards, and Loki for logs. It also covers alerting with Alertmanager and uptime monitoring that pages you before customers complain.

The three pillars of observability

Pillar  | Tool           | Answers
Metrics | Prometheus     | "How fast? How many? How often?"
Logs    | Loki           | "What happened? What did it say?"
Traces  | Jaeger / Tempo | "Where did this slow request spend its time?"

Start with metrics + logs. Add traces when you have a microservices architecture where single-request paths span multiple services.

What to monitor

The Four Golden Signals (Google SRE):

  1. Latency — how long requests take (p50, p95, p99)
  2. Traffic — requests per second
  3. Errors — failure rate
  4. Saturation — resource utilisation (CPU, RAM, disk, queue depth)

Plus infrastructure:

  • CPU / RAM / disk / network per VPS
  • Database connections, slow query count
  • Cache hit rate
  • Queue size (Sidekiq, BullMQ)
  • External API latency
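
To make the first three golden signals concrete, here is a minimal Python sketch that derives latency percentiles, traffic, and error ratio from a batch of request records. The `Request` record, the nearest-rank percentile method, and the "status >= 500 counts as an error" rule are illustrative assumptions, not something Prometheus prescribes; saturation is left out because it comes from host metrics rather than request logs.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list (p is 1..100)."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def golden_signals(requests, window_s):
    """Summarise latency, traffic, and errors for one time window."""
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    return {
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
        "rps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
    }


if __name__ == "__main__":
    # 100 hypothetical requests in a 10-second window, 10 ms to 1 s, two of them failing.
    sample = [Request(duration_s=i / 100, status=500 if i > 98 else 200) for i in range(1, 101)]
    print(golden_signals(sample, window_s=10))
```

In production Prometheus computes these from counters and histograms instead of raw records, but the definitions are the same.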

Option A — Self-host stack (VPS)

For a DomainIndia VPS setup, install Prometheus, Grafana, and Loki either on the same VPS as your application or on a dedicated monitoring VPS.

Step 1 — Install Prometheus

bash
wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar xzf prometheus-2.52.0.linux-amd64.tar.gz
sudo mv prometheus-2.52.0.linux-amd64 /opt/prometheus
sudo useradd -r prometheus
sudo chown -R prometheus:prometheus /opt/prometheus

/opt/prometheus/prometheus.yml:

yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100', 'vps2.internal:9100', 'vps3.internal:9100']

  - job_name: 'app'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app-vps:8080']

  - job_name: 'postgres'
    static_configs:
      - targets: ['db-vps:9187']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - 'alerts.yml'

systemd service /etc/systemd/system/prometheus.service:

ini
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus \
    --config.file=/opt/prometheus/prometheus.yml \
    --storage.tsdb.path=/opt/prometheus/data \
    --web.listen-address=127.0.0.1:9090 \
    --storage.tsdb.retention.time=30d
Restart=on-failure

[Install]
WantedBy=multi-user.target
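
The 30-day retention and 15-second scrape interval above largely determine how much disk the data directory needs. As a rough sketch, the usual rule of thumb is retention seconds × samples ingested per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression. The series count below (three VPS at about 1,500 series each) is a made-up figure for illustration; check your own count before sizing a disk.

```python
def estimate_tsdb_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough TSDB size: retention seconds x samples ingested per second x bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical fleet: 4,500 active series, scraped every 15s, kept for 30 days.
    size = estimate_tsdb_bytes(active_series=4500, scrape_interval_s=15, retention_days=30)
    print(f"{size / 1e9:.2f} GB")  # prints "1.56 GB"
```

Halving the scrape interval or doubling the retention doubles the estimate, so this is a quick way to compare configurations before committing to one.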

Step 2 — Node Exporter (per VPS)

On every VPS you want to monitor:

bash
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xzf node_exporter-*.tar.gz
sudo mv node_exporter-*/node_exporter /usr/local/bin/
sudo useradd -r node_exporter

systemd:

ini
[Unit]
Description=Prometheus Node Exporter

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
Restart=on-failure

[Install]
WantedBy=multi-user.target

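Out of the box this exposes CPU, RAM, disk, network, and filesystem metrics (100+ in total). systemd unit status is also available, but only when Node Exporter is started with --collector.systemd.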
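
Once Node Exporter is running on each VPS, one way to confirm that Prometheus is actually scraping it is the targets endpoint of the Prometheus HTTP API. The sketch below assumes Prometheus listens on 127.0.0.1:9090 as configured earlier; the function names are illustrative.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (job, instance, lastError) for every active target that is not 'up'."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target["health"] != "up":
            labels = target["labels"]
            problems.append((labels.get("job"), labels.get("instance"), target.get("lastError", "")))
    return problems


def fetch_targets(base_url="http://127.0.0.1:9090"):
    """Fetch the target list from Prometheus (run this on the monitoring VPS)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

Calling `unhealthy_targets(fetch_targets())` on the monitoring VPS returns an empty list when every scrape target is healthy; anything else usually points to a firewall rule or a stopped exporter.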

Step 3 — Install Grafana

bash
# AlmaLinux
sudo dnf install -y https://dl.grafana.com/oss/release/grafana-10.4.0-1.x86_64.rpm

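# Ubuntu (apt-key is deprecated; keep the key in a dedicated keyring instead)
sudo mkdir -p /etc/apt/keyrings
wget -qO - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update && sudo apt install -y grafana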

sudo systemctl enable --now grafana-server

Access http://your-vps-ip:3000 (default admin/admin — change on first login).

Add Prometheus datasource: http://localhost:9090.

Import dashboards: #1860 (Node Exporter Full), #9628 (PostgreSQL), #7587 (nginx).

Step 4 — Install Loki + Promtail

Loki (log aggregator):

bash
# Loki server
wget https://github.com/grafana/loki/releases/download/v3.0.0/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki

# Minimal config: /etc/loki/config.yml

Promtail (log shipper, runs on each VPS):

yaml
# /etc/promtail/config.yml
server:
  http_listen_port: 9080

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://loki.yourcompany.com:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          host: vps1
          __path__: /var/log/*log

  - job_name: app
    static_configs:
      - targets: [localhost]
        labels:
          job: app
          host: vps1
          __path__: /home/app/logs/*.log

systemd for Promtail:

ini
[Service]
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

Add Loki datasource in Grafana (http://localhost:3100). Query logs with LogQL:

logql
{job="app"} |= "error"
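
The same LogQL query can be run outside Grafana through Loki's HTTP API, which is handy for scripts and cron jobs. This is a sketch under the assumption that Loki is reachable at http://localhost:3100, the address used for the Grafana datasource; the helper names are illustrative.

```python
import json
import time
import urllib.parse
import urllib.request


def build_query_url(base_url, logql, start_ns, end_ns, limit=100):
    """Build a Loki query_range URL; start/end are Unix timestamps in nanoseconds."""
    params = urllib.parse.urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base_url}/loki/api/v1/query_range?{params}"


def extract_lines(payload):
    """Flatten a 'streams' result into (timestamp_ns, labels, line) tuples, oldest first."""
    rows = []
    for stream in payload["data"]["result"]:
        for ts, line in stream["values"]:
            rows.append((int(ts), stream["stream"], line))
    return sorted(rows, key=lambda row: row[0])


def recent_errors(base_url="http://localhost:3100", minutes=15):
    """Fetch app log lines containing 'error' from the last few minutes."""
    end_ns = time.time_ns()
    start_ns = end_ns - minutes * 60 * 1_000_000_000
    url = build_query_url(base_url, '{job="app"} |= "error"', start_ns, end_ns)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_lines(json.load(resp))
```

`recent_errors()` returns one tuple per matching log line, so the output can be piped into a report or a chat notification.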

Option B — Grafana Cloud Free Tier

If VPS resources are tight, Grafana Cloud has a free plan:

  • 10K metrics series
  • 50 GB logs
  • 14-day retention
  • Unlimited dashboards

Install Grafana Alloy (the successor to the now-deprecated grafana-agent) on your VPS and point it at Grafana Cloud for observability without the self-hosting burden.

Step 5 — Instrumenting your app

Node.js:

javascript
import express from 'express';
import client from 'prom-client';

const register = new client.Registry();
client.collectDefaultMetrics({ register });

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [register],
});

const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

const app = express();

app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
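    // Use the matched route pattern; raw paths for unmatched requests would create unbounded label values
    const labels = { method: req.method, route: req.route?.path || 'unmatched', status: res.statusCode };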
    httpRequestsTotal.inc(labels);
    httpDuration.observe(labels, duration);
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});

// Port 8080 matches the 'app' scrape target in prometheus.yml
app.listen(8080);

Python (FastAPI):

python
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

Instrumentator().instrument(app).expose(app)

PHP:

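bash
composer require promphp/prometheus_client_php

php
use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\Redis;

// Redis-backed storage so counters persist across PHP requests
$adapter = new Redis(['host' => 'localhost']);
$registry = new CollectorRegistry($adapter);

$counter = $registry->getOrRegisterCounter('app', 'requests_total', 'Total requests', ['route']);
$counter->inc(['/api/users']);

// At /metrics endpoint: render the samples in Prometheus text format
$renderer = new RenderTextFormat();
header('Content-Type: ' . RenderTextFormat::MIME_TYPE);
echo $renderer->render($registry->getMetricFamilySamples());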
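
Whatever language the app is written in, the /metrics endpoint serves the same plain-text exposition format. To sanity-check an endpoint before pointing Prometheus at it, a small parser is enough. The sketch below is deliberately simplified: it handles the common `name{labels} value` shape, skips comment lines, and does not unescape label values, so treat it as a debugging aid and not a full implementation of the format.

```python
import re

# name, optional {labels}, value, optional timestamp
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no samples
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"unparseable line: {line!r}")
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Feeding it the body returned by `curl http://app-vps:8080/metrics` should yield one tuple per sample; a `ValueError` indicates the endpoint is emitting something Prometheus is unlikely to accept.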

Step 6 — Alertmanager

Alert when things break. /opt/prometheus/alerts.yml:

yaml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "CPU > 80% on {{ $labels.instance }}"

      - alert: DiskFull
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Disk < 10% on {{ $labels.instance }}"

      - alert: HighErrorRate
        expr: sum by(route) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(route) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Error rate > 5% on {{ $labels.route }}"

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is DOWN"
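
The HighCPU expression is easier to trust once the arithmetic is spelled out. `node_cpu_seconds_total{mode="idle"}` is a per-core counter of seconds spent idle, `rate(...[5m])` turns it into the fraction of each second spent idle, and `100 - avg * 100` converts the average idle fraction into a utilisation percentage. The sketch below reproduces that calculation for a single instance from two hypothetical counter readings; it ignores counter resets and extrapolation, which Prometheus handles for you.

```python
def cpu_utilisation_percent(idle_start, idle_end, elapsed_s):
    """Mirror of 100 - avg(rate(idle)) * 100 for one instance.

    idle_start / idle_end map CPU core -> idle-seconds counter value at the
    start and end of the window; elapsed_s is the window length in seconds.
    """
    idle_rates = [(idle_end[cpu] - idle_start[cpu]) / elapsed_s for cpu in idle_start]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


if __name__ == "__main__":
    # Hypothetical 2-core VPS over a 5-minute (300 s) window:
    # core 0 was idle for 30 s, core 1 for 60 s, so the average idle fraction is 0.15.
    start = {"0": 1000.0, "1": 2000.0}
    end = {"0": 1030.0, "1": 2060.0}
    print(round(cpu_utilisation_percent(start, end, 300), 1))  # prints 85.0
```

An 85% result is above the 80 threshold, so the alert would fire if the condition held for the full 10 minutes set by `for`.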

Alertmanager config /etc/alertmanager/alertmanager.yml:

yaml
route:
  receiver: default
  group_by: ['alertname', 'instance']
  group_wait: 30s

receivers:
  - name: default
    email_configs:
      - to: 'oncall@example.com'      # placeholder: your alert recipient
        from: 'alerts@example.com'    # placeholder: your sender address
        smarthost: 'smtp.gmail.com:587'
        auth_username: '...'
        auth_password: '...'
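    # Slack incoming webhooks need slack_configs; webhook_configs posts Alertmanager's own JSON format
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'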
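
Before relying on the receiver above, it helps to push a synthetic alert through Alertmanager and confirm a notification arrives. The sketch below posts one alert to the Alertmanager v2 API; it assumes Alertmanager listens on localhost:9093 as referenced in prometheus.yml, and the alert name `RoutingTest` is an invented label for this test only.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(instance, severity="warning", minutes=5):
    """Build the JSON body for POST /api/v2/alerts: a list holding one synthetic alert."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": "RoutingTest", "instance": instance, "severity": severity},
        "annotations": {"summary": f"Test alert for {instance}, safe to ignore"},
        "startsAt": now.isoformat(),
        # Alertmanager treats the alert as resolved once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alert(alerts, base_url="http://localhost:9093"):
    """Post alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

Running `send_alert(build_test_alert("vps1:9100"))` on the monitoring VPS should produce a notification after the 30-second `group_wait`; if nothing arrives, check the Alertmanager logs for SMTP or Slack errors.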

Step 7 — Uptime monitoring

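Bonus: an external uptime check. A self-hosted option is Uptime Kuma; run it on a different VPS from the one it monitors so an outage does not take the checker down with it: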

bash
docker run -d --restart=always -p 3001:3001 -v uptime-kuma:/app/data --name uptime-kuma louislam/uptime-kuma:1

Configure: HTTP checks, keyword search, SSL cert expiry, DNS. Alerts via Slack/email/Telegram/SMS.
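
For intuition about what an HTTP check with keyword search does, here is a minimal Python sketch of the idea. It is not how Uptime Kuma is implemented; the function names and the "2xx or 3xx means up" rule are assumptions chosen for illustration.

```python
import urllib.error
import urllib.request


def evaluate_check(status_code, body, keyword=None):
    """Return (ok, reason) for an HTTP check with an optional keyword match."""
    if not 200 <= status_code < 400:
        return False, f"unexpected status {status_code}"
    if keyword is not None and keyword not in body:
        return False, f"keyword {keyword!r} not found"
    return True, "ok"


def http_check(url, keyword=None, timeout=10):
    """Fetch a URL and evaluate it; network errors and timeouts count as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate_check(resp.status, body, keyword)
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as HTTPError; report them by status code.
        return evaluate_check(exc.code, "", keyword)
    except OSError as exc:
        # Covers DNS failures, refused connections, TLS errors, and timeouts.
        return False, f"request failed: {exc}"
```

The keyword check matters because a page can return 200 while showing an error message or a blank template; matching on text that only appears on a healthy page catches that case.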

Common pitfalls

FAQ

Q How much RAM for this stack?

Small deployment (1-3 VPS monitored, 30d retention): a 2 GB VPS is enough. Larger (10+ VPS, long retention): a dedicated monitoring VPS with 4+ GB.

Q Datadog/New Relic or self-host?

Self-host if you have the ops capacity and predictable cost is important. Go managed (Datadog, New Relic, Grafana Cloud) if you want zero ops and are willing to pay per host or per metric.

Q APM — application performance monitoring?

OpenTelemetry is the standard. Tempo (traces) + Loki (logs) + Prometheus (metrics) integrate natively in Grafana. Instrument your app once, query everything.

Q How do I monitor Cloudflare/edge?

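The Cloudflare dashboard has its own analytics. For a combined view, use Cloudflare Logpush to export logs to a destination you control (for example an HTTP endpoint or object storage) and forward them from there into Loki.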

Q Do I need this for a small website?

At small scale, Uptime Kuma, Node Exporter, and a few basic dashboards are enough. Scale the stack as you grow.
