# Production Observability on DomainIndia VPS: Prometheus, Grafana, and Loki
## TL;DR
You can't fix what you can't see. This guide sets up the full observability stack on a DomainIndia VPS: Prometheus for metrics, Grafana for dashboards, and Loki for logs. It also covers alerting with Alertmanager and uptime monitoring that pages you before customers complain.
## The three pillars of observability
| Pillar | Tool | Answers |
| --- | --- | --- |
| Metrics | Prometheus | "How fast? How many? How often?" |
| Logs | Loki | "What happened? What did it say?" |
| Traces | Jaeger / Tempo | "Where did this slow request spend its time?" |
Start with metrics + logs. Add traces when you have a microservices architecture where single-request paths span multiple services.
## What to monitor
The **Four Golden Signals** (Google SRE):
1. **Latency** — how long requests take (p50, p95, p99)
2. **Traffic** — requests per second
3. **Errors** — failure rate
4. **Saturation** — resource utilisation (CPU, RAM, disk, queue depth)
Plus infrastructure:
- CPU / RAM / disk / network per VPS
- Database connections, slow query count
- Cache hit rate
- Queue size (Sidekiq, BullMQ)
- External API latency
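
To make the golden signals concrete, here is a minimal Python sketch that derives latency percentiles, traffic, and error ratio from a batch of request records. The `Request` type, the `golden_signals` helper, and the sample data are hypothetical and exist only for illustration; saturation is left out because it comes from host metrics rather than request logs.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    status: int        # HTTP status code


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list; q is in (0, 1]."""
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]


def golden_signals(requests, window_s):
    """Latency, traffic and error ratio for requests seen in a window."""
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "p50_s": percentile(durations, 0.50),
        "p95_s": percentile(durations, 0.95),
        "p99_s": percentile(durations, 0.99),
        "traffic_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
    }


# 100 requests over 10 seconds: 90 fast successes, 10 slow server errors
sample = [Request(0.1, 200)] * 90 + [Request(2.0, 500)] * 10
signals = golden_signals(sample, window_s=10)
print(signals)
```

Note how the median looks healthy while p95 and p99 expose the slow tail, which is why the list above asks for all three.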
## Option A — Self-host stack (VPS)
For a DomainIndia VPS setup, install Prometheus, Grafana, and Loki either on the same VPS as your application or on a dedicated monitoring VPS.
### Step 1 — Install Prometheus
```bash
wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar xzf prometheus-2.52.0.linux-amd64.tar.gz
sudo mv prometheus-2.52.0.linux-amd64 /opt/prometheus
sudo useradd -r prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
```
`/opt/prometheus/prometheus.yml`:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100', 'vps2.internal:9100', 'vps3.internal:9100']
  - job_name: 'app'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app-vps:8080']
  - job_name: 'postgres'
    static_configs:
      - targets: ['db-vps:9187']
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  - 'alerts.yml'
```
systemd service `/etc/systemd/system/prometheus.service`:
```ini
[Unit]
Description=Prometheus
After=network.target
[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --web.listen-address=127.0.0.1:9090 \
  --storage.tsdb.retention.time=30d
Restart=on-failure
[Install]
WantedBy=multi-user.target
```
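
Once the service is running, one way to confirm that the scrape targets are reachable is to ask Prometheus for the built-in `up` metric through its HTTP API (`/api/v1/query`). The sketch below assumes Prometheus listens on `localhost:9090` as configured above; the helper names are hypothetical, and the sample response is hand-written to mirror the API's JSON shape so the parsing can be tried offline.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: same host as Prometheus


def build_query_url(base_url, promql):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def down_targets(response):
    """Return (job, instance) pairs whose `up` sample is 0."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    return [
        (s["metric"].get("job", ""), s["metric"].get("instance", ""))
        for s in response["data"]["result"]
        if float(s["value"][1]) == 0.0
    ]


def fetch_down_targets(base_url=PROMETHEUS_URL):
    """Run the `up` query against a live server (needs network access)."""
    with urllib.request.urlopen(build_query_url(base_url, "up"), timeout=5) as resp:
        return down_targets(json.load(resp))


# Hand-written response in the API's format, for offline illustration
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "prometheus", "instance": "localhost:9090"},
             "value": [1700000000, "1"]},
            {"metric": {"__name__": "up", "job": "node", "instance": "vps2.internal:9100"},
             "value": [1700000000, "0"]},
        ],
    },
}
print(down_targets(sample))
```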
### Step 2 — Node Exporter (per VPS)
On every VPS you want to monitor:
```bash
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xzf node_exporter-*.tar.gz
sudo mv node_exporter-*/node_exporter /usr/local/bin/
sudo useradd -r node_exporter
```
systemd:
```ini
[Unit]
Description=Node Exporter
[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
Restart=on-failure
[Install]
WantedBy=multi-user.target
```
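Node Exporter exposes CPU, RAM, disk, network, and filesystem metrics out of the box (100+ in total). Systemd unit status is also available, but that collector is off by default and needs the `--collector.systemd` flag.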
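
To see what Prometheus actually scrapes, it helps to read the plain-text output of `http://localhost:9100/metrics` directly. The following is a small sketch, not a full parser: it assumes lines have no trailing timestamp (which is how Node Exporter writes them), and the sample text and helper names are made up for illustration.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {series: value}, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; labels may contain spaces
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def memory_used_percent(samples):
    """Used memory as a percentage, based on MemAvailable."""
    total = samples["node_memory_MemTotal_bytes"]
    available = samples["node_memory_MemAvailable_bytes"]
    return 100.0 * (1 - available / total)


# Made-up excerpt in the exposition format
sample = """\
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 4e+09
node_memory_MemAvailable_bytes 1e+09
node_filesystem_avail_bytes{device="/dev/vda1",fstype="ext4",mountpoint="/"} 2.5e+10
"""
metrics = parse_metrics(sample)
print(memory_used_percent(metrics))
```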
### Step 3 — Install Grafana
```bash
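# AlmaLinux
sudo dnf install -y https://dl.grafana.com/oss/release/grafana-10.4.0-1.x86_64.rpm
# Ubuntu (apt-key is deprecated, so store the key in a dedicated keyring)
sudo mkdir -p /etc/apt/keyrings
wget -qO - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update && sudo apt install -y grafana
# Both distributions: start Grafana now and on every boot
sudo systemctl enable --now grafana-server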
```
Access `http://your-vps-ip:3000` (default admin/admin — change on first login).
Add Prometheus datasource: `http://localhost:9090`.
Import dashboards: #1860 (Node Exporter Full), #9628 (PostgreSQL), #7587 (nginx).
### Step 4 — Install Loki + Promtail
Loki (log aggregator):
```bash
# Loki server
wget https://github.com/grafana/loki/releases/download/v3.0.0/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki
# Minimal config: /etc/loki/config.yml
```
Promtail (log shipper, runs on each VPS):
```yaml
# /etc/promtail/config.yml
server:
  http_listen_port: 9080
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: http://loki.yourcompany.com:3100/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          host: vps1
          __path__: /var/log/*log
  - job_name: app
    static_configs:
      - targets: [localhost]
        labels:
          job: app
          host: vps1
          __path__: /home/app/logs/*.log
```
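
For intuition about what travels over the `clients.url` endpoint, this sketch builds a request body in the shape Loki's push API (`/loki/api/v1/push`) accepts: a list of streams, each with a label set and `[timestamp, line]` pairs where the timestamp is a string of nanoseconds. It is an illustration of the payload, not Promtail's actual implementation, and `build_push_payload` is a hypothetical helper.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Body for POST /loki/api/v1/push: one stream, one entry per log line."""
    if now_ns is None:
        now_ns = time.time_ns()
    return {
        "streams": [
            {
                "stream": dict(labels),
                # Nanosecond timestamps as strings, offset so entries stay ordered
                "values": [[str(now_ns + i), line] for i, line in enumerate(lines)],
            }
        ]
    }


payload = build_push_payload(
    {"job": "app", "host": "vps1"},
    ["GET /api/users 200", "GET /api/orders 500 error"],
    now_ns=1_700_000_000_000_000_000,
)
print(json.dumps(payload, indent=2))
```

The `job` and `host` labels from the Promtail config become the stream's label set, which is what LogQL selectors match against.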
systemd for Promtail:
```ini
[Service]
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yml
Restart=on-failure
```
Add Loki datasource in Grafana (`http://localhost:3100`). Query logs with LogQL:
```logql
{job="app"} |= "error"
```
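
The same LogQL query can be run outside Grafana through Loki's `query_range` HTTP endpoint. The sketch below assumes Loki is reachable at `http://localhost:3100`; the helper names are hypothetical, and the sample response is hand-written in the API's `streams` format so the parsing step can be tried without a server.

```python
import urllib.parse


def build_query_range_url(base_url, logql, start_ns, end_ns, limit=100):
    """URL for GET /loki/api/v1/query_range; times are Unix nanoseconds."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url}/loki/api/v1/query_range?" + urllib.parse.urlencode(params)


def extract_lines(response):
    """Flatten a streams response into (timestamp_ns, host, line), oldest first."""
    lines = []
    for stream in response["data"]["result"]:
        host = stream["stream"].get("host", "")
        for ts, line in stream["values"]:
            lines.append((int(ts), host, line))
    return sorted(lines)


url = build_query_range_url(
    "http://localhost:3100", '{job="app"} |= "error"',
    start_ns=1_700_000_000_000_000_000, end_ns=1_700_000_600_000_000_000,
)

# Hand-written response in the API's format
sample = {
    "status": "success",
    "data": {
        "resultType": "streams",
        "result": [
            {"stream": {"job": "app", "host": "vps1"},
             "values": [["1700000002000000000", "error: db timeout"],
                        ["1700000001000000000", "error: upstream 502"]]},
        ],
    },
}
print(url)
print(extract_lines(sample))
```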
## Option B — Grafana Cloud Free Tier
If VPS resources are tight, Grafana Cloud has a free plan:
- 10K metrics series
- 50 GB logs
- 14-day retention
- Unlimited dashboards
Install `grafana-agent` (or its successor, Grafana Alloy) on your VPS and point it at Grafana Cloud. You get hosted dashboards and storage with no self-hosting burden.
## Step 5 — Instrumenting your app
**Node.js:**
```javascript
import express from 'express';
import client from 'prom-client';

const register = new client.Registry();
client.collectDefaultMetrics({ register });

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [register],
});

const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

const app = express();

// Record count and duration for every request once the response is sent
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Prefer the route pattern; raw paths as a fallback can inflate label cardinality
    const labels = { method: req.method, route: req.route?.path || req.path, status: res.statusCode };
    httpRequestsTotal.inc(labels);
    httpDuration.observe(labels, duration);
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});

// Port 8080 matches the 'app' scrape target in prometheus.yml
app.listen(8080);
```
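
The `buckets` list matters because a Prometheus histogram stores only cumulative counts per bucket, and percentiles are later estimated from those counts. The Python sketch below illustrates the idea with the same bucket bounds; it is a simplified model of how `histogram_quantile` interpolates linearly inside a bucket, not Prometheus's own code, and the durations are made up.

```python
import math

BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]


def cumulative_counts(durations, bounds=BUCKETS):
    """Count observations <= each upper bound, plus a final +Inf bucket."""
    uppers = list(bounds) + [math.inf]
    return [(le, sum(1 for d in durations if d <= le)) for le in uppers]


def estimate_quantile(q, buckets):
    """Interpolate linearly inside the bucket that contains rank q * total."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # no upper edge to interpolate towards
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le


# 90 requests at 20 ms and 10 requests at 300 ms
durations = [0.02] * 90 + [0.3] * 10
buckets = cumulative_counts(durations)
p95 = estimate_quantile(0.95, buckets)
print(round(p95, 3))
```

The true 95th percentile here is 0.3 s, but the estimate lands elsewhere inside the 0.25 to 0.5 bucket. Choosing bucket bounds close to your latency targets keeps that error small.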
**Python (FastAPI):**
```python
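from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
# Instrument all routes and expose the metrics at /metrics
Instrumentator().instrument(app).expose(app)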
```
**PHP:**
```bash
composer require promphp/prometheus_client_php
```
```php
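use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\Redis;

$adapter = new Redis(['host' => 'localhost']);
$registry = new CollectorRegistry($adapter);
$counter = $registry->getOrRegisterCounter('app', 'requests_total', 'Total requests', ['route']);
$counter->inc(['/api/users']);

// At /metrics endpoint: render the samples in the Prometheus text format
$renderer = new RenderTextFormat();
header('Content-Type: ' . RenderTextFormat::MIME_TYPE);
echo $renderer->render($registry->getMetricFamilySamples());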
```
## Step 6 — Alertmanager
Alert when things break. `/opt/prometheus/alerts.yml`:
```yaml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "CPU > 80% on {{ $labels.instance }}"
      - alert: DiskFull
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Disk < 10% on {{ $labels.instance }}"
      - alert: HighErrorRate
        # Share of 5xx responses among all responses per route (a bare rate would be errors per second)
        expr: |
          sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (route) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Error rate > 5% on {{ $labels.route }}"
      - alert: PodDown
        expr: up == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is DOWN"
```
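
The `HighCPU` expression is easier to reason about with numbers. `node_cpu_seconds_total{mode="idle"}` is a per-core counter of idle seconds, so its per-second rate is the idle fraction of that core. The sketch below reproduces the arithmetic in Python with made-up counter values; it is a simplification, since the real `rate()` also handles counter resets and extrapolates at the window edges.

```python
def cpu_busy_percent(idle_start, idle_end, window_s):
    """Busy percentage from per-core idle-second counters at window start and end."""
    idle_rates = [(end - start) / window_s for start, end in zip(idle_start, idle_end)]
    avg_idle = sum(idle_rates) / len(idle_rates)  # avg by(instance)
    return 100 - avg_idle * 100


# Two cores over a 300 s window: one was idle for 30 s, the other for 60 s
busy = cpu_busy_percent([1000.0, 2000.0], [1030.0, 2060.0], window_s=300)
print(round(busy, 1))          # busy percentage for the instance
print(busy > 80)               # the alert condition
```

With these values the instance is about 85% busy, so the condition holds. The alert still fires only if it stays true for the full `for: 10m`.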
Alertmanager config `/etc/alertmanager/alertmanager.yml`:
```yaml
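route:
  receiver: default
  group_by: ['alertname', 'instance']
  group_wait: 30s
receivers:
  - name: default
    email_configs:
      - to: '[email protected]'    # replace with the address that should receive alerts
        from: '[email protected]'  # replace with your sender address
        smarthost: 'smtp.gmail.com:587'
        auth_username: '...'
        auth_password: '...'
    # Slack incoming webhooks need slack_configs; webhook_configs posts
    # Alertmanager's own JSON format, which Slack does not accept
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
        channel: '#alerts'  # placeholder channel name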
```
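
The `group_by` setting decides which firing alerts are bundled into a single notification. As a rough model of that behaviour (not Alertmanager's implementation), the sketch below buckets alerts by the values of the `group_by` labels. The alert dictionaries are made-up examples that reuse the rule names from `alerts.yml`.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the tuple of their values for the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"labels": {"alertname": "HighCPU", "instance": "vps1:9100"}},
    {"labels": {"alertname": "DiskFull", "instance": "vps1:9100"}},
    {"labels": {"alertname": "HighCPU", "instance": "vps2.internal:9100"}},
    {"labels": {"alertname": "HighErrorRate", "instance": "app-vps:8080", "route": "/api/users"}},
    {"labels": {"alertname": "HighErrorRate", "instance": "app-vps:8080", "route": "/api/orders"}},
]

groups = group_alerts(alerts, ["alertname", "instance"])
for key, members in groups.items():
    print(key, len(members))
```

The two `HighErrorRate` alerts differ only in `route`, which is not a grouping label, so they arrive as one notification. Everything else is sent separately after `group_wait`.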
## Step 7 — Uptime monitoring
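Bonus: an external uptime check. A self-hosted option is **Uptime Kuma**. Run it on a different VPS from the ones it watches, so it can still alert when a monitored VPS goes down: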
```bash
docker run -d --restart=always -p 3001:3001 -v uptime-kuma:/app/data --name uptime-kuma louislam/uptime-kuma:1
```
Configure: HTTP checks, keyword search, SSL cert expiry, DNS. Alerts via Slack/email/Telegram/SMS.
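
To show what an SSL expiry check involves, here is a minimal standard-library sketch that reads a certificate's `notAfter` field and converts it to days remaining. It is an illustration of the idea rather than how Uptime Kuma implements it; the function names and the 14-day threshold are assumptions.

```python
import socket
import ssl
import time

WARN_DAYS = 14  # hypothetical alert threshold


def days_until_expiry(not_after, now=None):
    """Days left for a certificate date such as 'Jan  5 09:34:43 2030 GMT'."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host, port=443, timeout=5):
    """Read the notAfter field from a live server's certificate (needs network)."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Offline example with fixed dates
now = ssl.cert_time_to_seconds("Jan  1 00:00:00 2030 GMT")
remaining = days_until_expiry("Jan 31 00:00:00 2030 GMT", now=now)
print(remaining, remaining < WARN_DAYS)
```

Against a real site you would call `days_until_expiry(fetch_not_after("example.com"))` on a schedule and notify when the result drops below the threshold.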
## Common pitfalls
## FAQ
**Q: How much RAM for this stack?**
A small deployment (1-3 VPS monitored, 30d retention) fits on a 2 GB VPS. A larger one (10+ VPS, long retention) needs a dedicated monitoring VPS with 4+ GB.

**Q: Datadog/New Relic or self-host?**
Self-host if you have the ops capacity and predictable cost matters. Go managed (Datadog, New Relic, Grafana Cloud) if you want zero ops and are willing to pay per host or per metric.

**Q: What about APM (application performance monitoring)?**
OpenTelemetry is the standard. Tempo (traces), Loki (logs), and Prometheus (metrics) integrate natively in Grafana. Instrument your app once, query everything.

**Q: How do I monitor Cloudflare/edge?**
The Cloudflare dashboard has its own analytics. For a combined view, use Cloudflare's Logpush to send events to your Loki.

**Q: Do I need this for a small website?**
At small scale, Uptime Kuma plus Node Exporter and a few basic dashboards is enough. Scale the stack as you grow.