# Rust for Web Scraping and Data Pipelines on DomainIndia VPS
**TL;DR:** Rust's speed and memory safety make it ideal for data pipelines that scrape, transform, or ingest millions of records. This guide shows async scraping with reqwest + scraper, parsing JSON/CSV at scale, writing to PostgreSQL with sqlx, and scheduling on DomainIndia VPS.
## Why Rust for data pipelines
Typical scraping/ETL workloads in Python:
- scrape 10K URLs → takes 2 hours (requests + BeautifulSoup, single-threaded)
- parse 5 GB JSON → 20 minutes (Python's json parser)
- load into DB → bottlenecked by per-row inserts
Same workloads in Rust:
- 10K URLs → 3 minutes (async reqwest, 100 concurrent)
- 5 GB JSON → 90 seconds (serde_json + rayon)
- DB load → 30 seconds with `COPY FROM` + binary format
Rust wins when you're CPU/IO-bound on data work. For light scraping (100 URLs/day), Python is fine — less code to write.
## Project scaffold
```bash
cargo new --bin scraper && cd scraper
```
`Cargo.toml`:
```toml
[dependencies]
tokio = { version = "1.36", features = ["full"] }
reqwest = { version = "0.12", features = ["json", "gzip", "rustls-tls"] }
scraper = "0.19" # HTML parsing
serde = { version = "1", features = ["derive"] }
serde_json = "1"
anyhow = "1" # error handling
tracing = "0.1"
tracing-subscriber = "0.3"
futures = "0.3"
sqlx = { version = "0.7", features = ["runtime-tokio-rustls", "postgres", "chrono"] }
csv = "1.3"
```
## Pattern 1 — Async bulk scraping
Scrape 10,000 URLs concurrently, extract title + price, save to CSV.
```rust
use futures::stream::{self, StreamExt};
use reqwest::Client;
use scraper::{Html, Selector};
use serde::Serialize;
#[derive(Serialize, Debug)]
struct Product {
    url: String,
    title: String,
    price: Option<String>,
}
async fn scrape_one(client: &Client, url: &str) -> anyhow::Result<Product> {
    let html = client.get(url).send().await?.text().await?;
    let doc = Html::parse_document(&html);
    let title_sel = Selector::parse("h1.product-title").unwrap();
    let price_sel = Selector::parse("span.price").unwrap();
    let title = doc.select(&title_sel)
        .next()
        .map(|n| n.text().collect::<String>().trim().to_string())
        .unwrap_or_default();
    let price = doc.select(&price_sel)
        .next()
        .map(|n| n.text().collect::<String>().trim().to_string());
    Ok(Product { url: url.to_string(), title, price })
}
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    tracing_subscriber::fmt::init();
    let urls: Vec<String> = std::fs::read_to_string("urls.txt")?
        .lines().map(String::from).collect();
    let client = Client::builder()
        .user_agent("MyScraper/1.0 (+https://yourcompany.com/bot)")
        .timeout(std::time::Duration::from_secs(10))
        .gzip(true)
        .build()?;
    let results: Vec<Product> = stream::iter(urls.iter())
        .map(|url| {
            let client = &client;
            async move {
                scrape_one(client, url).await
                    .unwrap_or_else(|e| {
                        tracing::warn!("failed {url}: {e}");
                        Product { url: url.to_string(), title: String::new(), price: None }
                    })
            }
        })
        .buffer_unordered(50) // 50 concurrent requests
        .collect()
        .await;
    let mut wtr = csv::Writer::from_path("products.csv")?;
    for p in &results { wtr.serialize(p)?; }
    wtr.flush()?;
    tracing::info!("scraped {} products", results.len());
    Ok(())
}
```
`buffer_unordered(50)` caps concurrency — essential to avoid being rate-limited or overwhelming target servers.
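Transient failures (timeouts, 502s) are routine at 10K-URL scale, so failed URLs are worth retrying before giving up. Below is a sketch of a generic retry helper with exponential backoff; it's shown synchronous for brevity (in the async pipeline above you'd use `tokio::time::sleep` instead of `std::thread::sleep`), and `retry_with_backoff` is an illustrative name, not part of any crate:

```rust
use std::time::Duration;

/// Retry a fallible operation up to `max_attempts` times,
/// doubling the delay after each failure (100ms, 200ms, 400ms...).
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = Duration::from_millis(100);
    let mut attempt = 1;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // out of attempts: surface the last error
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                std::thread::sleep(delay);
                delay *= 2; // exponential backoff
                attempt += 1;
            }
        }
    }
}
```

Capping attempts matters: a URL that fails five times in a row is almost certainly broken or blocking you, and endless retries just burn your rate-limit budget.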
## Pattern 2 — Streaming JSON parsing
For a 5 GB JSON file you can't load into memory, stream line-by-line (if NDJSON) or use `serde_json::Deserializer`:
```rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use serde::Deserialize;
#[derive(Deserialize, Debug)]
struct Event {
    user_id: u64,
    action: String,
    timestamp: i64,
}

fn process_ndjson(path: &str) -> anyhow::Result<usize> {
    let file = File::open(path)?;
    let reader = BufReader::new(file);
    let mut count = 0;
    for line in reader.lines() {
        let line = line?;
        if line.is_empty() { continue; }
        let event: Event = serde_json::from_str(&line)?;
        // process event...
        count += 1;
        if count % 100_000 == 0 {
            tracing::info!("processed {}", count);
        }
    }
    Ok(count)
}
```
Memory usage: flat, regardless of file size.
## Pattern 3 — Parallel CPU work with rayon
For transforming millions of rows, Tokio isn't the right tool (async is for IO). Use rayon for CPU:
```rust
use rayon::prelude::*;
let rows: Vec<Row> = load_all();
let processed: Vec<Row> = rows
    .par_iter()
    .map(|row| expensive_transform(row))
    .collect();
```
`.par_iter()` uses all CPU cores automatically.
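Under the hood this is fork-join parallelism: split the input into chunks, process each chunk on its own thread, stitch the results back in order. A dependency-free sketch of the same idea with `std::thread::scope` (`parallel_map` is an illustrative name; rayon adds work stealing and sensible defaults on top of this):

```rust
use std::thread;

/// Apply `f` to every element in parallel: one chunk per worker thread,
/// results re-assembled in the original order.
fn parallel_map<T: Sync, U: Send>(
    items: &[T],
    workers: usize,
    f: impl Fn(&T) -> U + Sync,
) -> Vec<U> {
    // ceil-divide so every element lands in some chunk; min chunk size 1
    let chunk = items.len().div_ceil(workers).max(1);
    let mut out = Vec::new();
    thread::scope(|s| {
        let f = &f; // share the closure by reference across threads
        let handles: Vec<_> = items
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(f).collect::<Vec<U>>()))
            .collect();
        for h in handles {
            out.extend(h.join().unwrap());
        }
    });
    out
}
```

Even chunking assumes each element costs roughly the same; rayon's work stealing is what saves you when some rows are 100x more expensive than others.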
## Pattern 4 — Bulk DB inserts with sqlx + COPY
Per-row inserts are slow. For 1M rows, use PostgreSQL's `COPY FROM`:
```rust
use sqlx::postgres::{PgPoolOptions, PgPool};
use std::io::Write; // for writeln! into the Vec<u8> buffer
async fn bulk_insert(pool: &PgPool, rows: &[Product]) -> anyhow::Result<()> {
    let mut conn = pool.acquire().await?;
    let mut copy = conn.copy_in_raw(
        "COPY products (url, title, price) FROM STDIN (FORMAT csv)"
    ).await?;
    let mut buf = Vec::with_capacity(1024 * 1024);
    for p in rows {
        writeln!(
            &mut buf,
            "{},{},{}",
            csv_escape(&p.url),
            csv_escape(&p.title),
            p.price.as_deref().unwrap_or("")
        )?;
    }
    copy.send(buf.as_slice()).await?;
    copy.finish().await?;
    Ok(())
}

fn csv_escape(s: &str) -> String {
    if s.contains(',') || s.contains('"') {
        format!("\"{}\"", s.replace('"', "\"\""))
    } else {
        s.to_string()
    }
}
```
100K rows inserted in ~2 seconds this way vs 5+ minutes with individual `INSERT`s.
## Pattern 5 — Rate limiting yourself
Be a good citizen — don't hammer target servers.
```rust
use governor::{Quota, RateLimiter};
use std::num::NonZeroU32;
let quota = Quota::per_second(NonZeroU32::new(10).unwrap()); // 10 req/sec
let limiter = RateLimiter::direct(quota);
for url in &urls {
    limiter.until_ready().await;
    scrape_one(&client, url).await?;
}
```
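governor is the robust choice, but the underlying idea is small enough to sketch with only the standard library. A minimal blocking fixed-interval limiter (`Throttle` is a hypothetical name; the async pipeline above should stick with governor or `tokio::time::interval`):

```rust
use std::time::{Duration, Instant};

/// Minimal blocking rate limiter: hands out one permit per `interval`.
struct Throttle {
    interval: Duration,
    next: Instant,
}

impl Throttle {
    fn new(per_second: u32) -> Self {
        Self {
            interval: Duration::from_secs(1) / per_second,
            next: Instant::now(),
        }
    }

    /// Block until the next permit is available.
    fn wait(&mut self) {
        let now = Instant::now();
        if now < self.next {
            std::thread::sleep(self.next - now);
        }
        self.next = Instant::now() + self.interval;
    }
}
```

This is per-process and per-target-pool, not per-domain; if you scrape many domains, keep one limiter per domain so a slow site doesn't throttle the rest.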
## Scheduling on DomainIndia VPS
**Option 1 — cron** (simple):
```bash
# /etc/cron.d/scraper
0 */6 * * * scraper /opt/scraper/bin/scraper >> /var/log/scraper.log 2>&1
```
**Option 2 — systemd timer** (better — journald logs, retry semantics):
`/etc/systemd/system/scraper.service`:
```ini
[Unit]
Description=Web Scraper
[Service]
Type=oneshot
User=scraper
ExecStart=/opt/scraper/bin/scraper
EnvironmentFile=/opt/scraper/.env
```
`/etc/systemd/system/scraper.timer`:
```ini
[Unit]
Description=Run scraper every 6 hours
[Timer]
OnCalendar=*-*-* 00/6:00:00
Persistent=true
RandomizedDelaySec=300
[Install]
WantedBy=timers.target
```
```bash
sudo systemctl enable --now scraper.timer
systemctl list-timers
```
## Respecting robots.txt and ethics
```rust
// Pseudocode — check robots.txt before scraping
let robots = client.get("https://target.com/robots.txt").send().await?.text().await?;
// Parse with `robotparser` crate, check if your user agent is allowed
```
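As a rough illustration of what a robots.txt parser does, here is a naive check that only honors `Disallow:` rules in the `User-agent: *` group (`disallowed_for_all` is a hypothetical helper; real parsers also handle `Allow:`, wildcards, and agent-specific groups, so prefer a dedicated crate in production):

```rust
/// Naive robots.txt check: returns true if `path` matches any `Disallow:`
/// prefix listed under the `User-agent: *` group. Ignores Allow rules,
/// wildcards, and crawl-delay — a sketch, not a compliant parser.
fn disallowed_for_all(robots: &str, path: &str) -> bool {
    let mut in_star_group = false;
    for line in robots.lines() {
        let line = line.trim();
        if let Some(agent) = line.strip_prefix("User-agent:") {
            in_star_group = agent.trim() == "*";
        } else if in_star_group {
            if let Some(rule) = line.strip_prefix("Disallow:") {
                let rule = rule.trim();
                // an empty Disallow means "allow everything"
                if !rule.is_empty() && path.starts_with(rule) {
                    return true;
                }
            }
        }
    }
    false
}
```

Fetch robots.txt once per domain and cache the parsed rules; re-downloading it before every request defeats the point of being polite.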
Ethical scraping rules:
- Respect robots.txt
- Rate limit to 1 req/sec per domain (no rush)
- Identify yourself with a User-Agent + contact URL
- Cache responses — don't re-scrape unchanged pages
- Don't scrape private/paywalled content
Violating these rules isn't just unethical: it can get your IP banned or invite legal action.
## FAQ
**Q: Rust or Python for scraping?**

Python for <10K URLs and fast development. Rust for >100K URLs, huge files, or when reliability matters (no GC pauses, predictable memory). Both are valid.

**Q: How much RAM does a Rust scraper need?**

Tiny. 100 MB is plenty for most workloads. Our VPS Starter (2 GB) runs multiple concurrent scrapers comfortably.

**Q: Is scraping legal?**

Depends on the target's terms and local law. Public data with proper rate limiting and robots.txt compliance is generally OK. Always consult a lawyer for commercial scraping.

**Q: Headless browser needed?**

For pure HTML, reqwest + scraper works. For JS-rendered pages, use chromiumoxide (Chrome DevTools Protocol) or fantoccini (Selenium WebDriver); both are much heavier and need ~500 MB RAM per browser instance.

**Q: Can I deploy this on shared hosting?**

No. Scrapers are long-running binaries that need custom port access. VPS only.
Rust data pipelines love a fast VPS with lots of CPU.
Pick a VPS