Rust for Web Scraping and Data Pipelines on DomainIndia VPS

By Domain India Team · DomainIndia Engineering
6 min read · Published 25 Apr 2026 · Updated 23 Jun 2026 · 156 views

In this article

  • 1. Why Rust for data pipelines
  • 2. Project scaffold
  • 3. Pattern 1: Async bulk scraping
  • 4. Pattern 2: Streaming JSON parsing
  • 5. Pattern 3: Parallel CPU work with rayon

TL;DR
Rust's speed and memory safety make it well suited to data pipelines that scrape, transform, or ingest millions of records. This guide shows async scraping with reqwest + scraper, processing JSON and CSV at scale, writing to PostgreSQL with sqlx, and scheduling on DomainIndia VPS.

Why Rust for data pipelines

Typical scraping/ETL workloads in Python:

  • scrape 10K URLs → takes 2 hours (requests + BeautifulSoup, single-threaded)
  • parse 5 GB JSON → 20 minutes (Python's json parser)
  • load into DB → bottlenecked by per-row inserts

Same workloads in Rust:

  • 10K URLs → 3 minutes (async reqwest, 100 concurrent)
  • 5 GB JSON → 90 seconds (serde_json + rayon)
  • DB load → 30 seconds with COPY FROM + binary format

Rust wins when you're CPU/IO-bound on data work. For light scraping (100 URLs/day), Python is fine — less code to write.

Project scaffold

bash
cargo new --bin scraper && cd scraper

Cargo.toml:

toml
[dependencies]
tokio = { version = "1.36", features = ["full"] }
reqwest = { version = "0.12", features = ["json", "gzip", "rustls-tls"] }
scraper = "0.19"           # HTML parsing
serde = { version = "1", features = ["derive"] }
serde_json = "1"
anyhow = "1"               # error handling
tracing = "0.1"
tracing-subscriber = "0.3"
futures = "0.3"
sqlx = { version = "0.7", features = ["runtime-tokio-rustls", "postgres", "chrono"] }
csv = "1.3"
rayon = "1"                # Pattern 3: parallel CPU work
governor = "0.6"           # Pattern 5: rate limiting

Pattern 1 — Async bulk scraping

Scrape 10,000 URLs concurrently, extract title + price, save to CSV.

rust
use futures::stream::{self, StreamExt};
use reqwest::Client;
use scraper::{Html, Selector};
use serde::Serialize;

#[derive(Serialize, Debug)]
struct Product {
    url: String,
    title: String,
    price: Option<String>,
}

async fn scrape_one(client: &Client, url: &str) -> anyhow::Result<Product> {
    let html = client.get(url).send().await?.text().await?;
    let doc = Html::parse_document(&html);
    let title_sel = Selector::parse("h1.product-title").unwrap();
    let price_sel = Selector::parse("span.price").unwrap();

    let title = doc.select(&title_sel)
        .next()
        .map(|n| n.text().collect::<String>().trim().to_string())
        .unwrap_or_default();
    let price = doc.select(&price_sel)
        .next()
        .map(|n| n.text().collect::<String>().trim().to_string());

    Ok(Product { url: url.to_string(), title, price })
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    tracing_subscriber::fmt::init();

    let urls: Vec<String> = std::fs::read_to_string("urls.txt")?
        .lines().map(String::from).collect();

    let client = Client::builder()
        .user_agent("MyScraper/1.0 (+https://yourcompany.com/bot)")
        .timeout(std::time::Duration::from_secs(10))
        .gzip(true)
        .build()?;

    let results: Vec<Product> = stream::iter(urls.iter())
        .map(|url| {
            let client = &client;
            async move {
                scrape_one(client, url).await
                    .unwrap_or_else(|e| {
                        tracing::warn!("failed {url}: {e}");
                        Product { url: url.to_string(), title: String::new(), price: None }
                    })
            }
        })
        .buffer_unordered(50)  // 50 concurrent requests
        .collect()
        .await;

    let mut wtr = csv::Writer::from_path("products.csv")?;
    for p in &results { wtr.serialize(p)?; }
    wtr.flush()?;

    tracing::info!("scraped {} products", results.len());
    Ok(())
}

buffer_unordered(50) caps concurrency — essential to avoid being rate-limited or overwhelming target servers.
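
The main function above turns every line of urls.txt into a request, so blank or duplicated lines become wasted (and failing) fetches. A minimal sketch of a cleaning step, assuming one URL per line and `#` for comment lines; `parse_url_list` is a hypothetical helper name, not part of reqwest or scraper:

```rust
use std::collections::HashSet;

/// Turn the raw contents of a URL list into a clean work queue:
/// trims whitespace, skips blank lines and `#` comments, keeps only
/// http(s) URLs, and drops duplicates while preserving first-seen order.
fn parse_url_list(text: &str) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    text.lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .filter(|l| l.starts_with("http://") || l.starts_with("https://"))
        // `insert` returns false for a URL we have already queued.
        .filter(|l| seen.insert(*l))
        .map(String::from)
        .collect()
}
```

It could replace the `.lines().map(String::from).collect()` step, for example `let urls = parse_url_list(&std::fs::read_to_string("urls.txt")?);`.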

Pattern 2 — Streaming JSON parsing

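A 5 GB JSON file should not be loaded into memory in one piece. If it is NDJSON (one JSON object per line), stream it line by line; for a sequence of concatenated JSON values, serde_json::Deserializer::from_reader(...).into_iter() yields them one at a time. The NDJSON case: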

rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct Event {
    user_id: u64,
    action: String,
    timestamp: i64,
}

fn process_ndjson(path: &str) -> anyhow::Result<u64> {
    let file = File::open(path)?;
    let reader = BufReader::new(file);
    let mut count = 0;

    for line in reader.lines() {
        let line = line?;
        if line.is_empty() { continue; }
        let event: Event = serde_json::from_str(&line)?;
        // process event...
        count += 1;
        if count % 100_000 == 0 {
            tracing::info!("processed {}", count);
        }
    }
    Ok(count)
}

Memory usage: flat, regardless of file size.
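
In process_ndjson above, the `?` after `serde_json::from_str` stops the whole run at the first malformed line. On multi-gigabyte inputs it is often preferable to count bad lines and keep going. A std-only sketch of that loop, where the `handle` closure is a stand-in for the parse-and-process step and `Stats` / `process_lines` are hypothetical names:

```rust
use std::io::BufRead;

/// Outcome counters for one pass over a line-delimited file.
#[derive(Debug, Default, PartialEq)]
struct Stats {
    ok: u64,
    bad: u64,
}

/// Feed each non-blank line to `handle`; count failures instead of aborting.
/// In the pipeline, `handle` would wrap `serde_json::from_str::<Event>`.
fn process_lines<R, F, E>(reader: R, mut handle: F) -> std::io::Result<Stats>
where
    R: BufRead,
    F: FnMut(&str) -> Result<(), E>,
{
    let mut stats = Stats::default();
    for line in reader.lines() {
        // I/O errors still abort: the file itself is unreadable.
        let line = line?;
        if line.trim().is_empty() {
            continue;
        }
        match handle(&line) {
            Ok(()) => stats.ok += 1,
            Err(_) => stats.bad += 1,
        }
    }
    Ok(stats)
}
```

Memory use stays flat here too, because only one line is held at a time. Logging `stats.bad` at the end shows whether the input needs attention.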

Pattern 3 — Parallel CPU work with rayon

For transforming millions of rows, Tokio isn't the right tool (async is for IO). Use rayon for CPU:

rust
use rayon::prelude::*;

let rows: Vec<RawRow> = load_all();
let processed: Vec<ProcessedRow> = rows
    .par_iter()
    .map(|row| expensive_transform(row))
    .collect();

.par_iter() uses all CPU cores automatically.
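
To make the idea concrete, here is a rough std-only sketch of what spreading a map over all cores involves. It uses scoped threads and fixed-size chunks rather than rayon, so it is only an illustration: rayon's scheduler balances uneven workloads, which this sketch does not. `parallel_map` is a hypothetical name.

```rust
use std::thread;

/// Apply `f` to every item, one scoped thread per chunk.
/// Output order matches input order.
fn parallel_map<T, U, F>(items: &[T], f: F) -> Vec<U>
where
    T: Sync,
    U: Send,
    F: Fn(&T) -> U + Sync,
{
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    // Ceiling division so every item lands in a chunk; at least 1 so `chunks` is valid.
    let chunk_len = ((items.len() + workers - 1) / workers).max(1);
    let f = &f; // share one reference with every worker

    thread::scope(|s| {
        // Spawn all workers first; joining lazily would run them one at a time.
        let handles: Vec<_> = items
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(f).collect::<Vec<U>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect::<Vec<U>>()
    })
}
```

For example, `parallel_map(&rows, |row| expensive_transform(row))` has the same shape as the rayon call above. In a real pipeline, rayon remains the more practical choice.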

Pattern 4 — Bulk DB inserts with sqlx + COPY

Per-row inserts are slow. For 1M rows, use PostgreSQL's COPY FROM:

rust
use sqlx::postgres::PgPool;
use std::io::Write; // writeln! into a Vec<u8>

async fn bulk_insert(pool: &PgPool, rows: &[Product]) -> anyhow::Result<()> {
    let mut conn = pool.acquire().await?;
    let mut copy = conn.copy_in_raw(
        "COPY products (url, title, price) FROM STDIN (FORMAT csv)"
    ).await?;

    let mut buf = Vec::with_capacity(1024 * 1024);
    for p in rows {
        // An empty, unquoted field is read as NULL in CSV format.
        writeln!(
            &mut buf,
            "{},{},{}",
            csv_escape(&p.url),
            csv_escape(&p.title),
            p.price.as_deref().map(csv_escape).unwrap_or_default()
        )?;
    }

    copy.send(buf.as_slice()).await?;
    copy.finish().await?;
    Ok(())
}

// Quote a field that contains a comma, quote, or line break,
// doubling any embedded quotes.
fn csv_escape(s: &str) -> String {
    if s.contains([',', '"', '\n', '\r']) {
        format!("\"{}\"", s.replace('"', "\"\""))
    } else {
        s.to_string()
    }
}

100K rows inserted in ~2 seconds this way vs 5+ minutes with individual INSERTs.
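
For much larger loads, building a single buffer for every row can itself use a lot of memory. One option is to encode rows into size-capped batches and send each batch separately before finishing the COPY. A std-only sketch of the encoding side, assuming rows arrive as (url, title, price) tuples; `csv_field` and `copy_batches` are hypothetical helper names:

```rust
/// Quote a CSV field when it contains a delimiter, quote, or line break.
fn csv_field(s: &str) -> String {
    if s.contains([',', '"', '\n', '\r']) {
        format!("\"{}\"", s.replace('"', "\"\""))
    } else {
        s.to_string()
    }
}

/// Encode (url, title, price) rows into CSV payloads of roughly `max_bytes` each.
/// A missing price becomes an empty unquoted field, which
/// COPY ... (FORMAT csv) reads as NULL by default.
fn copy_batches(rows: &[(String, String, Option<String>)], max_bytes: usize) -> Vec<Vec<u8>> {
    let mut batches = Vec::new();
    let mut buf: Vec<u8> = Vec::new();
    for (url, title, price) in rows {
        let line = format!(
            "{},{},{}\n",
            csv_field(url),
            csv_field(title),
            price.as_deref().map(csv_field).unwrap_or_default()
        );
        // Start a new batch before this row would push the buffer past the cap.
        if !buf.is_empty() && buf.len() + line.len() > max_bytes {
            batches.push(std::mem::take(&mut buf));
        }
        buf.extend_from_slice(line.as_bytes());
    }
    if !buf.is_empty() {
        batches.push(buf);
    }
    batches
}
```

Each batch can then go to its own `copy.send(...)` call. A real pipeline would more likely encode and send one batch at a time rather than collect them all, but the boundary logic is the same.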

Pattern 5 — Rate limiting yourself

Be a good citizen — don't hammer target servers.

rust
use governor::{Quota, RateLimiter};
use std::num::NonZeroU32;

let quota = Quota::per_second(NonZeroU32::new(10).unwrap()); // 10 req/sec
let limiter = RateLimiter::direct(quota);

for url in &urls {
    limiter.until_ready().await;
    scrape_one(&client, url).await?;
}
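
To show what a 10 req/sec budget means in practice, here is a simplified fixed-interval pacer written with std only. It is not governor's algorithm (governor also handles bursts); `Pacer` and `reserve` are hypothetical names used for illustration.

```rust
use std::time::{Duration, Instant};

/// Spaces requests evenly: at most one per `interval`.
struct Pacer {
    interval: Duration,
    next_slot: Option<Instant>,
}

impl Pacer {
    fn per_second(rate: u32) -> Self {
        assert!(rate > 0, "rate must be positive");
        Pacer { interval: Duration::from_secs(1) / rate, next_slot: None }
    }

    /// Reserve the next free slot and return how long the caller should
    /// wait (measured from `now`) before sending its request.
    fn reserve(&mut self, now: Instant) -> Duration {
        let slot = match self.next_slot {
            Some(t) if t > now => t, // still inside the budget window: queue up
            _ => now,                // idle long enough: go immediately
        };
        self.next_slot = Some(slot + self.interval);
        slot - now
    }
}
```

In async code the returned delay could be passed to `tokio::time::sleep` before each request. Passing `now` in as a parameter keeps the logic testable without real waiting.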

Scheduling on DomainIndia VPS

Option 1 — cron (simple):

bash
# /etc/cron.d/scraper
0 */6 * * *  scraper  /opt/scraper/bin/scraper >> /var/log/scraper.log 2>&1

Option 2 — systemd timer (better — journald logs, retry semantics):

/etc/systemd/system/scraper.service:

ini
[Unit]
Description=Web Scraper

[Service]
Type=oneshot
User=scraper
ExecStart=/opt/scraper/bin/scraper
EnvironmentFile=/opt/scraper/.env

/etc/systemd/system/scraper.timer:

ini
[Unit]
Description=Run scraper every 6 hours

[Timer]
OnCalendar=00/6:00
Persistent=true
RandomizedDelaySec=300

[Install]
WantedBy=timers.target

bash
sudo systemctl enable --now scraper.timer
systemctl list-timers

Respecting robots.txt and ethics

rust
// Pseudocode — check robots.txt before scraping
let robots = client.get("https://target.com/robots.txt").send().await?.text().await?;
// Parse with `robotparser` crate, check if your user agent is allowed
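
As an illustration of the check itself, here is a deliberately small std-only sketch. It assumes simple prefix rules, ignores `Allow`, wildcards, and longest-match precedence, and merges the `*` group with any group naming the agent, so it errs on the side of not fetching. `is_disallowed` is a hypothetical helper; a dedicated parser crate is the better choice in production.

```rust
/// Returns true if `path` starts with a `Disallow` prefix from any group
/// whose `User-agent` is `*` or appears in `agent` (case-insensitive).
fn is_disallowed(robots_txt: &str, agent: &str, path: &str) -> bool {
    let agent = agent.to_ascii_lowercase();
    let mut applies = false;        // does the current group apply to us?
    let mut in_agent_lines = false; // still reading the group's User-agent lines?
    let mut blocked: Vec<String> = Vec::new();

    for raw in robots_txt.lines() {
        // Strip trailing comments and surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        let (key, value) = match line.split_once(':') {
            Some(kv) => kv,
            None => continue,
        };
        let (key, value) = (key.trim().to_ascii_lowercase(), value.trim());

        match key.as_str() {
            "user-agent" => {
                let token = value.to_ascii_lowercase();
                let hit = token == "*" || (!token.is_empty() && agent.contains(&token));
                // Consecutive User-agent lines belong to the same group.
                applies = if in_agent_lines { applies || hit } else { hit };
                in_agent_lines = true;
            }
            "disallow" => {
                in_agent_lines = false;
                // An empty Disallow value means "nothing is blocked".
                if applies && !value.is_empty() {
                    blocked.push(value.to_string());
                }
            }
            _ => in_agent_lines = false,
        }
    }
    blocked.iter().any(|prefix| path.starts_with(prefix.as_str()))
}
```

A scraper would call it once per host with the fetched robots.txt body, its own User-Agent string, and the path of each URL it is about to request.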

Ethical scraping rules:

  • Respect robots.txt
  • Rate limit to 1 req/sec per domain (no rush)
  • Identify yourself with a User-Agent + contact URL
  • Cache responses — don't re-scrape unchanged pages
  • Don't scrape private/paywalled content

Violating these isn't just unethical: it can get your IP banned or invite legal action.

Common pitfalls

FAQ

Q: Rust or Python for scraping?

Python for <10K URLs and fast development. Rust for >100K URLs, huge files, or when reliability matters (no GC pauses, predictable memory). Both are valid.

Q: How much RAM does a Rust scraper need?

Very little. 100 MB is plenty for most workloads. Our VPS Starter (2 GB) runs multiple concurrent scrapers comfortably.

Q: Is scraping legal?

It depends on the target's terms of service and local law. Scraping public data with proper rate limiting and robots.txt compliance is generally OK. Always consult a lawyer for commercial scraping.

Q: Is a headless browser needed?

For pure HTML, reqwest + scraper works. For JS-rendered pages, use chromiumoxide (Chrome DevTools Protocol) or fantoccini (Selenium WebDriver). Both are much heavier and need ~500 MB RAM per instance.

Q: Can I deploy this on shared hosting?

No. Scrapers are long-running binaries and need custom port access, so they require a VPS.
