

By DomainIndia Team · DomainIndia Engineering
24 Apr 2026 · 6 min read
# Rust for Web Scraping and Data Pipelines on DomainIndia VPS
**TL;DR:** Rust's speed and memory safety make it ideal for data pipelines that scrape, transform, or ingest millions of records. This guide shows async scraping with reqwest + scraper, parsing JSON/CSV at scale, writing to PostgreSQL with sqlx, and scheduling on a DomainIndia VPS.
## Why Rust for data pipelines

Typical scraping/ETL workloads in Python:

- scrape 10K URLs → 2 hours (requests + BeautifulSoup, single-threaded)
- parse 5 GB JSON → 20 minutes (Python's json parser)
- load into DB → bottlenecked by per-row inserts

Same workloads in Rust:

- 10K URLs → 3 minutes (async reqwest, 100 concurrent)
- 5 GB JSON → 90 seconds (serde_json + rayon)
- DB load → 30 seconds with `COPY FROM` + binary format

Rust wins when you're CPU/IO-bound on data work. For light scraping (100 URLs/day), Python is fine — less code to write.

## Project scaffold

```bash
cargo new --bin scraper && cd scraper
```

`Cargo.toml`:

```toml
[dependencies]
tokio = { version = "1.36", features = ["full"] }
reqwest = { version = "0.12", features = ["json", "gzip", "rustls-tls"] }
scraper = "0.19"          # HTML parsing
serde = { version = "1", features = ["derive"] }
serde_json = "1"
anyhow = "1"              # error handling
tracing = "0.1"
tracing-subscriber = "0.3"
futures = "0.3"
sqlx = { version = "0.7", features = ["runtime-tokio-rustls", "postgres", "chrono"] }
csv = "1.3"
```

## Pattern 1 — Async bulk scraping

Scrape 10,000 URLs concurrently, extract title + price, save to CSV.
```rust
use futures::stream::{self, StreamExt};
use reqwest::Client;
use scraper::{Html, Selector};
use serde::Serialize;

#[derive(Serialize, Debug)]
struct Product {
    url: String,
    title: String,
    price: Option<String>,
}

async fn scrape_one(client: &Client, url: &str) -> anyhow::Result<Product> {
    let html = client.get(url).send().await?.text().await?;
    let doc = Html::parse_document(&html);
    let title_sel = Selector::parse("h1.product-title").unwrap();
    let price_sel = Selector::parse("span.price").unwrap();

    let title = doc.select(&title_sel)
        .next()
        .map(|n| n.text().collect::<String>().trim().to_string())
        .unwrap_or_default();
    let price = doc.select(&price_sel)
        .next()
        .map(|n| n.text().collect::<String>().trim().to_string());

    Ok(Product { url: url.to_string(), title, price })
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    tracing_subscriber::fmt::init();
    let urls: Vec<String> = std::fs::read_to_string("urls.txt")?
        .lines().map(String::from).collect();

    let client = Client::builder()
        .user_agent("MyScraper/1.0 (+https://yourcompany.com/bot)")
        .timeout(std::time::Duration::from_secs(10))
        .gzip(true)
        .build()?;

    let results: Vec<Product> = stream::iter(urls.iter())
        .map(|url| {
            let client = &client;
            async move {
                scrape_one(client, url).await.unwrap_or_else(|e| {
                    tracing::warn!("failed {url}: {e}");
                    Product { url: url.to_string(), title: String::new(), price: None }
                })
            }
        })
        .buffer_unordered(50) // 50 concurrent requests
        .collect()
        .await;

    let mut wtr = csv::Writer::from_path("products.csv")?;
    for p in &results {
        wtr.serialize(p)?;
    }
    wtr.flush()?;
    tracing::info!("scraped {} products", results.len());
    Ok(())
}
```

`buffer_unordered(50)` caps concurrency — essential to avoid being rate-limited or overwhelming target servers.
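Pattern 1 treats every failure as permanent, but network errors (timeouts, 429s, transient 5xx) often succeed on a second try. A minimal retry-with-exponential-backoff sketch — shown in blocking form using only the standard library; in the async pipeline above you would swap `std::thread::sleep` for `tokio::time::sleep`. The helper name `with_retry` is ours, not from any crate:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry a fallible operation up to `max_attempts` times, sleeping
/// 250 ms, 500 ms, 1 s, ... (capped at 10 s) between attempts.
fn with_retry<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => {
                attempt += 1;
                if attempt >= max_attempts {
                    // out of attempts — surface the last error
                    return Err(e);
                }
                // 250 ms * 2^(attempt-1), capped at 10 s
                let delay = Duration::from_millis(250)
                    .saturating_mul(1u32 << (attempt - 1).min(5));
                sleep(delay.min(Duration::from_secs(10)));
            }
        }
    }
}
```

Only retry on errors worth retrying — a 404 will still be a 404 on attempt three, so in practice you would inspect the error (or status code) before looping.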
## Pattern 2 — Streaming JSON parsing

For a 5 GB JSON file you can't load into memory, stream line-by-line (if NDJSON) or use `serde_json::Deserializer`:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct Event {
    user_id: u64,
    action: String,
    timestamp: i64,
}

fn process_ndjson(path: &str) -> anyhow::Result<u64> {
    let file = File::open(path)?;
    let reader = BufReader::new(file);
    let mut count = 0u64;
    for line in reader.lines() {
        let line = line?;
        if line.is_empty() { continue; }
        let event: Event = serde_json::from_str(&line)?;
        // process event...
        count += 1;
        if count % 100_000 == 0 {
            tracing::info!("processed {}", count);
        }
    }
    Ok(count)
}
```

Memory usage: flat, regardless of file size.

## Pattern 3 — Parallel CPU work with rayon

For transforming millions of rows, Tokio isn't the right tool (async is for IO). Use rayon for CPU (add `rayon = "1"` to `Cargo.toml`):

```rust
use rayon::prelude::*;

// `Row`, `load_all`, and `expensive_transform` are your own types/functions
let rows: Vec<Row> = load_all();
let processed: Vec<Row> = rows
    .par_iter()
    .map(|row| expensive_transform(row))
    .collect();
```

`.par_iter()` uses all CPU cores automatically.

## Pattern 4 — Bulk DB inserts with sqlx + COPY

Per-row inserts are slow. For 1M rows, use PostgreSQL's `COPY FROM`:

```rust
use std::io::Write;
use sqlx::postgres::PgPool;

async fn bulk_insert(pool: &PgPool, rows: &[Product]) -> anyhow::Result<()> {
    let mut conn = pool.acquire().await?;
    let mut copy = conn.copy_in_raw(
        "COPY products (url, title, price) FROM STDIN (FORMAT csv)"
    ).await?;

    let mut buf = Vec::with_capacity(1024 * 1024);
    for p in rows {
        writeln!(
            &mut buf,
            "{},{},{}",
            csv_escape(&p.url),
            csv_escape(&p.title),
            p.price.as_deref().unwrap_or("")
        )?;
    }
    copy.send(buf.as_slice()).await?;
    copy.finish().await?;
    Ok(())
}

fn csv_escape(s: &str) -> String {
    if s.contains(',') || s.contains('"') {
        format!("\"{}\"", s.replace('"', "\"\""))
    } else {
        s.to_string()
    }
}
```

100K rows inserted in ~2 seconds this way vs 5+ minutes with individual `INSERT`s.
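One caveat with the hand-rolled `csv_escape` above: it quotes commas and quotes, but a field containing a raw newline would also split a `COPY ... (FORMAT csv)` row. A slightly more defensive sketch (the `csv` crate already in the scaffold can also write into a `Vec<u8>` and handles all of this for you):

```rust
/// Quote a field for CSV if it contains any character that would
/// break an unquoted field: comma, double quote, CR, or LF.
/// Embedded quotes are doubled per the CSV convention.
fn csv_escape(s: &str) -> String {
    if s.contains(&[',', '"', '\r', '\n'][..]) {
        format!("\"{}\"", s.replace('"', "\"\""))
    } else {
        s.to_string()
    }
}
```

Scraped titles are exactly the kind of data that smuggles newlines in, so the extra two characters of pattern are cheap insurance.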
## Pattern 5 — Rate limiting yourself

Be a good citizen — don't hammer target servers. This uses the `governor` crate (not in the scaffold above — add `governor = "0.6"` to `Cargo.toml`):

```rust
use governor::{Quota, RateLimiter};
use std::num::NonZeroU32;

let quota = Quota::per_second(NonZeroU32::new(10).unwrap()); // 10 req/sec
let limiter = RateLimiter::direct(quota);

for url in &urls {
    limiter.until_ready().await;
    scrape_one(&client, url).await?;
}
```

## Scheduling on DomainIndia VPS

**Option 1 — cron** (simple):

```bash
# /etc/cron.d/scraper
0 */6 * * * scraper /opt/scraper/bin/scraper >> /var/log/scraper.log 2>&1
```

**Option 2 — systemd timer** (better — journald logs, retry semantics):

`/etc/systemd/system/scraper.service`:

```ini
[Unit]
Description=Web Scraper

[Service]
Type=oneshot
User=scraper
ExecStart=/opt/scraper/bin/scraper
EnvironmentFile=/opt/scraper/.env
```

`/etc/systemd/system/scraper.timer`:

```ini
[Unit]
Description=Run scraper every 6 hours

[Timer]
OnCalendar=00/6:00
Persistent=true
RandomizedDelaySec=300

[Install]
WantedBy=timers.target
```

```bash
sudo systemctl enable --now scraper.timer
systemctl list-timers
```

## Respecting robots.txt and ethics

```rust
// Pseudocode — check robots.txt before scraping
let robots = client.get("https://target.com/robots.txt").send().await?.text().await?;
// Parse with the `robotparser` crate, check if your user agent is allowed
```

Ethical scraping rules:

- Respect robots.txt
- Rate limit to 1 req/sec per domain (no rush)
- Identify yourself with a User-Agent + contact URL
- Cache responses — don't re-scrape unchanged pages
- Don't scrape private/paywalled content

Violating these isn't just unethical — it can get your IP banned or invite legal action.

## FAQ
**Q: Rust or Python for scraping?**

Python for <10K URLs and fast development. Rust for >100K URLs, huge files, or when reliability matters (no GC pauses, predictable memory). Both are valid.

**Q: How much RAM does a Rust scraper need?**

Tiny. 100 MB is plenty for most workloads. Our VPS Starter (2 GB) runs multiple concurrent scrapers comfortably.

**Q: Is scraping legal?**

Depends on target's terms + local law. Public data with proper rate limiting + robots.txt compliance is generally OK. Always consult a lawyer for commercial scraping.

**Q: Is a headless browser needed?**

For pure HTML, reqwest + scraper works. For JS-rendered pages, use chromiumoxide (Chrome DevTools Protocol) or fantoccini (Selenium WebDriver) — much heavier, needs ~500 MB RAM per instance.

**Q: Can I deploy this on shared hosting?**

No. Scrapers are long-running custom binaries and need custom port access — shared hosting supports neither. VPS only.

Rust data pipelines love a fast VPS with lots of CPU.
