Rust for Web Scraping and Data Pipelines on DomainIndia VPS
Why Rust for data pipelines
Typical scraping/ETL workloads in Python:
- scrape 10K URLs → takes 2 hours (requests + BeautifulSoup, single-threaded)
- parse 5 GB JSON → 20 minutes (Python's json parser)
- load into DB → bottlenecked by per-row inserts
Same workloads in Rust:
- 10K URLs → 3 minutes (async reqwest, 100 concurrent)
- 5 GB JSON → 90 seconds (serde_json + rayon)
- DB load → 30 seconds with COPY FROM + binary format
Rust wins when you're CPU/IO-bound on data work. For light scraping (100 URLs/day), Python is fine — less code to write.
Project scaffold
cargo new --bin scraper && cd scraper
Cargo.toml:
[dependencies]
tokio = { version = "1.36", features = ["full"] }
reqwest = { version = "0.12", features = ["json", "gzip", "rustls-tls"] }
scraper = "0.19" # HTML parsing
serde = { version = "1", features = ["derive"] }
serde_json = "1"
anyhow = "1" # error handling
tracing = "0.1"
tracing-subscriber = "0.3"
futures = "0.3"
sqlx = { version = "0.7", features = ["runtime-tokio-rustls", "postgres", "chrono"] }
csv = "1.3"
rayon = "1" # data parallelism (Pattern 3)
governor = "0.6" # rate limiting (Pattern 5)
Pattern 1 — Async bulk scraping
Scrape 10,000 URLs concurrently, extract title + price, save to CSV.
use futures::stream::{self, StreamExt};
use reqwest::Client;
use scraper::{Html, Selector};
use serde::Serialize;

#[derive(Serialize, Debug)]
struct Product {
    url: String,
    title: String,
    price: Option<String>,
}

// Fetch one page and pull out the title and (optional) price.
async fn scrape_one(client: &Client, url: &str) -> anyhow::Result<Product> {
    // error_for_status() turns 4xx/5xx responses into errors instead of parsing error pages.
    let html = client.get(url).send().await?.error_for_status()?.text().await?;
    let doc = Html::parse_document(&html);
    // The selectors are constants, so a parse failure is a programming error.
    let title_sel = Selector::parse("h1.product-title").unwrap();
    let price_sel = Selector::parse("span.price").unwrap();
    let title = doc.select(&title_sel)
        .next()
        .map(|n| n.text().collect::<String>().trim().to_string())
        .unwrap_or_default();
    let price = doc.select(&price_sel)
        .next()
        .map(|n| n.text().collect::<String>().trim().to_string());
    Ok(Product { url: url.to_string(), title, price })
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    tracing_subscriber::fmt::init();
    // One URL per line; blank lines are skipped.
    let urls: Vec<String> = std::fs::read_to_string("urls.txt")?
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .map(String::from)
        .collect();
    let client = Client::builder()
        .user_agent("MyScraper/1.0 (+https://yourcompany.com/bot)")
        .timeout(std::time::Duration::from_secs(10))
        .gzip(true)
        .build()?;
    let results: Vec<Product> = stream::iter(urls.iter())
        .map(|url| {
            let client = &client;
            async move {
                scrape_one(client, url).await
                    .unwrap_or_else(|e| {
                        tracing::warn!("failed {url}: {e}");
                        // Keep a placeholder row so failed URLs stay visible in the CSV.
                        Product { url: url.to_string(), title: String::new(), price: None }
                    })
            }
        })
        .buffer_unordered(50) // at most 50 requests in flight
        .collect()
        .await;
    let mut wtr = csv::Writer::from_path("products.csv")?;
    for p in &results { wtr.serialize(p)?; }
    wtr.flush()?;
    tracing::info!("scraped {} products", results.len());
    Ok(())
}
buffer_unordered(50) caps concurrency, which is essential to avoid being rate-limited or overwhelming target servers. Results arrive in completion order, not input order.
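To make the concurrency cap concrete, here is a minimal standard-library sketch of the same idea using OS threads instead of async tasks. It is not how buffer_unordered is implemented; `run_bounded` is a hypothetical helper written for illustration, and it assumes the work can run on plain threads.

```rust
use std::sync::Mutex;
use std::thread;

/// Run `job` over every item using at most `limit` worker threads.
/// Results come back in completion order, as with `buffer_unordered`.
fn run_bounded<T, R, F>(items: Vec<T>, limit: usize, job: F) -> Vec<R>
where
    T: Send,
    R: Send,
    F: Fn(T) -> R + Sync,
{
    let queue = Mutex::new(items.into_iter());
    let results = Mutex::new(Vec::new());
    thread::scope(|s| {
        for _ in 0..limit.max(1) {
            s.spawn(|| loop {
                // Take the next item; the lock is released before the job runs.
                let next = queue.lock().unwrap().next();
                match next {
                    Some(item) => {
                        let out = job(item);
                        results.lock().unwrap().push(out);
                    }
                    None => break,
                }
            });
        }
    });
    results.into_inner().unwrap()
}
```

Because only `limit` workers exist, no more than `limit` jobs can be in flight at once, however long the input list is. The async version gets the same guarantee with far less memory per in-flight request.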
Pattern 2 — Streaming JSON parsing
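For a 5 GB JSON file that won't fit in memory, stream it line by line if it is NDJSON (one JSON object per line), or use serde_json::Deserializer::from_reader(...).into_iter() for a stream of concatenated JSON values. The NDJSON case: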
use std::fs::File;
use std::io::{BufRead, BufReader};
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct Event {
    user_id: u64,
    action: String,
    timestamp: i64,
}

fn process_ndjson(path: &str) -> anyhow::Result<u64> {
    let file = File::open(path)?;
    let reader = BufReader::new(file);
    let mut count = 0;
    for line in reader.lines() {
        let line = line?;
        if line.is_empty() { continue; }
        let event: Event = serde_json::from_str(&line)?;
        // process event...
        count += 1;
        if count % 100_000 == 0 {
            tracing::info!("processed {}", count);
        }
    }
    Ok(count)
}
Memory usage: flat, regardless of file size.
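The loop above allocates a fresh String for every line. If that shows up in a profile, one option is to reuse a single buffer with read_line. The sketch below uses only the standard library; `for_each_line` is a hypothetical helper, and the JSON parsing step is left to the caller's closure.

```rust
use std::io::{self, BufRead};

/// Call `handle` once per non-empty line, reusing a single buffer.
/// Returns the number of lines handled.
fn for_each_line<R: BufRead>(mut reader: R, mut handle: impl FnMut(&str)) -> io::Result<u64> {
    let mut buf = String::new();
    let mut count = 0u64;
    loop {
        buf.clear(); // keeps the allocation, drops the old contents
        if reader.read_line(&mut buf)? == 0 {
            break; // end of input
        }
        // read_line keeps the line terminator, so strip it here.
        let line = buf.trim_end_matches(&['\r', '\n'][..]);
        if line.is_empty() {
            continue;
        }
        handle(line);
        count += 1;
    }
    Ok(count)
}
```

Inside `handle` you would call serde_json::from_str on the borrowed line, exactly as process_ndjson does.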
Pattern 3 — Parallel CPU work with rayon
For transforming millions of rows, Tokio isn't the right tool (async is for IO). Use rayon for CPU:
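use rayon::prelude::*;

// RawRow, ProcessedRow, load_all and expensive_transform stand in for your own types and functions.
let rows: Vec<RawRow> = load_all();
let processed: Vec<ProcessedRow> = rows
    .par_iter()
    .map(|row| expensive_transform(row))
    .collect();
.par_iter() uses all CPU cores automatically.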
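For intuition about what data parallelism does here, the following standard-library sketch splits a slice into fixed chunks and maps each chunk on its own thread. It is a simplified stand-in, not rayon's implementation (rayon balances work dynamically rather than using fixed chunks), and `parallel_map` is a hypothetical name.

```rust
use std::thread;

/// Apply `f` to every element of `input` using up to `workers` threads.
/// Output order matches input order.
fn parallel_map<T, R, F>(input: &[T], workers: usize, f: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    if input.is_empty() {
        return Vec::new();
    }
    let workers = workers.max(1);
    // Ceiling division so every element lands in exactly one chunk.
    let chunk_len = (input.len() + workers - 1) / workers;
    let f = &f;
    thread::scope(|s| {
        // Spawn all workers first, then join them.
        let handles: Vec<_> = input
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(f).collect::<Vec<R>>()))
            .collect();
        // Join in spawn order to keep results aligned with the input.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Fixed chunks work well when every row costs about the same. When costs vary widely, a library such as rayon is usually the better choice.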
Pattern 4 — Bulk DB inserts with sqlx + COPY
Per-row inserts are slow. For 1M rows, use PostgreSQL's COPY FROM:
use sqlx::postgres::{PgPoolOptions, PgPool};
use std::io::Write; // needed for writeln! into a Vec<u8>

async fn bulk_insert(pool: &PgPool, rows: &[Product]) -> anyhow::Result<()> {
    let mut conn = pool.acquire().await?;
    let mut copy = conn.copy_in_raw(
        "COPY products (url, title, price) FROM STDIN (FORMAT csv)"
    ).await?;
    let mut buf = Vec::with_capacity(1024 * 1024);
    for p in rows {
        // In CSV mode an empty unquoted field is read as NULL, so a missing price becomes NULL.
        writeln!(
            &mut buf,
            "{},{},{}",
            csv_escape(&p.url),
            csv_escape(&p.title),
            p.price.as_deref().map(csv_escape).unwrap_or_default()
        )?;
    }
    copy.send(buf.as_slice()).await?;
    copy.finish().await?;
    Ok(())
}

// Quote a field if it contains a delimiter, quote or line break; double any embedded quotes.
fn csv_escape(s: &str) -> String {
    if s.contains(',') || s.contains('"') || s.contains('\n') || s.contains('\r') {
        format!("\"{}\"", s.replace('"', "\"\""))
    } else {
        s.to_string()
    }
}
100K rows inserted in ~2 seconds this way vs 5+ minutes with individual INSERTs.
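Two details matter once the row count grows: holding every row in one buffer costs memory, and PostgreSQL's CSV format reads an unquoted empty field as NULL but a quoted `""` as an empty string. The sketch below, standard library only, encodes rows in bounded chunks and keeps that distinction. `encode_field`, `encode_chunks` and the `max_bytes` limit are hypothetical names chosen for illustration.

```rust
/// Encode one optional field for PostgreSQL's CSV COPY format.
/// `None` becomes an empty unquoted field (read as NULL); `Some("")` becomes `""`.
fn encode_field(value: Option<&str>) -> String {
    match value {
        None => String::new(),
        Some(s) if s.is_empty() || s.contains(&[',', '"', '\n', '\r'][..]) => {
            format!("\"{}\"", s.replace('"', "\"\""))
        }
        Some(s) => s.to_string(),
    }
}

/// Group encoded rows into buffers of roughly `max_bytes` each.
/// A row is never split across two buffers.
fn encode_chunks(rows: &[Vec<Option<&str>>], max_bytes: usize) -> Vec<Vec<u8>> {
    let mut chunks = Vec::new();
    let mut current = Vec::new();
    for row in rows {
        let fields: Vec<String> = row.iter().map(|f| encode_field(*f)).collect();
        let line = fields.join(",") + "\n";
        // Start a new buffer if this row would push the current one over the limit.
        if !current.is_empty() && current.len() + line.len() > max_bytes {
            chunks.push(std::mem::take(&mut current));
        }
        current.extend_from_slice(line.as_bytes());
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Each chunk can then be passed to copy.send(...) in turn before the final copy.finish(). For a real pipeline you would encode and send one chunk at a time rather than collecting them all first.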
Pattern 5 — Rate limiting yourself
Be a good citizen — don't hammer target servers.
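The pacing idea can be sketched with the standard library alone. The example below enforces a fixed minimum gap per host, which is simpler than the GCRA algorithm governor implements. `HostPacer` and `reserve` are hypothetical names, and the caller is assumed to sleep for the returned duration before sending the request.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks when each host may next be contacted.
struct HostPacer {
    min_gap: Duration,
    next_slot: HashMap<String, Instant>,
}

impl HostPacer {
    fn new(min_gap: Duration) -> Self {
        Self { min_gap, next_slot: HashMap::new() }
    }

    /// Reserve the next slot for `host` and return how long the caller
    /// should wait (zero if the host is free right now).
    fn reserve(&mut self, host: &str, now: Instant) -> Duration {
        // The slot is never in the past: an idle host is available immediately.
        let slot = self.next_slot.get(host).copied().unwrap_or(now).max(now);
        self.next_slot.insert(host.to_string(), slot + self.min_gap);
        slot - now
    }
}
```

Passing `now` in explicitly keeps the logic easy to test. The governor snippet that follows applies a single global limit instead of a per-host one.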
use governor::{Quota, RateLimiter};
use std::num::NonZeroU32;

let quota = Quota::per_second(NonZeroU32::new(10).unwrap()); // 10 req/sec
let limiter = RateLimiter::direct(quota);
for url in &urls {
    limiter.until_ready().await;
    scrape_one(&client, url).await?;
}
Scheduling on DomainIndia VPS
Option 1 — cron (simple):
# /etc/cron.d/scraper
0 */6 * * * scraper /opt/scraper/bin/scraper >> /var/log/scraper.log 2>&1
Option 2 — systemd timer (better: journald logs, retry semantics):
/etc/systemd/system/scraper.service:
[Unit]
Description=Web Scraper
[Service]
Type=oneshot
User=scraper
ExecStart=/opt/scraper/bin/scraper
EnvironmentFile=/opt/scraper/.env
/etc/systemd/system/scraper.timer:
[Unit]
Description=Run scraper every 6 hours
[Timer]
OnCalendar=00/6:00
Persistent=true
RandomizedDelaySec=300
[Install]
WantedBy=timers.target
sudo systemctl enable --now scraper.timer
systemctl list-timers
Respecting robots.txt and ethics
// Pseudocode: check robots.txt before scraping
let robots = client.get("https://target.com/robots.txt").send().await?.text().await?;
// Parse with the `robotparser` crate and check whether your user agent is allowed
Ethical scraping rules:
- Respect robots.txt
- Rate limit to 1 req/sec per domain (no rush)
- Identify yourself with a User-Agent + contact URL
- Cache responses — don't re-scrape unchanged pages
- Don't scrape private/paywalled content
Violating these isn't just unethical: it can get your IP banned or invite legal action.
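To show what a robots.txt check involves, here is a deliberately simplified standard-library sketch. It reads only User-agent and Disallow lines and treats each Disallow value as a plain path prefix. Allow rules, wildcards and Crawl-delay are ignored, so a dedicated crate is the safer choice in production. `disallowed_prefixes` and `is_allowed` are hypothetical names.

```rust
/// Collect the `Disallow` path prefixes that apply to `agent`.
/// Groups naming the agent (case-insensitive substring match) are preferred;
/// otherwise the `*` group applies.
fn disallowed_prefixes(robots_txt: &str, agent: &str) -> Vec<String> {
    let agent = agent.to_ascii_lowercase();
    let mut specific = Vec::new();
    let mut wildcard = Vec::new();
    let mut saw_specific = false;
    // Which groups the current block of rules belongs to.
    let (mut in_specific, mut in_wildcard) = (false, false);
    let mut last_was_agent = false;

    for raw in robots_txt.lines() {
        // Strip comments and surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        let (key, value) = match line.split_once(':') {
            Some(kv) => kv,
            None => continue,
        };
        let (key, value) = (key.trim().to_ascii_lowercase(), value.trim());
        match key.as_str() {
            "user-agent" => {
                // Consecutive User-agent lines share one group of rules.
                if !last_was_agent {
                    in_specific = false;
                    in_wildcard = false;
                }
                let name = value.to_ascii_lowercase();
                if name == "*" {
                    in_wildcard = true;
                } else if !name.is_empty() && agent.contains(name.as_str()) {
                    in_specific = true;
                    saw_specific = true;
                }
                last_was_agent = true;
            }
            "disallow" => {
                // An empty Disallow value means "nothing is disallowed".
                if !value.is_empty() {
                    if in_specific { specific.push(value.to_string()); }
                    if in_wildcard { wildcard.push(value.to_string()); }
                }
                last_was_agent = false;
            }
            _ => last_was_agent = false,
        }
    }
    if saw_specific { specific } else { wildcard }
}

/// True if `path` is not covered by any disallowed prefix.
fn is_allowed(prefixes: &[String], path: &str) -> bool {
    !prefixes.iter().any(|p| path.starts_with(p.as_str()))
}
```

A scraper would fetch robots.txt once per host, compute the prefixes for its own User-Agent, and skip any URL whose path fails `is_allowed`.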
Common pitfalls
FAQ
Should I use Rust or Python for scraping?
Python for <10K URLs and fast development. Rust for >100K URLs, huge files, or when reliability matters (no GC pauses, predictable memory). Both are valid.
How much memory does a Rust scraper need?
Tiny. 100 MB is plenty for most workloads. Our VPS Starter (2 GB) runs multiple concurrent scrapers comfortably.
Is web scraping legal?
It depends on the target's terms and local law. Public data with proper rate limiting and robots.txt compliance is generally OK. Always consult a lawyer for commercial scraping.
How do I scrape JavaScript-rendered pages?
For pure HTML, reqwest + scraper works. For JS-rendered pages, use chromiumoxide (Chrome DevTools Protocol) or fantoccini (Selenium WebDriver); both are much heavier and need ~500 MB RAM per instance.
Can I run a scraper on shared hosting?
No. Scrapers are long-running binaries and need custom port access. VPS only.
Rust data pipelines love a fast VPS with lots of CPU.