

By DomainIndia Team · DomainIndia Engineering
24 Apr 2026 · 6 min read
# Rust for Web Scraping and Data Pipelines on DomainIndia VPS
**TL;DR:** Rust's speed and memory safety make it ideal for data pipelines that scrape, transform, or ingest millions of records. This guide shows async scraping with reqwest + scraper, parsing JSON/CSV at scale, writing to PostgreSQL with sqlx, and scheduling on a DomainIndia VPS.
## Why Rust for data pipelines

Typical scraping/ETL workloads in Python:

- scrape 10K URLs → 2 hours (requests + BeautifulSoup, single-threaded)
- parse 5 GB JSON → 20 minutes (Python's json parser)
- load into DB → bottlenecked by per-row inserts

Same workloads in Rust:

- 10K URLs → 3 minutes (async reqwest, 100 concurrent)
- 5 GB JSON → 90 seconds (serde_json + rayon)
- DB load → 30 seconds with `COPY FROM` + binary format

Rust wins when you're CPU/IO-bound on data work. For light scraping (100 URLs/day), Python is fine — less code to write.

## Project scaffold

```bash
cargo new --bin scraper && cd scraper
```

`Cargo.toml`:

```toml
[dependencies]
tokio = { version = "1.36", features = ["full"] }
reqwest = { version = "0.12", features = ["json", "gzip", "rustls-tls"] }
scraper = "0.19"          # HTML parsing
serde = { version = "1", features = ["derive"] }
serde_json = "1"
anyhow = "1"              # error handling
tracing = "0.1"
tracing-subscriber = "0.3"
futures = "0.3"
sqlx = { version = "0.7", features = ["runtime-tokio-rustls", "postgres", "chrono"] }
csv = "1.3"
```

## Pattern 1 — Async bulk scraping

Scrape 10,000 URLs concurrently, extract title + price, save to CSV.
```rust
use futures::stream::{self, StreamExt};
use reqwest::Client;
use scraper::{Html, Selector};
use serde::Serialize;

#[derive(Serialize, Debug)]
struct Product {
    url: String,
    title: String,
    price: Option<String>,
}

async fn scrape_one(client: &Client, url: &str) -> anyhow::Result<Product> {
    let html = client.get(url).send().await?.text().await?;
    let doc = Html::parse_document(&html);
    let title_sel = Selector::parse("h1.product-title").unwrap();
    let price_sel = Selector::parse("span.price").unwrap();

    let title = doc.select(&title_sel)
        .next()
        .map(|n| n.text().collect::<String>().trim().to_string())
        .unwrap_or_default();
    let price = doc.select(&price_sel)
        .next()
        .map(|n| n.text().collect::<String>().trim().to_string());

    Ok(Product { url: url.to_string(), title, price })
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    tracing_subscriber::fmt::init();
    let urls: Vec<String> = std::fs::read_to_string("urls.txt")?
        .lines().map(String::from).collect();

    let client = Client::builder()
        .user_agent("MyScraper/1.0 (+https://yourcompany.com/bot)")
        .timeout(std::time::Duration::from_secs(10))
        .gzip(true)
        .build()?;

    let results: Vec<Product> = stream::iter(urls.iter())
        .map(|url| {
            let client = &client;
            async move {
                scrape_one(client, url).await.unwrap_or_else(|e| {
                    tracing::warn!("failed {url}: {e}");
                    Product { url: url.to_string(), title: String::new(), price: None }
                })
            }
        })
        .buffer_unordered(50) // 50 concurrent requests
        .collect()
        .await;

    let mut wtr = csv::Writer::from_path("products.csv")?;
    for p in &results {
        wtr.serialize(p)?;
    }
    wtr.flush()?;
    tracing::info!("scraped {} products", results.len());
    Ok(())
}
```

`buffer_unordered(50)` caps concurrency — essential to avoid being rate-limited or overwhelming target servers.
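Pattern 1 treats every failure as permanent, but network errors (timeouts, 429s, transient 5xx) often succeed on a second try. A minimal retry-with-exponential-backoff sketch — shown in blocking form using only the standard library; in the async pipeline above you would swap `std::thread::sleep` for `tokio::time::sleep`. The helper name `with_retry` is ours, not from any crate:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry a fallible operation up to `max_attempts` times, sleeping
/// 250 ms, 500 ms, 1 s, ... (capped at 10 s) between attempts.
fn with_retry<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => {
                attempt += 1;
                if attempt >= max_attempts {
                    // out of attempts — surface the last error
                    return Err(e);
                }
                // 250 ms * 2^(attempt-1), capped at 10 s
                let delay = Duration::from_millis(250)
                    .saturating_mul(1u32 << (attempt - 1).min(5));
                sleep(delay.min(Duration::from_secs(10)));
            }
        }
    }
}
```

Only retry on errors worth retrying — a 404 will still be a 404 on attempt three, so in practice you would inspect the error (or status code) before looping.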
## Pattern 2 — Streaming JSON parsing

For a 5 GB JSON file you can't load into memory, stream line-by-line (if NDJSON) or use `serde_json::Deserializer`:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct Event {
    user_id: u64,
    action: String,
    timestamp: i64,
}

fn process_ndjson(path: &str) -> anyhow::Result<u64> {
    let file = File::open(path)?;
    let reader = BufReader::new(file);
    let mut count = 0u64;
    for line in reader.lines() {
        let line = line?;
        if line.is_empty() { continue; }
        let event: Event = serde_json::from_str(&line)?;
        // process event...
        count += 1;
        if count % 100_000 == 0 {
            tracing::info!("processed {}", count);
        }
    }
    Ok(count)
}
```

Memory usage: flat, regardless of file size.

## Pattern 3 — Parallel CPU work with rayon

For transforming millions of rows, Tokio isn't the right tool (async is for IO). Use rayon for CPU (add `rayon = "1"` to `Cargo.toml`):

```rust
use rayon::prelude::*;

// `Row`, `load_all`, and `expensive_transform` are your own types/functions
let rows: Vec<Row> = load_all();
let processed: Vec<Row> = rows
    .par_iter()
    .map(|row| expensive_transform(row))
    .collect();
```

`.par_iter()` uses all CPU cores automatically.

## Pattern 4 — Bulk DB inserts with sqlx + COPY

Per-row inserts are slow. For 1M rows, use PostgreSQL's `COPY FROM`:

```rust
use std::io::Write;
use sqlx::postgres::PgPool;

async fn bulk_insert(pool: &PgPool, rows: &[Product]) -> anyhow::Result<()> {
    let mut conn = pool.acquire().await?;
    let mut copy = conn.copy_in_raw(
        "COPY products (url, title, price) FROM STDIN (FORMAT csv)"
    ).await?;

    let mut buf = Vec::with_capacity(1024 * 1024);
    for p in rows {
        writeln!(
            &mut buf,
            "{},{},{}",
            csv_escape(&p.url),
            csv_escape(&p.title),
            p.price.as_deref().unwrap_or("")
        )?;
    }
    copy.send(buf.as_slice()).await?;
    copy.finish().await?;
    Ok(())
}

fn csv_escape(s: &str) -> String {
    if s.contains(',') || s.contains('"') {
        format!("\"{}\"", s.replace('"', "\"\""))
    } else {
        s.to_string()
    }
}
```

100K rows inserted in ~2 seconds this way vs 5+ minutes with individual `INSERT`s.
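One caveat with the hand-rolled `csv_escape` above: it quotes commas and quotes, but a field containing a raw newline would also split a `COPY ... (FORMAT csv)` row. A slightly more defensive sketch (the `csv` crate already in the scaffold can also write into a `Vec<u8>` and handles all of this for you):

```rust
/// Quote a field for CSV if it contains any character that would
/// break an unquoted field: comma, double quote, CR, or LF.
/// Embedded quotes are doubled per the CSV convention.
fn csv_escape(s: &str) -> String {
    if s.contains(&[',', '"', '\r', '\n'][..]) {
        format!("\"{}\"", s.replace('"', "\"\""))
    } else {
        s.to_string()
    }
}
```

Scraped titles are exactly the kind of data that smuggles newlines in, so the extra two characters of pattern are cheap insurance.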
## Pattern 5 — Rate limiting yourself

Be a good citizen — don't hammer target servers. This uses the `governor` crate (not in the scaffold above — add `governor = "0.6"` to `Cargo.toml`):

```rust
use governor::{Quota, RateLimiter};
use std::num::NonZeroU32;

let quota = Quota::per_second(NonZeroU32::new(10).unwrap()); // 10 req/sec
let limiter = RateLimiter::direct(quota);

for url in &urls {
    limiter.until_ready().await;
    scrape_one(&client, url).await?;
}
```

## Scheduling on DomainIndia VPS

**Option 1 — cron** (simple):

```bash
# /etc/cron.d/scraper
0 */6 * * * scraper /opt/scraper/bin/scraper >> /var/log/scraper.log 2>&1
```

**Option 2 — systemd timer** (better — journald logs, retry semantics):

`/etc/systemd/system/scraper.service`:

```ini
[Unit]
Description=Web Scraper

[Service]
Type=oneshot
User=scraper
ExecStart=/opt/scraper/bin/scraper
EnvironmentFile=/opt/scraper/.env
```

`/etc/systemd/system/scraper.timer`:

```ini
[Unit]
Description=Run scraper every 6 hours

[Timer]
OnCalendar=00/6:00
Persistent=true
RandomizedDelaySec=300

[Install]
WantedBy=timers.target
```

```bash
sudo systemctl enable --now scraper.timer
systemctl list-timers
```

## Respecting robots.txt and ethics

```rust
// Pseudocode — check robots.txt before scraping
let robots = client.get("https://target.com/robots.txt").send().await?.text().await?;
// Parse with the `robotparser` crate, check if your user agent is allowed
```

Ethical scraping rules:

- Respect robots.txt
- Rate limit to 1 req/sec per domain (no rush)
- Identify yourself with a User-Agent + contact URL
- Cache responses — don't re-scrape unchanged pages
- Don't scrape private/paywalled content

Violating these isn't just unethical — it can get your IP banned or invite legal action.

## FAQ
**Q: Rust or Python for scraping?**

Python for <10K URLs and fast development. Rust for >100K URLs, huge files, or when reliability matters (no GC pauses, predictable memory). Both are valid.

**Q: How much RAM does a Rust scraper need?**

Tiny. 100 MB is plenty for most workloads. Our VPS Starter (2 GB) runs multiple concurrent scrapers comfortably.

**Q: Is scraping legal?**

Depends on target's terms + local law. Public data with proper rate limiting + robots.txt compliance is generally OK. Always consult a lawyer for commercial scraping.

**Q: Is a headless browser needed?**

For pure HTML, reqwest + scraper works. For JS-rendered pages, use chromiumoxide (Chrome DevTools Protocol) or fantoccini (Selenium WebDriver) — much heavier, needs ~500 MB RAM per instance.

**Q: Can I deploy this on shared hosting?**

No. Scrapers are long-running custom binaries and need custom port access — shared hosting supports neither. VPS only.

Rust data pipelines love a fast VPS with lots of CPU.
