Multi-Platform Data Extraction Pipeline

Problem

Businesses need structured data from public sources — Google Maps business listings, e-commerce product data, job postings — but building reliable scrapers that handle anti-bot measures, pagination, and data normalization is complex and time-consuming.

Solution

Built a production scraping platform with 10 API endpoints covering 6 data sources plus utility endpoints for screenshots and PDF generation. The system uses browser automation with stealth techniques (random viewports, user agent rotation, human-like delays) to handle anti-bot measures, and delivers structured output in JSON/CSV formats.
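As an illustration of the structured-output step, here is a minimal sketch using only the standard library. The function names (normalize, to_json, to_csv) and the cleaning rules are assumptions made for this example, not the platform's actual code.

```python
import csv
import io
import json


def normalize(raw: dict) -> dict:
    """Lower-case keys, collapse whitespace, and map empty strings to None."""
    out = {}
    for key, value in raw.items():
        clean_key = key.strip().lower().replace(" ", "_")
        if isinstance(value, str):
            value = " ".join(value.split()) or None
        out[clean_key] = value
    return out


def to_json(records: list) -> str:
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records: list) -> str:
    # Use the union of keys, in first-seen order, as the CSV header.
    fields = []
    for record in records:
        for key in record:
            if key not in fields:
                fields.append(key)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing keys and None become empty cells
    return buf.getvalue()
```

Records from different scrapers can then share one export path regardless of which fields each source provides.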

Deployed as a self-hosted FastAPI service behind a Cloudflare Tunnel, with additional distribution through Apify marketplace actors, RapidAPI Hub subscriptions, and Fiverr gigs.

Architecture

Client Request → FastAPI Router → Scraper Engine → Data Pipeline → Response
                     │                  │                │
              Rate Limiter      Playwright/HTTP     Normalization
              API Key Auth      Stealth Techniques  JSON/CSV Output
                                Retry Logic         Pagination

Each scraper module is independent with its own parsing logic, but shares common infrastructure: browser stealth configuration, rate limiting, error handling, and output formatting.
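One piece of that shared infrastructure, the per-API-key rate limiter, could be sketched as a token bucket. This is an assumption about the general technique rather than the platform's implementation; the RateLimiter class and its parameters are hypothetical.

```python
import time


class RateLimiter:
    """Token bucket per API key: `capacity` is the burst size,
    `refill_rate` is tokens added per second."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._clock = clock          # injectable so the limiter can be tested
        self._buckets = {}           # api_key -> (tokens, last_seen)

    def allow(self, api_key: str) -> bool:
        now = self._clock()
        # A key seen for the first time starts with a full bucket.
        tokens, last_seen = self._buckets.get(api_key, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last_seen) * self.refill_rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[api_key] = (tokens, now)
        return allowed
```

A router would call allow(api_key) before dispatching to a scraper module and return HTTP 429 when it yields False.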

Key Decisions

Hybrid scraping approach. Some targets work fine with HTTP requests + HTML parsing (fast, cheap). Others require full browser automation via Playwright (slower, but handles JavaScript-rendered content). The system selects the right approach per target.
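The per-target selection can be pictured as a small decision function. The sketch below is illustrative only: TargetProfile, its two flags, and choose_mode are hypothetical names, and the real system may weigh other signals.

```python
from dataclasses import dataclass
from enum import Enum


class FetchMode(Enum):
    HTTP = "http"        # plain request + HTML parsing (fast, cheap)
    BROWSER = "browser"  # full browser automation (slower, renders JavaScript)


@dataclass(frozen=True)
class TargetProfile:
    name: str
    needs_javascript: bool   # content is rendered client-side
    blocks_plain_http: bool  # plain requests are challenged or blocked


def choose_mode(profile: TargetProfile) -> FetchMode:
    """Prefer the cheap HTTP path; fall back to a browser only when required."""
    if profile.needs_javascript or profile.blocks_plain_http:
        return FetchMode.BROWSER
    return FetchMode.HTTP
```

Keeping the decision in one place means a target can be moved between modes by changing its profile rather than its parsing code.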

Stealth over brute-force proxies. Rather than relying solely on proxy rotation (which is expensive), the scrapers use playwright-stealth with randomized viewports, user agents, geolocation spoofing, and human-like interaction delays. Proxy support is built in for when stealth alone isn’t enough.
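A minimal sketch of the randomization side of this, assuming the option pools shown (the viewport sizes, user agent strings, and coordinates are placeholder values, not the ones used in production). The dictionary keys mirror keyword arguments accepted by Playwright's browser.new_context in Python, so the result could be passed as new_context(**options).

```python
import random

# Illustrative pools; a real deployment would maintain larger, current lists.
VIEWPORTS = [(1366, 768), (1440, 900), (1536, 864), (1920, 1080)]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]
LOCATIONS = [
    {"latitude": 35.7796, "longitude": -78.6382},  # Raleigh, NC
    {"latitude": 35.2271, "longitude": -80.8431},  # Charlotte, NC
]


def random_context_options(rng: random.Random) -> dict:
    """Pick one viewport, user agent, and geolocation per browser context."""
    width, height = rng.choice(VIEWPORTS)
    return {
        "viewport": {"width": width, "height": height},
        "user_agent": rng.choice(USER_AGENTS),
        "geolocation": rng.choice(LOCATIONS),
        "permissions": ["geolocation"],
    }


def human_delay(rng: random.Random, low: float = 0.8, high: float = 2.5) -> float:
    """Seconds to pause between interactions, drawn uniformly from [low, high]."""
    return rng.uniform(low, high)
```

Passing in a random.Random instance keeps runs reproducible when a seed is supplied, which helps when debugging a block that only appears under certain fingerprints.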

Multi-channel monetization. The same scraping infrastructure powers RapidAPI subscriptions (tiers from free to $99.99/mo), Fiverr gigs ($30-100/order), Apify marketplace actors, and direct CSV data sales on Gumroad.

Automated lead generation. A daily pipeline runs across 10 business verticals (HVAC, plumbing, restaurants, dental, etc.), generating 5,600+ leads per run, stored as timestamped CSVs.
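The timestamped-CSV step might look like the following sketch. The column set, the file naming scheme, and the write_leads helper are assumptions for illustration; the source does not specify them.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column set for a lead record.
FIELDS = ["name", "phone", "address", "vertical"]


def write_leads(leads: list, vertical: str, out_dir: Path, run_time: datetime) -> Path:
    """Write one CSV per vertical per run, named <vertical>_<UTC timestamp>.csv."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = run_time.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out_dir / f"{vertical}_{stamp}.csv"
    with path.open("w", newline="", encoding="utf-8") as fh:
        # extrasaction="ignore" drops scraped fields outside the fixed schema.
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for lead in leads:
            writer.writerow({**lead, "vertical": vertical})
    return path
```

Because each run writes a new file, earlier runs are never overwritten and can be compared or sold as separate snapshots.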

Results

10 API endpoints: Google Maps, e-commerce (Amazon/Walmart), Indeed jobs, Yelp, Google SERP, screenshot capture, PDF generation, health check, tier info, and admin
Published on RapidAPI Hub with tiered pricing (Free/Pro $9.99/Ultra $49.99/Mega $99.99)
3 Apify actors live on marketplace (Google Maps, e-commerce, Indeed)
Fiverr gig live with Basic $30/Standard $50/Premium $100 pricing
11 Gumroad data products for NC business leads
Automated daily lead generation across 10 verticals (5,600+ leads per run)
Running 24/7 via 4 launchd services behind Cloudflare Tunnel

How This Scales

Proxy infrastructure — Adding residential proxy rotation for high-volume targets that resist stealth techniques alone.
Webhook delivery — Push results to client endpoints instead of polling, for large async scraping jobs.
Custom scraper builder — Web UI where clients define CSS selectors and get a scraper generated automatically, reducing per-client development time.
Data freshness SLAs — Scheduled re-scrapes with diff detection, so clients always get current data without manual re-runs.
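Of the items above, diff detection is the most mechanical, so here is one way it could work: fingerprint each record and compare two snapshots by a stable key. This is a sketch of a planned feature, not existing code; the "id" key and both function names are hypothetical.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Stable hash of a record, independent of key order."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_snapshots(old: list, new: list, key: str = "id") -> dict:
    """Compare two scrape snapshots and report added, removed, and changed keys."""
    old_by_key = {r[key]: fingerprint(r) for r in old}
    new_by_key = {r[key]: fingerprint(r) for r in new}
    return {
        "added": sorted(k for k in new_by_key if k not in old_by_key),
        "removed": sorted(k for k in old_by_key if k not in new_by_key),
        "changed": sorted(
            k for k in new_by_key
            if k in old_by_key and new_by_key[k] != old_by_key[k]
        ),
    }
```

Only the keys in the result would need to be pushed to a client, which pairs naturally with webhook delivery.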

Tech Stack

Backend: Python, FastAPI, uvicorn
Scraping: Playwright (with stealth), BeautifulSoup, httpx
Data: pandas, JSON/CSV export
Distribution: Apify actors, RapidAPI Hub, Fiverr, Gumroad
Infrastructure: launchd services, Cloudflare Tunnel