Multi-Platform Data Extraction Pipeline
Production scraping API with 10 endpoints that extract structured data from Google Maps, Amazon, Walmart, Indeed, Yelp, and Google SERP. Self-hosted, and distributed through RapidAPI, Apify, and Fiverr.
Problem
Businesses need structured data from public sources — Google Maps business listings, e-commerce product data, job postings — but building reliable scrapers that handle anti-bot measures, pagination, and data normalization is complex and time-consuming.
Solution
Built a production scraping platform with 10 API endpoints covering 6 data sources plus utility endpoints for screenshots and PDF generation. The system uses browser automation with stealth techniques (random viewports, user agent rotation, human-like delays) to handle anti-bot measures, and delivers structured output in JSON/CSV formats.
Deployed as a self-hosted FastAPI service behind a Cloudflare Tunnel, with additional distribution through Apify marketplace actors, RapidAPI Hub subscriptions, and Fiverr gigs.
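The normalization and JSON/CSV output step can be illustrated with a minimal sketch. The field names in `FIELDS`, the key aliases, and the helper names (`first_present`, `normalize`, `to_json`, `to_csv`) are assumptions made for illustration; the actual schema is not described in this write-up.

```python
import csv
import io
import json

# Hypothetical shared schema; the real field set is not documented here.
FIELDS = ["source", "name", "url", "rating"]


def first_present(raw: dict, *keys: str):
    """Return the first non-empty value found under any of the given keys."""
    for key in keys:
        value = raw.get(key)
        if value not in (None, ""):
            return value
    return None


def normalize(source: str, raw: dict) -> dict:
    """Map a scraper-specific record onto the shared FIELDS schema."""
    rating = first_present(raw, "rating", "stars")
    return {
        "source": source,
        "name": (first_present(raw, "name", "title") or "").strip(),
        "url": first_present(raw, "url", "link") or "",
        "rating": float(rating) if rating is not None else None,
    }


def to_json(records: list) -> str:
    """Serialize normalized records as a JSON array."""
    return json.dumps(records, ensure_ascii=False)


def to_csv(records: list) -> str:
    """Serialize normalized records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Each scraper would pass its raw records through `normalize` before export, so both output formats share one column order regardless of the source.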
Architecture
Client Request → FastAPI Router → Scraper Engine → Data Pipeline → Response
                       │                │                │
                 Rate Limiter   Playwright/HTTP     Normalization
                 API Key Auth   Stealth Techniques  JSON/CSV Output
                                Retry Logic         Pagination
Each scraper module is independent with its own parsing logic, but shares common infrastructure: browser stealth configuration, rate limiting, error handling, and output formatting.
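As one example of the shared infrastructure, a per-API-key rate limiter could look roughly like the sketch below. It uses a sliding window, which is an assumption; the write-up does not say which algorithm the service uses, and `SlidingWindowLimiter` is a hypothetical name.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per API key within a rolling time window."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # api_key -> timestamps of recent requests

    def allow(self, api_key: str) -> bool:
        """Record the request and return True, or return False if over the limit."""
        now = self.clock()
        hits = self._hits[api_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In a FastAPI service, a check like `limiter.allow(api_key)` would typically sit in a dependency that raises `HTTPException(status_code=429)` when it returns `False`. This version keeps state in memory, so it only suits a single-process deployment.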
Key Decisions
Hybrid scraping approach. Some targets work fine with HTTP requests + HTML parsing (fast, cheap). Others require full browser automation via Playwright (slower, but handles JavaScript-rendered content). The system selects the right approach per target.
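A per-target strategy selection of this kind might be sketched as follows. The target names in `TARGET_STRATEGY` are hypothetical placeholders (the write-up does not say which sources use which path), and the two fetch functions are stubs standing in for an httpx + BeautifulSoup path and a Playwright path.

```python
from enum import Enum


class Strategy(Enum):
    HTTP = "http"        # plain request + HTML parsing: fast and cheap
    BROWSER = "browser"  # full browser automation: slower, handles JS rendering


# Hypothetical mapping for illustration only.
TARGET_STRATEGY = {
    "static_listing": Strategy.HTTP,
    "js_rendered_app": Strategy.BROWSER,
}


def fetch_via_http(url: str) -> str:
    """Stub for the HTTP path; a real version would request and parse HTML."""
    return f"http:{url}"


def fetch_via_browser(url: str) -> str:
    """Stub for the browser path; a real version would drive a headless browser."""
    return f"browser:{url}"


FETCHERS = {
    Strategy.HTTP: fetch_via_http,
    Strategy.BROWSER: fetch_via_browser,
}


def scrape(target: str, url: str) -> str:
    """Dispatch to the strategy registered for the target.

    Unknown targets fall back to the browser path, which is slower but
    more likely to work on JavaScript-heavy pages.
    """
    strategy = TARGET_STRATEGY.get(target, Strategy.BROWSER)
    return FETCHERS[strategy](url)
```

Keeping the choice in a lookup table means a target can be moved from one path to the other without touching its parsing code.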
Stealth over brute-force proxies. Rather than relying solely on proxy rotation (which is expensive), the scrapers use playwright-stealth with randomized viewports, user agents, geolocation spoofing, and human-like interaction delays. Proxy support is built in for when stealth alone isn’t enough.
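The randomization part of this approach can be sketched with the standard library alone. The viewport sizes, user agent strings, base coordinates, and delay bounds below are illustrative assumptions, not the values used in production. The returned dictionary uses key names that match keyword arguments of Playwright's `browser.new_context()`; the playwright-stealth patches themselves are not shown.

```python
import random

# Illustrative pools; a production list would be larger and kept up to date.
VIEWPORTS = [(1280, 720), (1366, 768), (1536, 864), (1920, 1080)]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def random_context_options(rng: random.Random, base_lat: float = 35.78,
                           base_lon: float = -78.64) -> dict:
    """Build randomized browser-context options for one scraping session."""
    width, height = rng.choice(VIEWPORTS)
    return {
        "viewport": {"width": width, "height": height},
        "user_agent": rng.choice(USER_AGENTS),
        # Jitter the spoofed location slightly around an assumed base point.
        "geolocation": {
            "latitude": base_lat + rng.uniform(-0.05, 0.05),
            "longitude": base_lon + rng.uniform(-0.05, 0.05),
        },
        "permissions": ["geolocation"],
    }


def human_delay(rng: random.Random, low: float = 0.8, high: float = 2.5) -> float:
    """Pick a pause length in seconds to wait between page interactions."""
    return rng.uniform(low, high)
```

With Playwright installed, the options would be applied as `browser.new_context(**random_context_options(random.Random()))`, and `human_delay` would feed a sleep between clicks or scrolls.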
Multi-channel monetization. The same scraping infrastructure powers RapidAPI subscriptions (tiers from free to $99.99/mo), Fiverr gigs ($30-100/order), Apify marketplace actors, and direct CSV data sales on Gumroad.
Automated lead generation. A daily pipeline runs across 10 business verticals (HVAC, plumbing, restaurants, dental, etc.), generating 5,600+ leads per run, stored as timestamped CSVs.
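A daily run that writes one timestamped CSV per vertical could be structured like the sketch below. The column names (`name`, `phone`, `city`), the file naming pattern, and the `scrape_vertical` callable are assumptions for illustration, and only the four verticals named above are listed.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Four of the ten verticals; the remaining six are not named in this write-up.
VERTICALS = ["hvac", "plumbing", "restaurants", "dental"]
LEAD_FIELDS = ["name", "phone", "city"]  # hypothetical lead columns


def output_path(out_dir: Path, vertical: str, now: datetime) -> Path:
    """Build a timestamped file name such as hvac_20240102_030405.csv."""
    return out_dir / f"{vertical}_{now:%Y%m%d_%H%M%S}.csv"


def run_daily(out_dir: Path, scrape_vertical, now: datetime = None) -> dict:
    """Scrape every vertical, write one CSV each, and return lead counts.

    scrape_vertical is a caller-supplied function that takes a vertical name
    and returns a list of dicts keyed by LEAD_FIELDS.
    """
    now = now or datetime.now(timezone.utc)
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    for vertical in VERTICALS:
        leads = scrape_vertical(vertical)
        with output_path(out_dir, vertical, now).open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=LEAD_FIELDS)
            writer.writeheader()
            writer.writerows(leads)
        counts[vertical] = len(leads)
    return counts
```

Using one timestamp for the whole run keeps the files from a single day grouped together, and the returned counts give a quick check on the volume of each run.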
Results
- 10 API endpoints: Google Maps, e-commerce (Amazon/Walmart), Indeed jobs, Yelp, Google SERP, screenshot capture, PDF generation, health check, tier info, and admin
- Published on RapidAPI Hub with tiered pricing (Free/Pro $9.99/Ultra $49.99/Mega $99.99)
- 3 Apify actors live on marketplace (Google Maps, e-commerce, Indeed)
- Fiverr gig live with Basic $30/Standard $50/Premium $100 pricing
- 11 Gumroad data products for NC business leads
- Automated daily lead generation across 10 verticals (5,600+ leads per run)
- Running 24/7 via 4 launchd services behind Cloudflare Tunnel
How This Scales
- Proxy infrastructure — Adding residential proxy rotation for high-volume targets that resist stealth techniques alone.
- Webhook delivery — Push results to client endpoints instead of polling, for large async scraping jobs.
- Custom scraper builder — Web UI where clients define CSS selectors and get a scraper generated automatically, reducing per-client development time.
- Data freshness SLAs — Scheduled re-scrapes with diff detection, so clients always get current data without manual re-runs.
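The diff detection mentioned in the last item is a planned feature, so the following is only one possible shape for it: fingerprint each record and compare two snapshots by a stable key. The `id` key and the function names are hypothetical.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Hash a record's content so that any field change alters the digest."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_snapshots(previous: list, current: list, key: str = "id") -> dict:
    """Compare two scrape snapshots and list added, removed, and changed keys."""
    prev = {record[key]: fingerprint(record) for record in previous}
    curr = {record[key]: fingerprint(record) for record in current}
    return {
        "added": sorted(k for k in curr if k not in prev),
        "removed": sorted(k for k in prev if k not in curr),
        "changed": sorted(k for k in curr if k in prev and curr[k] != prev[k]),
    }
```

For instance, comparing `[{"id": "a", "price": 10}, {"id": "b", "price": 5}]` with `[{"id": "a", "price": 12}, {"id": "c", "price": 7}]` reports `a` as changed, `b` as removed, and `c` as added. A webhook delivery could then send only that diff instead of the full dataset.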
Tech Stack
- Backend: Python, FastAPI, uvicorn
- Scraping: Playwright (with stealth), BeautifulSoup, httpx
- Data: pandas, JSON/CSV export
- Deployment: Apify actors, RapidAPI proxy, Fiverr, Gumroad
- Infrastructure: launchd services, Cloudflare Tunnel