Most guides on scraping job boards skip the part where you get a cease-and-desist letter. They jump straight to Python scripts and proxy configs, glossing over the fact that every major job board explicitly bans automated data collection in their terms of service. So before you spin up a crawler, you need to know what you’re actually allowed to do.
This piece breaks down what six major boards actually permit and the crawl engineering patterns that keep you compliant. It also covers what residential proxy bandwidth really costs when you’re pulling job data at volume.
What the Major Job Boards Actually Say About Scraping
The short version: almost all of them say no.
LinkedIn’s User Agreement prohibits using “scripts, robots, or crawlers” to scrape or copy the service, and separately bans bypassing access controls or use limits. Their crawling terms go further, stating that automated crawling without express permission is “strictly prohibited” and that crawlers must use their true IP and user-agent identity. The robots.txt file is essentially a pointer to those terms. If you don’t have a research or partner agreement, treat LinkedIn as off-limits for automated collection.
Indeed’s Site Rules prohibit automated systems, and they’re specific about it. Bots, scrapers, spiders, AI, and “Agentic AI” are all called out by name. Their robots.txt was blank when I checked it in March 2026, which creates an odd tension with the “no scraping” terms. Indeed does maintain partner APIs, but those are built for employer integrations and job disposition data, not for scraping search results.
Glassdoor’s terms ban “software or automated agents” for scraping without written permission. Their robots.txt is one of the most explicit I’ve seen. It disallows /job-view/, /search/, and paginated job SERP patterns like _P*.htm* and _IP*. If the robots.txt blocks the exact URL patterns you’d need to crawl, that’s a pretty clear signal.
Monster and CareerBuilder fall under the same Provider Terms, which prohibit “data mining, robots, spiders” and anything that imposes unreasonable load. Monster’s robots.txt is interesting because it allows /jobs/search as a path but disallows the query-parameter version /jobs/search?. It also publishes sitemaps, which may be the intended discovery mechanism for compliant crawlers.
ZipRecruiter’s robots.txt states outright that search result pages “are only allowed to be crawled by Googlebot.” They also block a long list of AI and crawler user agents by name, including GPTBot, ClaudeBot, and Diffbot.
The pattern is clear. These boards don't want you crawling their search results. Official APIs and partner programs exist for some of them, but they're typically scoped to job posting management, not bulk search-result extraction. If your use case requires large-scale SERP data, you either need written permission or a licensed data feed.
Crawl Architecture That Won’t Get You Banned
Assuming you have permission to crawl a job board (or you’re working with a board that permits it), the engineering challenge shifts to doing it without tripping rate limits or burning through your budget on redundant pages.
Partition queries instead of paginating deep. Deep pagination is expensive in every sense. It increases your request volume, triggers rate limits faster, and yields diminishing returns because deeper pages tend to be older listings and duplicates. A better approach is to split one broad query into many narrow ones. Slice by geography, job category, company, or time window (“last 24 hours” or “last 3 days” where the UI supports it). Each narrow query only needs 2-5 pages of results instead of 50+. You get better coverage with fewer total requests.
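The partitioning idea can be sketched as a simple cross-product of narrow dimensions. This is an illustrative sketch, not any board's API: the query field names (`q`, `location`, `posted_within`) are hypothetical placeholders you'd map to the target site's actual parameters.

```python
from itertools import product

def partition_queries(keywords, locations, windows):
    """Cross broad search terms with geographies and time windows to
    produce many narrow queries, each shallow enough to exhaust in a
    handful of result pages instead of deep pagination."""
    return [
        {"q": kw, "location": loc, "posted_within": win}
        for kw, loc, win in product(keywords, locations, windows)
    ]

queries = partition_queries(
    ["data engineer", "backend developer"],
    ["Austin, TX", "Denver, CO"],
    ["1d", "3d"],
)
# 2 keywords x 2 locations x 2 windows = 8 narrow queries
```

Each of those eight queries should bottom out in a few pages, where the single broad query "developer" would run 50+ pages deep with heavy overlap.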
Use novelty ratios to know when to stop. Once you’re paginating, you need a termination rule. The simplest one that actually works: track what percentage of job IDs on each page are new. If three consecutive pages return less than 5% new IDs, stop. You’ve hit the tail. A secondary check is posting age. If the median posting date on a page is older than your business horizon (say, 30 days), there’s no point going deeper.
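The novelty-ratio rule fits in a few lines. A minimal sketch, assuming you track job IDs in a set as you paginate; the 5% threshold and three-page patience window are the values from the rule above, exposed as parameters:

```python
def should_stop(seen_ids: set, page_ids: list, low_pages: int,
                threshold: float = 0.05, patience: int = 3):
    """Return (stop, updated_low_pages). Stop once `patience` consecutive
    pages each contribute fewer than `threshold` new job IDs."""
    new_ids = [jid for jid in page_ids if jid not in seen_ids]
    novelty = len(new_ids) / len(page_ids) if page_ids else 0.0
    seen_ids.update(new_ids)
    # Count consecutive low-novelty pages; reset on any fresh page.
    low_pages = low_pages + 1 if novelty < threshold else 0
    return low_pages >= patience, low_pages
```

The caller threads `low_pages` through the pagination loop and breaks as soon as `should_stop` returns true; the posting-age check slots in as a second condition on the same break.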
Rate control is a hierarchy, not a single knob. You need limits at multiple levels. Per-host limits are the most important for compliance. Per-IP limits prevent self-induced congestion when you’re routing through a proxy pool. Per-path-class limits let you throttle search pages differently from detail pages. And a global budget cap keeps your costs from running away.
The implementation pattern that works well here is a token bucket per host, combined with exponential backoff on errors. A 429 (Too Many Requests) or 503 means back off and retry with jitter. A 403 means stop, don’t retry. It’s a policy decision, not a transient error. And if you detect a CAPTCHA or bot challenge in the response HTML, treat it as a hard stop condition. Don’t try to solve it. Log it, escalate to a compliance review, and move on. This isn’t just an ethical position. Attempting to bypass bot detection is operationally fragile and explicitly banned by most of these boards’ terms. LinkedIn, for example, specifically prohibits masking your IP or user-agent identity.
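A sketch of the per-host token bucket and the response-policy mapping described above. The bucket parameters and the `classify` action names are illustrative choices, not a standard:

```python
import random
import time

class HostBucket:
    """Token bucket: roughly `rate` requests/second per host,
    with bursts up to `capacity`."""
    def __init__(self, rate=1.0, capacity=5):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def acquire(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            time.sleep((1 - self.tokens) / self.rate)  # wait for refill
            self.tokens = 0.0
        else:
            self.tokens -= 1

def classify(status: int, body: str) -> str:
    """Map a response to a crawl policy action."""
    if status in (429, 503):
        return "retry"     # transient: back off with jitter, then retry
    if status == 403:
        return "stop"      # policy signal: do not retry
    if "captcha" in body.lower():
        return "escalate"  # bot challenge: hard stop, flag for review
    return "ok"

def backoff_delay(attempt: int, base=2.0, cap=120.0) -> float:
    """Exponential backoff with full jitter."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Detecting a CAPTCHA reliably takes more than a substring match in practice, but the control flow is the point: `escalate` ends the crawl of that host rather than feeding into the retry loop.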
Default to HTTP clients, not headless browsers. If the data you need is available in the HTML response or a JSON API endpoint, a plain HTTP client is faster and cheaper. It’s also less likely to trigger bot detection. Only escalate to headless browser rendering when required fields are genuinely missing from the server response. And when you do use a browser, cache the rendered output aggressively. Don’t re-render pages you’ve already processed.
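The escalation-plus-cache pattern looks something like this sketch. Everything here is injected and hypothetical (`http_get`, `render_js`, `extract` stand in for your HTTP client, headless browser, and parser):

```python
import hashlib

def fetch_record(url, http_get, render_js, cache, required_fields, extract):
    """Try the cheap HTTP path first; escalate to browser rendering only
    when required fields are missing, and cache results by URL hash so
    rendered pages are never re-rendered."""
    key = hashlib.sha256(url.encode()).hexdigest()
    if key in cache:
        return cache[key]
    record = extract(http_get(url))
    if any(record.get(f) is None for f in required_fields):
        record = extract(render_js(url))  # expensive path, only when needed
    cache[key] = record
    return record
```

The cache here is any dict-like store; in production you'd back it with something persistent and add a TTL so listings are eventually refreshed.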
Keeping Your Data Clean at Scale
Getting the HTML is half the problem. The other half is making sure you’re not storing 40,000 duplicates of the same senior developer listing posted on three boards and reposted twice.
Deduplication works best as a hierarchy. Start with the native job ID if the board provides one and it’s stable across sessions. If not, normalize the canonical URL by stripping tracking parameters (utm_source, ref, from) and sorting query params. As a fallback, generate a content fingerprint by hashing the normalized title, company name, location, and the first 2,000 characters of the description. Fuzzy matching (edit distance on titles, TF-IDF on descriptions) is a last resort. It’s computationally expensive and introduces false positives at scale.
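The URL-normalization and fingerprint tiers are straightforward with the standard library. A sketch, assuming the tracking parameter list above; extend `TRACKING` for whatever parameters the boards you crawl actually append:

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

TRACKING = {"utm_source", "utm_medium", "utm_campaign", "ref", "from"}

def canonical_url(url: str) -> str:
    """Strip tracking params and sort the rest for a stable URL key."""
    p = urlparse(url)
    q = sorted((k, v) for k, v in parse_qsl(p.query) if k not in TRACKING)
    return urlunparse((p.scheme, p.netloc, p.path, "", urlencode(q), ""))

def fingerprint(title, company, location, description) -> str:
    """Content-hash fallback when no stable ID or canonical URL exists."""
    basis = "|".join(s.strip().lower()
                     for s in (title, company, location, description[:2000]))
    return hashlib.sha256(basis.encode()).hexdigest()
```

Check tiers in order (native ID, then canonical URL, then fingerprint) and record which tier matched; that tells you later how much trust to put in each dedup decision.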
For change detection, maintain three timestamps per record: first_seen_at, last_seen_at, and last_changed_at. Define which fields are “material” for change purposes. Title, company, location, salary, and employment type are usually the ones that matter. Description edits are noisier and less actionable. When a material field changes, bump the version counter and store both the old and new values. This diff history drives your recrawl scheduler. Jobs that change frequently get recrawled sooner. Stable listings can be sampled at longer intervals.
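The update logic can be sketched as one function applied per crawl observation. The record layout (`version`, `history`, the three timestamps) is an illustrative schema, not a standard:

```python
from datetime import datetime, timezone

MATERIAL = ("title", "company", "location", "salary", "employment_type")

def apply_observation(record: dict, observed: dict) -> dict:
    """Update a stored record from a fresh crawl; bump the version and
    log old/new pairs only when a material field actually changed."""
    now = datetime.now(timezone.utc).isoformat()
    record["last_seen_at"] = now
    diff = {f: (record.get(f), observed[f])
            for f in MATERIAL
            if f in observed and observed[f] != record.get(f)}
    if diff:
        record["history"].append(diff)  # old/new values per changed field
        record["version"] += 1
        record["last_changed_at"] = now
        record.update({f: observed[f] for f in diff})
    return record
```

A recrawl scheduler can then read `version` and `last_changed_at` to rank listings: frequent changers move up the queue, stable ones drop to a sampling interval.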
The completeness question is worth defining explicitly. What percentage of postings have all required fields (title, company, location, posting date, apply URL, description)? How quickly do you capture new postings after they go live? How quickly do you detect edits or closures? Setting SLAs for these metrics lets you allocate crawl budget where it actually matters, which is almost always “new and changed” over “historical backfill.”
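The field-completeness metric is the easiest of those to compute. A minimal sketch, using the required-field list from the paragraph above:

```python
REQUIRED = ("title", "company", "location",
            "posting_date", "apply_url", "description")

def completeness(records: list) -> float:
    """Fraction of records with every required field present and non-empty."""
    if not records:
        return 0.0
    full = sum(all(r.get(f) for f in REQUIRED) for r in records)
    return full / len(records)
```

Capture latency and edit-detection latency need timestamps on both sides (posting date vs. `first_seen_at`, edit time vs. `last_changed_at`), so they come almost for free once the change-detection timestamps exist.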
What Residential Proxies Cost You
If you’re crawling at any real volume, proxy bandwidth is your largest variable cost. Datacenter proxies are cheaper per GB but get flagged fast on sites with even basic bot detection. Residential proxies route through real consumer IPs, which makes them much harder to distinguish from organic traffic. The tradeoff is price.
Public pricing signals across major providers give you a rough picture. Decodo markets residential proxies starting from $2/GB, which puts them at the low end of the range. Bright Data’s pay-as-you-go pricing shows around $4/GB on their pricing page. SOAX lists plans starting at $3.60/GB. Infatica shows $4/GB for pay-as-you-go. IPRoyal starts at $7/GB for small volumes with discounts at scale.
Pool size and geo coverage matter for job data specifically. If you’re partitioning queries by geography (which you should be, per the architecture section above), you need IPs in the right locations. Decodo states 115M+ residential IPs across 195+ locations with both rotating and sticky session options. Bright Data claims 400M+ IPs. Oxylabs lists 175M+. The numbers are marketing claims and hard to verify independently, but the practical question is whether you can get stable sessions in the geos you need. For US-centric job board crawling, all the major providers have adequate coverage.
Session type matters too. Rotating sessions assign a new IP per request, which is good for search-result pages where each request is independent. Sticky sessions hold the same IP for a configurable window (Decodo offers up to 30 minutes on datacenter and longer on residential), which is better for multi-page sequences where you need to look like one continuous browser session walking through pagination.
Decodo’s $2/GB entry point is worth paying attention to if you’re doing cost modeling. On a typical job board SERP, a rendered HTML page runs 200-500 KB. At $2/GB, that’s roughly $0.001 per page fetch. A crawl that pulls 100,000 pages per month would cost around $100 in proxy bandwidth alone, before you factor in compute and storage. At $4/GB, that same crawl is $200. The gap compounds fast at higher volumes, which is why the per-GB rate is the number that matters most when you’re comparing providers for data-heavy workloads.
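The arithmetic above reduces to a one-line cost model, assuming decimal GB (how bandwidth providers typically bill):

```python
def monthly_proxy_cost(pages: int, avg_page_kb: float,
                       price_per_gb: float) -> float:
    """Bandwidth cost: total page bytes converted to GB times the rate."""
    gb = pages * avg_page_kb / 1_000_000  # decimal KB -> GB
    return gb * price_per_gb

monthly_proxy_cost(100_000, 500, 2.0)  # 50 GB at $2/GB -> $100
monthly_proxy_cost(100_000, 500, 4.0)  # same crawl at $4/GB -> $200
```

Running it across your expected page sizes and volumes makes the provider comparison concrete before you commit to a plan.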
One thing worth noting about success rate claims. Decodo advertises 99.86%, SOAX claims 99.95%, and Oxylabs markets “zero CAPTCHAs / zero IP blocking.” These are marketing figures, not SLAs. Real-world success rates depend entirely on the target site, your request patterns, and how aggressively the board’s bot detection is tuned. Don’t pick a provider based on who claims the highest number. Pick based on pricing structure, geo coverage, and session flexibility for your specific crawl workload.
