Every proxy vendor on the market will promise you 99.9% uptime. Some even throw around 99.99%. But here’s what those numbers actually mean for your scraping operation, and why you can’t just buy reliability off a pricing page.
Three-nines availability (99.9%) gives you roughly 43.8 minutes of allowed downtime per month. That’s about 8.8 hours across a full year. Sounds generous until you realize a single provider hiccup during a peak crawl window can eat half that budget in one shot.
The bigger issue is how you define “uptime” in the first place. Most vendor SLAs measure server liveness. Your servers are responding to health checks? Great, the uptime clock keeps ticking. But Google’s SRE team has been arguing for years that availability should be measured as request success rate, not server status. For proxy infrastructure, that reframing changes everything. A proxy endpoint can be “up” while returning 403s, timeouts, or CAPTCHAs on 30% of your requests. That’s not 99.9% availability in any meaningful sense.
The SLI that actually matters looks like this:
Proxy data-plane availability = successful proxied requests ÷ total eligible proxied requests
And you need to define “successful” with teeth. A 200 response that returns a CAPTCHA page isn’t a success. A response that takes 45 seconds isn’t a success either, even if it eventually returns data.
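Here's what that definition looks like as code. This is a sketch, not a vendor API: the CAPTCHA markers and the 30-second latency budget are illustrative thresholds you'd tune for your own targets.

```python
import re

# "Success with teeth": a proxied response only counts toward the SLI
# numerator if it clears all three gates below. Thresholds are examples.
CAPTCHA_MARKERS = re.compile(r"captcha|are you a robot|unusual traffic", re.I)
LATENCY_BUDGET_S = 30.0

def is_successful(status, body, latency_s):
    if status != 200:
        return False   # 403s, 5xx, surfaced timeouts are failures
    if latency_s > LATENCY_BUDGET_S:
        return False   # too slow to be useful, even if data came back
    if CAPTCHA_MARKERS.search(body):
        return False   # a 200 wrapping a CAPTCHA page is not a success
    return True

def availability(samples):
    """samples: iterable of (status, body, latency_s) tuples."""
    samples = list(samples)
    ok = sum(is_successful(*s) for s in samples)
    return ok / len(samples) if samples else 0.0
```

The point of encoding this is that "availability" stops being a debate and becomes a function your whole pipeline agrees on.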
When Decodo claims 99.9% uptime or Bright Data advertises their premium SLA, those numbers describe their infrastructure availability, not your end-to-end success rate against whatever targets you’re hitting. Your actual three-nines reliability is something you engineer on top of what the vendor provides. That distinction is worth understanding before you spend a dollar on architecture.
What 99.9% Actually Costs You
Once you accept that proxy uptime is a request-success-rate problem, you need SLOs at three layers. Skip any one of them and you’ll have blind spots.
- Control plane. Can you authenticate, obtain IPs, and rotate proxy sessions? This is your vendor’s API availability. If their dashboard is down or session creation is timing out, nothing downstream works.
- Data plane. Are outbound requests actually succeeding through proxies with acceptable latency? This is the core “proxy uptime” number and the one most people fixate on.
- Pipeline. Are your scraping jobs completing within freshness and completeness targets? A proxy layer can be technically “up” while your pipeline is stuck in a retry loop, delivering stale data 40 minutes late.
The error budget is most useful when you express it in failed requests rather than minutes of downtime. If you’re running 100,000 requests per day and targeting 99.9% success, you get 100 allowed failures. That number is concrete enough to build alerting around.
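The arithmetic is trivial, which is exactly why it's worth writing down once and reusing everywhere, matching the 100,000-requests/day example above:

```python
# Error budget expressed in failed requests rather than minutes of downtime.
def error_budget(requests_per_day, slo, days=1):
    """Allowed failures over the window while still meeting the SLO."""
    total = requests_per_day * days
    return round(total * (1 - slo))

daily = error_budget(100_000, 0.999)        # 100 allowed failures per day
monthly = error_budget(100_000, 0.999, 30)  # 3,000 per 30-day month
```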
For alerting itself, threshold-based alerts (“error rate > 5%”) tend to generate noise. The SRE workbook’s approach of multi-window burn-rate alerting works better. You page someone when a fast burn would exhaust the monthly budget in hours, and you open a ticket when a slow burn would exhaust it in days. That separation keeps your on-call rotation sane.
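A minimal version of that logic, using the SRE workbook's example thresholds (14.4x burn over paired fast windows pages; 3x over paired slow windows files a ticket). The window pairings in the comments are the workbook's suggestions, not requirements:

```python
SLO = 0.999

def burn_rate(error_rate, slo=SLO):
    """1.0 means spending the budget exactly on schedule for the window."""
    return error_rate / (1 - slo)

def alert(long_window_rate, short_window_rate):
    """Page on a fast burn, ticket on a slow burn, stay quiet otherwise.
    Requiring both windows to agree suppresses noise from brief spikes."""
    if burn_rate(long_window_rate) >= 14.4 and burn_rate(short_window_rate) >= 14.4:
        return "page"     # e.g. 1h and 5m windows both burning fast
    if burn_rate(long_window_rate) >= 3.0 and burn_rate(short_window_rate) >= 3.0:
        return "ticket"   # e.g. 24h and 2h windows burning slowly
    return None
```

Note the short-window condition: once the error rate recovers, the short window drops below threshold and the alert clears even though the long window is still elevated.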
The Architecture That Actually Delivers Three Nines
The core pattern is a proxy gateway sitting between your scraper workers and your proxy vendors. It normalizes vendor APIs, enforces rate limits and retry policy, and handles failover. Without this layer, every scraper is independently managing vendor connections, retry logic, and failure handling. That’s how you get retry storms at 2 AM.
It doesn’t need to be complicated. The gateway routes requests to proxy providers based on health signals, enforces backpressure when things slow down, and trips circuit breakers when a vendor or region goes bad. Your scraper workers stay stateless and simple. They pull jobs from a durable queue, make requests through the gateway, and push results downstream.
Multi-Provider Routing and Vendor Failover
Running all traffic through a single proxy provider is a single point of failure. It doesn’t matter if that provider promises five nines. They will have incidents, region-specific degradations, and rate-limit changes that hit you without warning.
The fix is multi-provider routing behind your gateway. You maintain active connections to at least two providers, with health-based traffic steering. When Provider A’s success rate drops below your threshold, the gateway shifts traffic to Provider B automatically. Envoy’s circuit breaking and outlier detection model is a good reference for how to implement this, whether you’re using Envoy itself or building the logic into a custom gateway.
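A stripped-down version of that steering logic, loosely modeled on Envoy's outlier detection: track a rolling success rate per provider and route around anything below threshold. The provider names, window size, and 90% threshold are all illustrative.

```python
import collections

class ProviderHealth:
    """Rolling success-rate window for one proxy provider."""
    def __init__(self, window=100):
        self.results = collections.deque(maxlen=window)

    def record(self, ok):
        self.results.append(bool(ok))

    def success_rate(self):
        # No data yet: assume healthy so new providers get traffic
        return sum(self.results) / len(self.results) if self.results else 1.0

class Router:
    def __init__(self, providers, threshold=0.90):
        self.health = {p: ProviderHealth() for p in providers}
        self.threshold = threshold

    def pick(self):
        healthy = [p for p, h in self.health.items()
                   if h.success_rate() >= self.threshold]
        # If everything is degraded, fall back to the least-bad provider
        pool = healthy or list(self.health)
        return max(pool, key=lambda p: self.health[p].success_rate())
```

A real gateway would add per-region health and a cool-off before re-admitting an ejected provider, but the shape is the same.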
When evaluating providers for a multi-vendor pool, you want ones with redundant infrastructure and responsive support. Decodo is a reasonable option here, partly because their 24/7 team actually picks up when you need to escalate during a provider-side degradation. Bright Data and Oxylabs are other strong candidates worth testing against your specific target mix.
Graceful degradation is your last resort when capacity is impaired. That means reducing concurrency, prioritizing your highest-value targets, and delaying non-urgent crawls rather than hammering a degraded path with retries.
Scaling Without Breaking Things
Autoscaling proxy gateways and workers on CPU alone will leave you scaling too late. By the time CPU spikes, your queue is already backed up and latency has exploded. Scale on custom metrics instead. Queue depth and in-flight request count are better leading indicators.
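The scaling decision itself can be this simple. The 50-jobs-per-replica setpoint is an arbitrary example; the min/max bounds keep failover capacity and cost predictable.

```python
import math

def desired_replicas(queue_depth, target_per_replica=50, lo=2, hi=50):
    """Scale workers on queue depth (a leading indicator) instead of CPU.
    Setpoint and bounds are illustrative, not universal constants."""
    want = math.ceil(queue_depth / target_per_replica)
    return max(lo, min(hi, want))
```

Whether this runs as a KEDA scaler, an HPA on a custom metric, or a cron loop matters less than the input signal: queue depth moves before CPU does.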
Keep workers and gateway pods stateless for fast scaling and clean failover. Session state (when you need stable IP identity for a target) belongs in an external store, not baked into a gateway instance. If that instance dies, the session dies with it.
Connection pooling helps at high throughput. TLS handshakes are expensive, and pooling outbound connections reduces that overhead. But be careful with pooling against targets that track connection fingerprints. A pool of reused connections can look more suspicious than fresh ones, depending on the target’s detection stack.
Stopping Failures Before They Cascade
The fastest way to turn a minor provider hiccup into a full outage is unbounded retries. One provider starts returning errors. Your scrapers retry. Those retries double the load on the gateway. The gateway pushes more traffic to the remaining provider, which starts degrading under the extra volume. Within minutes, everything is down.
Amazon’s Builders’ Library covers this well. Jittered exponential backoff, strict retry budgets (max retries per job and per domain), and fast-fail timeouts are the minimum. Requests that fail with “do not retry” error classes (permanent blocks, authentication failures) should go straight to a dead-letter queue for analysis, not back into the retry loop.
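Those three pieces fit together in a few lines. The status codes in the permanent set and the retry budget of 4 are illustrative classifications, not a standard:

```python
import random

PERMANENT = {401, 403, 407}   # example "do not retry" classes: blocks, auth failures

def backoff_s(attempt, base=0.5, cap=30.0):
    """Full-jitter exponential backoff, per the Builders' Library pattern."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def handle_failure(status, attempt, max_retries=4):
    """Returns (action, delay_s) for a failed request attempt."""
    if status in PERMANENT:
        return ("dead_letter", 0.0)   # analyze offline, never re-enqueue
    if attempt >= max_retries:
        return ("dead_letter", 0.0)   # retry budget exhausted
    return ("retry", backoff_s(attempt))
```

The jitter matters as much as the exponent: without it, every worker that failed at the same moment retries at the same moment, and the thundering herd recreates the outage.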
Backpressure needs to be a first-class mechanism, not an afterthought. When your proxy layer slows down, your job scheduler should reduce production. If you’re using durable queues (SQS, RabbitMQ quorum queues, or Kafka with proper replication), the queue absorbs short bursts while backpressure signals slow the upstream. Without this, a 5-minute provider blip generates 30 minutes of cascading queue backup.
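One way to make backpressure first-class is a bounded in-flight window between the scheduler and the proxy layer. When the proxy layer slows down, releases lag behind acquires, submission stalls, and the durable queue upstream absorbs the burst instead of the gateway. The limit of 100 in-flight requests is arbitrary.

```python
import threading

class BackpressureGate:
    """Caps concurrent in-flight requests between scheduler and proxy layer."""
    def __init__(self, max_in_flight=100):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def submit(self, timeout=5.0):
        """Returns False when the pipeline is saturated; the caller should
        stop pulling new jobs rather than pile on more work."""
        return self._slots.acquire(timeout=timeout)

    def done(self):
        self._slots.release()
```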
Rate limiting per target domain is equally important. Even if your proxy infrastructure is healthy, blasting 500 RPS at a single domain will trigger rate limits and blocks that look like infrastructure failures in your metrics. Throttle per domain, per IP pool, and per account. Your proxy vendor’s uptime means nothing if you’re getting yourself blocked.
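A per-domain token bucket is the usual mechanism. Each target domain gets its own bucket, so a burst against one site can't consume another's allowance. The default of 5 requests per second with a burst of 10 is illustrative:

```python
import time
from collections import defaultdict

class DomainThrottle:
    """Token bucket per target domain; rate/burst values are examples."""
    def __init__(self, rate=5.0, burst=10.0):
        self.rate, self.burst = rate, burst
        self.tokens = defaultdict(lambda: burst)
        self.last = defaultdict(time.monotonic)

    def allow(self, domain):
        now = time.monotonic()
        elapsed = now - self.last[domain]
        self.last[domain] = now
        # Refill proportionally to elapsed time, capped at the burst size
        self.tokens[domain] = min(self.burst,
                                  self.tokens[domain] + elapsed * self.rate)
        if self.tokens[domain] >= 1.0:
            self.tokens[domain] -= 1.0
            return True
        return False
```

The same class can be layered: one instance keyed by domain, another by IP pool, another by account.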
Observability That Protects Your Error Budget
You can’t defend an error budget you can’t see. A minimum viable observability stack for proxy systems needs three signal types.
Metrics track your SLIs in real time. Success rate per provider, per region, per target domain. Latency percentiles (p50, p95, p99). Queue depth and saturation. These feed your burn-rate alerts.
Structured logs capture what happened on each request attempt. Vendor used, exit IP class, target domain, HTTP status code, error classification, and response time. When something breaks, these logs tell you why. OpenTelemetry gives you a vendor-neutral collection layer that works with Prometheus, Grafana Loki, Datadog, or whatever backend you prefer.
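In practice that means one JSON line per attempt with a fixed schema. The field values below are examples; any shipper (an OpenTelemetry collector, Loki, Datadog) can consume the same shape:

```python
import json
import logging

logger = logging.getLogger("proxy.requests")

def attempt_record(vendor, ip_class, domain, status, error_class, ms):
    """Serialize one request attempt as a JSON log line."""
    return json.dumps({
        "vendor": vendor,            # which provider served the attempt
        "exit_ip_class": ip_class,   # e.g. residential vs datacenter
        "target_domain": domain,
        "http_status": status,
        "error_class": error_class,  # None on success
        "response_ms": ms,
    })

def log_attempt(*args):
    logger.info(attempt_record(*args))
```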
Distributed traces connect the full path from job scheduling through the proxy gateway to the provider and back. When a job takes 90 seconds instead of 5, traces show you exactly where the time went. Was it DNS? The provider? The target site throttling you?
For alert routing, Prometheus Alertmanager handles the deduplication and grouping that keeps pages actionable. Without proper grouping, a provider degradation that affects 200 scrapers generates 200 separate alerts. That’s noise, not signal.
Running 24/7 Without Burning Out Your Team
Reliable on-call is a system, not a heroic individual answering their phone at 3 AM every night. If your scraping pipeline is business-critical, you need structure around incident response.
Start with runbooks for your top incident classes. A proxy provider outage runbook should cover symptoms (success rate drop on one vendor, latency spike), SLO impact assessment, immediate mitigations (shift traffic via gateway, reduce concurrency), diagnostic queries (top failing domains, provider health dashboard), and escalation to the provider’s support team. Having a provider like Decodo with 24/7 availability makes that last step actually work at 3 AM, rather than waiting until business hours for a response.
Follow-the-sun rotations work if your team spans time zones. If it doesn’t, primary/secondary rotations with explicit hand-off windows keep any single person from carrying the load for too long.
Canary releases matter here too. Routing changes, new parsing logic, and provider configuration updates are all change-risk. Roll them out to a small slice of traffic first and watch the SLIs before going wide. Most scraping outages I’ve seen weren’t caused by provider failures. They were caused by someone pushing a config change without testing it against live traffic.
And after every significant incident, run a blameless postmortem. Not to assign fault, but to find the systemic gap that let a small failure become a big one. The goal is fewer incidents over time, not faster heroics during each one.