An API that works fine at 100 requests per second can become a liability at 10,000. Not because the logic changed, but because the assumptions baked into the design stop holding at scale. Clients retry aggressively. Traffic spikes unpredictably. Downstream services slow down and back-pressure propagates upstream. Payment confirmations arrive twice.

Most of these failure modes are predictable. The patterns that prevent them — rate limiting, versioning, and idempotency — aren't exotic engineering. They're table stakes for any API that handles real traffic. The problem is that most teams implement them as afterthoughts, bolted on when something has already broken in production.
This post is about building them in from the start.
Rate Limiting: Protecting Your System From Yourself and Everyone Else
Rate limiting is often framed as a defence against malicious clients — bots, scrapers, bad actors. That's part of it. But the more important use case is protecting your system from legitimate traffic that exceeds what your infrastructure can actually serve.
A flash sale on a regional e-commerce platform. A push notification that sends 2 million users to the same product page simultaneously. A third-party integration that has a bug causing it to retry in a tight loop. All of these are real traffic patterns, all of them are potentially legitimate, and all of them can take down an unprotected API.
Rate limiting is how you define the contract: here's what this system is designed to handle, and here's what happens when you exceed it.
The three most common algorithms:
Token bucket gives each client a bucket that fills with tokens at a fixed rate. Each request consumes a token. When the bucket is empty, requests are rejected or queued. The bucket has a maximum capacity, which means clients can "save up" for short bursts — useful for APIs where occasional spikes are normal but sustained high volume is not.
Leaky bucket processes requests at a fixed output rate regardless of input rate. Excess requests queue (or are dropped). It smooths traffic more aggressively than token bucket and is useful when you need consistent downstream throughput — for example, protecting a database that can't handle burst writes.
Fixed window counts requests in a fixed time window (say, 1,000 requests per minute) and resets at the window boundary. Simple to implement, but has an edge case: a client can send 1,000 requests at 11:59 and another 1,000 at 12:00, effectively hitting 2,000 requests in two minutes without technically violating the rule. Sliding window counters fix this but at higher implementation cost.
The choice between them depends on your traffic pattern and what you're protecting. For most external-facing APIs, token bucket with a sliding window variant is a reasonable default.
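To make the token bucket concrete, here's a minimal single-process sketch of the algorithm described above. The class and method names are my own; a production limiter would keep this state in a shared store rather than in process memory.

```python
import time


class TokenBucket:
    """Token bucket limiter: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity              # start full, so short bursts succeed
        self.last_refill = time.monotonic()

    def allow(self):
        """Consume one token if available; return False to reject the request."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the bucket's capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# A bucket of capacity 5 refilling at 1 token/sec allows a burst of 5 requests,
# then roughly one request per second thereafter.
bucket = TokenBucket(rate=1.0, capacity=5)
results = [bucket.allow() for _ in range(7)]
```

Note how the burst allowance falls out of the capacity parameter: a client that has been idle accumulates up to five tokens, which is exactly the "save up for short bursts" behaviour described above.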
Where to implement it: as early in the request path as possible. An API gateway (Kong, AWS API Gateway, Nginx with rate limiting modules) handles this before your application code even sees the request. This matters because rate limiting at the application layer still consumes application resources to reject the request. At the gateway layer, you shed load before it reaches your compute.
The response matters too. A rejected request should return HTTP 429 with a Retry-After header telling the client when it can try again. An X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset header set tells well-behaved clients how to pace themselves. Design for good clients, not just bad ones.
API Versioning: Making Change Without Breaking Your Consumers
APIs are promises. The moment an external client — a mobile app, a partner integration, a third-party developer — starts depending on your API, changing it carries risk. Versioning is how you manage that risk without freezing your system in amber.
The hard truth: there is no perfect versioning strategy. Every approach involves trade-offs, and the right one depends on how your API is consumed.
URI versioning (/v1/orders, /v2/orders) is the most common and the most visible. The version is explicit in the URL, easy to route at the gateway level, and easy to document. The downside is it can encourage treating versions as separate products rather than as an evolving contract — teams end up maintaining /v1 and /v2 as parallel codebases, which compounds maintenance burden quickly.
Header versioning (Accept: application/vnd.spectredev.v2+json) keeps URLs clean and is arguably more semantically correct — the resource identity doesn't change, only the representation does. The trade-off is it's less visible, harder to test in a browser, and more complex to route at the infrastructure layer. It's the right approach for mature API programs; it's probably over-engineering for most startups.
Query parameter versioning (/orders?version=2) is easy to implement and easy to test, but mixes versioning concerns with resource-addressing concerns. Use it for internal tooling if it makes life easier. Don't use it for public APIs.
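For header versioning specifically, the routing layer needs to extract the version from the Accept value before dispatching. Here's a minimal sketch using the vendor media type from the example above; the function name, the default-version fallback, and the supported-version set are assumptions for illustration:

```python
import re

# Matches the vendor media type used above, e.g.
#   Accept: application/vnd.spectredev.v2+json
VERSION_RE = re.compile(r"application/vnd\.spectredev\.v(\d+)\+json")


def negotiate_version(accept_header, default=1, supported=(1, 2)):
    """Extract the API version from an Accept header, falling back to a default."""
    match = VERSION_RE.search(accept_header or "")
    if not match:
        return default                      # no vendor type: serve the default version
    version = int(match.group(1))
    if version not in supported:
        # An unknown version is a client error; map this to HTTP 406.
        raise ValueError(f"unsupported version v{version}")
    return version


v = negotiate_version("application/vnd.spectredev.v2+json")
```

The awkwardness this sketch hides — every route handler (or a shared middleware) must run this negotiation — is exactly the "more complex to route at the infrastructure layer" trade-off mentioned above.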
The versioning strategy matters less than the discipline around when you version. A change that adds a new optional field to a response is backwards compatible — don't version it. A change that removes a field, renames a field, or changes a field's type is breaking — version it. A change that alters the semantics of an existing field (same name, different meaning) is the most dangerous kind because it won't cause a client to fail immediately; it'll cause it to fail silently with wrong data.
Deprecation is part of the contract. When you release /v2, set a clear deprecation timeline for /v1 — six months is common for external APIs, three months is often enough for internal ones. Send Deprecation and Sunset response headers on every /v1 request. Log which clients are still hitting deprecated versions. Reach out to those clients directly before you pull the plug. The teams that handle API versioning well treat it as a communication problem as much as a technical one.
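The Deprecation and Sunset headers mentioned above can be attached by a small middleware. A sketch, assuming a hypothetical /v2/docs migration path (Sunset is defined in RFC 8594; Deprecation is an IETF draft header, so hedge on client support):

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def deprecation_headers(sunset_at):
    """Headers to attach to every response served from a deprecated API version."""
    return {
        "Deprecation": "true",                          # signals the version is deprecated
        "Sunset": format_datetime(sunset_at),           # RFC 8594: date the version is removed
        "Link": '</v2/docs>; rel="successor-version"',  # hypothetical migration target
    }


# Example: /v1 sunsets at the end of June 2026.
sunset = datetime(2026, 6, 30, tzinfo=timezone.utc)
headers = deprecation_headers(sunset)
```

Pairing these headers with per-client logging of deprecated-version traffic gives you the outreach list mentioned above for free.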
Idempotency: The Pattern That Saves You When the Network Lies
Networks are unreliable. Clients time out and retry. Load balancers reroute mid-request. Mobile apps lose connectivity at exactly the wrong moment and come back online assuming the last request failed.
In a read-heavy API, this is mostly fine — fetching the same resource twice is harmless. In a write-heavy API, it's a serious problem. A payment processed twice is a real financial error. An order created twice is a real fulfilment problem. A user created twice is a real data integrity problem.
Idempotency is the property that says: sending the same request multiple times has the same effect as sending it once. Implementing it correctly is one of the most valuable things you can do for an API that handles financial transactions, order management, or any operation where duplicates are costly.
The standard implementation uses an idempotency key — a unique identifier generated by the client and sent with each request, typically as a header (Idempotency-Key: <UUID>). The server stores the key and the result of the first successful processing. On subsequent requests with the same key, it returns the stored result without re-executing the operation.
The storage mechanism is usually a fast key-value store with a TTL — keys don't need to live forever, just long enough to cover the client's retry window. 24 hours is a common default for payment APIs; 7 days is more conservative for workflows with longer retry cycles.
A concrete example: a GoPay or OVO disbursement request that times out on the client side. Did the money move or not? Without idempotency, retrying is risky. With an idempotency key, the client retries with the same key, the server checks its store, sees the operation already completed, and returns the original successful response. No double disbursement. The client gets the confirmation it needed.
What to store: at minimum, the idempotency key, the response status code, and the response body. Some implementations also store the request body and validate that subsequent requests with the same key have the same body — if a client sends different parameters with the same idempotency key, that's a client bug, and you should return a 422 rather than silently processing the new parameters.
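Putting the pieces above together — store the key, status, and body; fingerprint the request; reject key reuse with different parameters — yields a sketch like this. An in-memory dict stands in for the TTL'd key-value store, and all function names are my own:

```python
import hashlib
import json

_store = {}  # stand-in for a key-value store with a TTL


def _fingerprint(body):
    """Stable hash of the request body, used to detect key reuse with new parameters."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def handle_with_idempotency(key, body, process):
    """Run `process(body)` at most once per idempotency key.

    Returns (status, response). `process` is the real side-effecting
    operation: charge a payment, create an order, and so on.
    """
    entry = _store.get(key)
    if entry is not None:
        if entry["fingerprint"] != _fingerprint(body):
            # Same key, different parameters: a client bug, as noted above.
            return 422, {"error": "idempotency key reused with a different body"}
        # Replay the stored result without re-executing the operation.
        return entry["status"], entry["response"]

    status, response = process(body)
    if 200 <= status < 300:
        # Only successful outcomes are stored; failed attempts may be retried.
        _store[key] = {
            "fingerprint": _fingerprint(body),
            "status": status,
            "response": response,
        }
    return status, response


calls = []

def charge(body):
    calls.append(body)                      # the side effect we want exactly once
    return 201, {"charged": body["amount"]}

first = handle_with_idempotency("key-1", {"amount": 50_000}, charge)
retry = handle_with_idempotency("key-1", {"amount": 50_000}, charge)
```

One design choice worth flagging: this sketch stores only successful results, so a request that failed with a 5xx can be retried under the same key. Some implementations also lock the key while the first request is in flight, to stop a concurrent retry racing past the store check.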
Idempotency keys should be client-generated. The client owns the key because the client is the one recovering from failure. Server-generated idempotency would require the client to have already received the key, which assumes the first request succeeded — defeating the purpose.
[→ Read: What is database sharding — and when does your startup actually need it]

How These Three Patterns Work Together
Rate limiting, versioning, and idempotency are often treated as separate concerns. In a well-designed high-throughput API, they interact.
Rate limiting shapes the load your system accepts. Idempotency handles the safe retry behaviour when requests fail. Versioning ensures that as you improve both of those mechanisms over time, you can do so without breaking existing clients.
A practical scenario: you're running a B2B payments API used by Indonesian SME accounting software integrators — similar to the kind of integrations built on top of platforms like Jurnal or Accurate. Your rate limits are per API key, not per IP, because your clients are businesses making requests on behalf of thousands of end users. Your idempotency implementation covers all POST and PATCH endpoints because those are the ones with real-world side effects. Your versioning is URI-based with a 6-month deprecation cycle because your clients are third-party developers who need predictability.
That's not a complex system. It's a coherent one. Each decision reinforces the others.
One thing not to overlook: documentation. An API with perfect rate limiting, versioning, and idempotency that is poorly documented will still fail in production — because clients will implement integrations incorrectly, hit rate limits they didn't know existed, and retry without idempotency keys because they didn't know they needed them. The OpenAPI spec is not documentation. It's a schema. Documentation explains the why and the what-happens-when.
[→ Read: Monolith vs modular monolith vs microservices: the honest decision framework]

A Note on When to Build This Versus When to Buy It
If you're building a public-facing API today, you probably don't need to implement rate limiting or versioning routing from scratch. API gateways — AWS API Gateway, Kong, Apigee, or the gateway layer of a managed Kubernetes platform — handle the infrastructure concerns and let your application focus on business logic.
What you do need to implement yourself is idempotency, because that's specific to your domain logic and your data model. No gateway can know whether a payment request has already been processed — only your application can.
The mistake we see most often is teams building sophisticated custom rate limiting middleware in their application framework when a gateway would have served them at a tenth of the cost — while simultaneously having no idempotency implementation at all for their payment endpoints, where the stakes are highest.
Spend your engineering effort where it can't be bought.
FAQ
Q: What HTTP status code should I return when a request is rate limited?
A: HTTP 429 (Too Many Requests). Always include a Retry-After header indicating when the client can next attempt the request — either as a number of seconds or an HTTP date. Without this, well-behaved clients can't back off intelligently and you'll see retry storms that compound the load problem you were trying to prevent.
Q: How do I handle idempotency for operations that involve multiple steps or downstream service calls?
A: This is the hard case. If your operation involves multiple downstream calls — update a record, charge a payment, send a notification — idempotency needs to cover the entire sequence, not just individual steps. The safest pattern is to treat the whole operation as a saga: each step is idempotent individually, and the overall operation can be retried from any point of failure. This requires careful state tracking (typically in your database, not just a cache) and is a significant design investment. For most teams, the first step is making the critical path idempotent and accepting that edge cases in complex sagas require manual reconciliation until you've hit that problem enough times to justify the engineering cost.
Q: Should internal APIs — services talking to each other within our own system — also be versioned?
A: With less formality, yes. If two internal services are deployed independently, a breaking change in one can break the other mid-deployment. Contract testing (tools like Pact) is often a better fit for internal APIs than explicit versioning, because it catches breaking changes before deployment rather than managing them after. For services deployed together or tightly coupled by design, a shared contract in code (a shared types library, a protobuf schema) is usually cleaner than versioning the HTTP surface.
Q: What's the right granularity for rate limits — per IP, per user, per API key?
A: It depends on who your clients are. Per-IP is appropriate for unauthenticated public endpoints where you don't yet know who the caller is. Per-user limits are right for authenticated user-facing endpoints where you're protecting against individual abuse. Per-API-key limits are right for B2B or developer APIs where the client is an organisation making requests on behalf of many end users — throttling by IP would punish them for traffic that's legitimately spread across many users. Most mature APIs use a combination: unauthenticated requests rate-limited by IP, authenticated requests by API key or user ID, with different limits for different endpoint tiers.
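The combination described in that answer amounts to a key-selection function at the front of the rate limiter. A sketch, where `request` is a plain dict standing in for a framework request object (field names are my own):

```python
def rate_limit_key(request):
    """Pick the identity a request is throttled under, per the tiers above."""
    if request.get("api_key"):
        return f"key:{request['api_key']}"    # B2B / developer traffic: per organisation
    if request.get("user_id"):
        return f"user:{request['user_id']}"   # authenticated end users: per account
    return f"ip:{request['ip']}"              # unauthenticated traffic: fall back to IP


k = rate_limit_key({"api_key": "ak_123", "ip": "1.2.3.4"})
```

The returned string becomes the bucket identifier in whatever limiter you use, which is why a B2B client with thousands of end users behind one key gets one shared budget rather than thousands of per-IP ones.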
Q: How long should idempotency keys be stored?
A: Long enough to cover your client's retry window with meaningful margin. For payment APIs, 24 hours is the industry norm — Stripe uses this, for example. For longer-running async workflows where a client might retry over days, 7 days is more conservative. There's a storage cost to longer TTLs if you're storing full response bodies at volume, but at most scales it's negligible. Err on the side of longer and trim based on actual storage pressure, not upfront assumptions.
Rate limiting, versioning, and idempotency aren't the most glamorous parts of API design. They won't make it into your launch post. But they're the difference between an API that holds up when traffic gets real and one that becomes a source of production incidents at the worst possible moment. The patterns are well-understood. The implementation cost is manageable. The cost of not doing it is paid in pages, customer refunds, and emergency architecture work at 2am.
Build it in from the start. Your future on-call self will notice.
External Documentation:
- Stripe API idempotency documentation
- Jurnal.id and Accurate.id — Indonesian SME accounting software platforms.