Summary
Calling OpenAI, Anthropic, or Bedrock directly from a Fastify route breaks under production load – load balancers time out, clients retry, and every failure becomes two billed calls. The fix: move the LLM call behind a BullMQ + Redis queue, return 202 Accepted with a job ID in under 100 ms, and stream tokens back over Server-Sent Events.
If your Node.js API endpoint that calls OpenAI, Anthropic, or Bedrock has been returning 502s under load, timing out at the load balancer, retrying and double-charging for the same completion, or leaving users staring at a spinner for half a minute, the problem isn’t your timeout values. It’s that you’re doing long-running work inside a request-response cycle that was never designed for it.
These are the exact symptoms our team at ScriptsHub Technologies hit when we shipped our first production AI feature on a standard Fastify API. What started as a two-line openai.chat.completions.create() call turned into a weekend of rolled-back deploys, a 2.8x LLM bill, and a sober rethink of what an AI-backed endpoint really is. This post is the architecture we landed on – a BullMQ + Fastify queue pattern that has handled every subsequent AI feature we’ve shipped without a production incident of the same class.
Why Synchronous OpenAI Calls Break Node.js APIs Under Load
A GPT-4-class completion with a moderate prompt and a thousand output tokens typically takes 8 to 20 seconds. Reasoning models and long contexts can push that to 60 or 90. During that time your Node.js event loop isn’t blocked – but the HTTP connection is open, the browser is counting, the load balancer is counting, and every layer between you and the client has its own opinion about how long a reasonable request should take.
The first time a cloud load balancer’s 60-second idle timeout fires in the middle of a paid LLM call, the client retries. The retry creates a second paid call. The user gets the first response as a timeout and the second as success – or sometimes two responses and confusion. The provider bill shows both calls. Multiply that by every user hitting the same endpoint during a traffic spike, and the bill grows faster than the complaints.
This isn’t a tuning issue. The request-response cycle is the wrong shape for this work, for three structural reasons:
- Retries are infrastructure, not code. Even if you fix your own client, browsers, CDNs, API gateways, and mobile SDK wrappers have their own retry behavior you don’t control.
- You can’t prioritize work. A user-facing request and a nightly batch job look identical to your API. Queues solve this at a primitive level – priority is a property of the enqueued job.
- Synchronous endpoints can’t survive a restart. Every in-flight LLM call is lost on deploy. The provider still bills you; the response vanishes into a closed socket.
Key Takeaways
- Synchronous OpenAI calls inside a Fastify route fail under load because cloud load balancers idle-timeout at 60 seconds while LLM completions can run 8 to 90 seconds.
- The fix is structural, not a tuning change: move the LLM call out of the request-response cycle into a background job queue.
- Client retries on timeout double-bill the provider – every failed completion becomes two paid OpenAI calls.
How OpenAI API Timeouts Broke Our Fastify Endpoint in Production
Our first AI feature was a content summarization endpoint built on a single Fastify route with a synchronous OpenAI call; the summary came back directly in the response body. It worked in development. It worked in staging. It even worked for the first two days in production. Then we shipped a launch email, and real traffic exposed the architectural limits.
Within four hours, traffic was running roughly four times our projected baseline, and three failure modes surfaced in quick succession:
- Load balancer timeouts on long completions. The default 60-second idle timeout started firing before long completions finished – about eight percent of requests.
- Client retries creating duplicate paid calls. The frontend had a retry-on-failure helper that treated timeouts as transient errors, so every failed request became two billed calls.
- Provider 429s triggering cascading retries. Because retries were hitting a provider already under pressure, we started seeing 429s, which triggered a third layer of client retries, which produced more 429s.
By the end of the week, our LLM bill had climbed to nearly 2.8× the budget we had allocated for the entire month. Although the product mostly worked, we were effectively paying for every failure twice because client retries kept triggering duplicate OpenAI calls. Even worse, we had almost no observability to distinguish whether “the LLM is slow,” “the provider is rate-limiting us,” or “our own application code is failing under load.”
We killed the feature for 48 hours, wrote a post-mortem, and came back with a different architecture.
How to Architect a BullMQ + Fastify Queue for Long-Running AI Tasks
The architecture we landed on has four components: a Fastify producer route, a Redis-backed BullMQ queue, a separate worker process, and a Server-Sent Events streaming endpoint. The hardest part was keeping each one minimal.

Four principles guided every decision:
- The producer does no LLM work. The Fastify route validates, computes an idempotency key, enqueues, and returns in milliseconds.
- The worker is a separate process, not a separate file. You can restart the API independently of the worker, scale them separately, and deploy them on different cadences.
- Redis is the only source of truth. No in-memory job state anywhere – every state transition is a Redis write.
- Idempotency is enforced at enqueue time. Same user + same prompt within a short window = same job. One LLM call, one bill, one result.
Fastify Producer: Enqueue the OpenAI Job and Return 202 Accepted
The Fastify producer is the only part of the OpenAI queue architecture that directly touches the HTTP request cycle. Two details make this pattern reliable at scale: the deterministic idempotency key prevents duplicate billing caused by client retries, and the 202 Accepted status code tells the client, “I’ve accepted your work – check back later for the result.”
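A minimal sketch of that producer, assuming BullMQ v5 and Fastify v4; the /summarize route, the llm-jobs queue name, the request shape, and the one-minute idempotency window are illustrative choices, not prescriptions:

```typescript
import crypto from "node:crypto";
import Fastify from "fastify";
import { Queue } from "bullmq";

const app = Fastify({ logger: true });

const llmQueue = new Queue("llm-jobs", {
  connection: { host: process.env.REDIS_HOST ?? "localhost", port: 6379 },
  defaultJobOptions: {
    attempts: 3,                                   // retried inside the worker, not by the client
    backoff: { type: "exponential", delay: 2000 }, // 2 s, then 4 s, then 8 s
    removeOnComplete: { age: 3600 },               // keep results around for an hour of polling
    removeOnFail: { age: 24 * 3600 },
  },
});

app.post("/summarize", async (request, reply) => {
  const { userId, text } = request.body as { userId?: string; text?: string };
  if (!userId || !text) {
    return reply.code(400).send({ error: "userId and text are required" });
  }

  // Deterministic idempotency key: the same user + the same prompt within the same
  // one-minute window hashes to the same job ID, so client retries deduplicate.
  const minuteBucket = Math.floor(Date.now() / 60_000);
  const jobId = crypto
    .createHash("sha256")
    .update(`${userId}:${minuteBucket}:${text}`)
    .digest("hex");

  await llmQueue.add("summarize", { userId, text }, { jobId });

  // 202 Accepted: the work is queued, not done. The client picks polling or streaming.
  return reply.code(202).send({
    jobId,
    statusUrl: `/jobs/${jobId}`,
    streamUrl: `/jobs/${jobId}/stream`,
  });
});

app.listen({ port: 3000 }).catch((err) => {
  app.log.error(err);
  process.exit(1);
});
```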

Why This Works
Because the jobId is derived from the input, a retry from a flaky client produces the same job ID, and BullMQ silently deduplicates. The response is 202 Accepted, the correct HTTP semantic for “I’ve taken your work, check back for the result.” The client gets both a polling URL and a streaming URL, and picks based on its environment.
BullMQ Worker: Handle OpenAI Rate Limits and Retries Automatically
The BullMQ worker runs as a separate Node.js process – separate entry point, separate container in production, separate logs, independently restartable. This is where the actual OpenAI streaming chat completion happens, where retries are enforced, and where rate limits are respected. Architecting workers like this requires careful AI model deployment decisions around concurrency, retries, and observability.
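A minimal sketch of the worker, assuming BullMQ v5, ioredis, and the official openai SDK; the llm-jobs queue name, the job:&lt;id&gt;:tokens Pub/Sub channel convention, and the model choice are assumptions carried over from the producer sketch above:

```typescript
import { Worker } from "bullmq";
import Redis from "ioredis";
import OpenAI from "openai";

const connection = { host: process.env.REDIS_HOST ?? "localhost", port: 6379 };
const publisher = new Redis(connection); // dedicated connection for Pub/Sub publishes
const openai = new OpenAI();             // reads OPENAI_API_KEY from the environment

const worker = new Worker(
  "llm-jobs",
  async (job) => {
    const { text } = job.data as { text: string };

    // Stream the completion, forwarding every token to a channel named after the job ID.
    const stream = await openai.chat.completions.create({
      model: "gpt-4o", // illustrative; use whatever model the job actually needs
      stream: true,
      messages: [
        { role: "system", content: "Summarize the user's text." },
        { role: "user", content: text },
      ],
    });

    let summary = "";
    for await (const chunk of stream) {
      const token = chunk.choices[0]?.delta?.content ?? "";
      if (!token) continue;
      summary += token;
      await publisher.publish(`job:${job.id}:tokens`, token);
    }

    await publisher.publish(`job:${job.id}:tokens`, "[DONE]");
    return { summary }; // persisted in Redis as the job's return value
  },
  {
    connection,
    concurrency: 8,                         // LLM calls are I/O-bound, so 8 in flight per process is fine
    limiter: { max: 50, duration: 60_000 }, // per-queue cap: 50 jobs per minute
  }
);

worker.on("failed", (job, err) => {
  console.error(`job ${job?.id} failed after attempt ${job?.attemptsMade}:`, err.message);
});
```

Note that attempts and backoff are job options rather than worker options; in this sketch they come from the defaultJobOptions set where the producer created the queue.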

Why This Works
concurrency: 8 lets a single worker process run 8 LLM calls in parallel, which is correct because LLM calls are I/O-bound, not CPU-bound. The limiter is BullMQ’s built-in per-queue token bucket capped at 50 jobs per minute – no external rate limiter needed to stay under the provider’s per-minute quota. attempts: 3 plus exponential backoff handles transient 429s and 5xxs automatically, with a 2-second initial delay that doubles each attempt. Jobs that fail all 3 attempts land in the failed jobs list – your dead-letter queue by another name.
When to Scale Horizontally
One worker process with concurrency: 8 handles roughly 8 to 16 jobs per minute at typical LLM latencies. When you need more, run more worker processes – BullMQ coordinates through Redis, so workers share the load automatically. No leader election, no sharding logic. Horizontal scale is a container replica count, not a code change.
Queueing, streaming, cost controls, and observability – in practice, these are the core infrastructure layers our team at ScriptsHub Technologies manages for engineering teams shipping production-grade LLM features on Node.js.
Streaming LLM Output to the Browser: SSE vs WebSocket
For one-shot LLM completions, Server-Sent Events (SSE) beat WebSockets. SSE is unidirectional (server to client), reconnects automatically, and runs over standard HTTP. Every load balancer, CDN, and corporate proxy already knows what to do with it. WebSockets require an HTTP upgrade handshake that some infrastructure layers handle awkwardly, and the bidirectional channel is a capability you pay for with complexity you don’t need when the client isn’t sending anything during the stream. For true chat with ongoing back-and-forth, WebSockets earn their keep. For the one-shot pattern – client submits a prompt, server streams tokens back, connection closes – SSE wins.
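A minimal sketch of the polling and streaming endpoints, assuming the same llm-jobs queue and job:&lt;id&gt;:tokens channel convention as the worker sketch; the route paths are illustrative:

```typescript
import Fastify from "fastify";
import Redis from "ioredis";
import { Queue } from "bullmq";

const app = Fastify();
const connection = { host: process.env.REDIS_HOST ?? "localhost", port: 6379 };
const llmQueue = new Queue("llm-jobs", { connection });

// Polling endpoint: once the job completes, its cached return value is the result.
app.get("/jobs/:id", async (request, reply) => {
  const { id } = request.params as { id: string };
  const job = await llmQueue.getJob(id);
  if (!job) return reply.code(404).send({ error: "unknown job" });
  const state = await job.getState();
  return { state, result: state === "completed" ? job.returnvalue : null };
});

// Streaming endpoint: subscribe to the job's Pub/Sub channel and forward tokens as SSE.
app.get("/jobs/:id/stream", async (request, reply) => {
  const { id } = request.params as { id: string };

  reply.hijack(); // we write to the raw response from here on
  reply.raw.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
    "X-Accel-Buffering": "no", // stop nginx buffering the stream (see below)
  });

  // A subscribed ioredis connection can't run other commands, so it gets its own connection.
  const subscriber = new Redis(connection);
  await subscriber.subscribe(`job:${id}:tokens`);

  subscriber.on("message", (_channel, token) => {
    if (token === "[DONE]") {
      reply.raw.write("event: done\ndata: [DONE]\n\n");
      reply.raw.end();
    } else {
      reply.raw.write(`data: ${JSON.stringify(token)}\n\n`);
    }
  });

  // Clean up the Redis subscriber as soon as the client disconnects.
  request.raw.on("close", () => {
    subscriber.quit().catch(() => {});
  });
});
```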

Why This Works
The worker doesn’t know or care who is listening – it publishes to a Redis Pub/Sub channel named after the job ID. The SSE endpoint subscribes and forwards messages to the browser. If the client disconnects, we clean up the subscriber immediately so we don’t leak Redis connections. If the client reconnects while the job is still running, it picks up from wherever the stream is. If the job already completed, polling the status URL returns the cached result – which is exactly why removeOnComplete: { age: 3600 } in the producer matters.
Nginx Buffering Will Break Your Stream
If you deploy behind nginx or a similar reverse proxy, set X-Accel-Buffering: no on the SSE response. Without it, nginx buffers the stream and the browser sees all tokens arrive at once at the end. Put it in the boilerplate.
Results: What Fixed When We Moved OpenAI Calls Behind a Queue
We redeployed the summarization endpoint as a queue-based flow within two sprints. Three things changed measurably:
- LLM cost dropped back to budget. Idempotent enqueue eliminated the double-billing that retries were producing. Wasted spend from 429 cascades went to zero because retries now happen inside the worker on a provider-friendly backoff.
- Deploy windows stopped dropping work. Because job state lives in Redis, restarting the API mid-burst no longer loses in-flight work. The worker can be restarted too – BullMQ marks its jobs as stalled and another worker picks them up.
- We got observability we didn’t have before. Queue depth, active job count, retry count per job, and time-to-first-token are all visible now. “The LLM is slow,” “the provider is rate-limiting us,” and “our own code is failing” are finally distinguishable.
That’s the arc of the migration. What follows is reference material – when to pick BullMQ over the alternatives, and the seven questions we get most often from teams doing the same thing.
BullMQ vs SQS vs Temporal vs Cloud Tasks: Choosing a Queue for AI Workloads
BullMQ isn’t the only option, and it isn’t right for every team. Here’s how we think about the tradeoff:
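| Queue option | Choose it when |
| --- | --- |
| BullMQ + Redis | You’re on Node.js, already running Redis, and handling under 10,000 jobs per minute |
| AWS SQS + Lambda | You’re on AWS serverless and want pay-per-invocation workers for intermittent load |
| Temporal | The workload is a multi-step AI workflow with conditional retries, waits, or human approvals |
| Google Cloud Tasks | You’re GCP-native and don’t want to run Redis |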

Rules of thumb:
- On Node.js with Redis already running? BullMQ is the lowest-friction option.
- On AWS serverless, want pay-per-invocation workers? SQS + Lambda is very clean.
- Multi-step AI workflows – retrieve, generate, evaluate, conditionally retry, wait on human approval? Look at Temporal before you build it on a job queue.
- On GCP? Cloud Tasks is simplest if you don’t want to run Redis.
- Calling the provider directly from an API route in production? The cost of migrating is less than the cost of the next incident. Do it this sprint, not next quarter.
Key Takeaways
- Choose BullMQ + Redis when you are on Node.js, already running Redis, and handling under 10,000 jobs per minute.
- Choose Temporal when your AI workload is a multi-step workflow with conditional retries, waits, or human approvals – not a single LLM call.
- Choose AWS SQS + Lambda for serverless, intermittent loads on AWS; choose Google Cloud Tasks if you are GCP-native and want to avoid running Redis.
- Never call the provider directly from a production API route – every team that ships AI features hits the same load-balancer-timeout incident within weeks.
Conclusion: Queues Are the Default Shape, Not the Exception
Queues aren’t a scaling hack. They’re the correct shape for any call that takes longer than a user will wait. LLM calls are just the most obvious current example. Once you separate the request from the work, every other problem gets smaller: retries stop double-charging, rate limits become enforceable at the right layer, workers scale independently of the API, and users see progressive feedback instead of a blank spinner.
The pattern above is production-tested, deliberately minimal, and composes well with everything you’ll want to add later – evaluation harnesses, cost dashboards, prompt versioning, request-level tracing. None of that is easier to add on top of a synchronous endpoint; all of it gets easier once the queue is there.
The mental shift is the hard part. Once you stop thinking of LLM calls as “slow API calls” and start treating them as background jobs that happen to return a result, the right architecture stops being a debate. Timeouts are only the symptom; the synchronous request-response cycle is the real issue.
Our team at ScriptsHub Technologies works with product and engineering teams across the US, UK, and India to design and deliver production-grade AI systems on Node.js, Python, and cloud-native stacks. If your AI features are outgrowing your API routes, start here: scriptshub.net
Frequently Asked Questions
Q: Why not just increase the load balancer timeout for OpenAI calls?
A: Because timeouts treat the symptom, not the cause. Longer timeouts still lose in-flight jobs on restarts, can’t prioritize work, and can’t control browser, CDN, or mobile SDK retry behavior. Queues solve all three.
Q: Does BullMQ support streaming LLM responses to the browser?
A: No – BullMQ is a job queue, not a transport layer. The worker publishes each token to a Redis Pub/Sub channel, and a Fastify SSE endpoint subscribes and forwards events to the browser.
Q: How does BullMQ handle OpenAI rate limits (429s)?
A: Two layers. The worker’s limiter option enforces a per-queue token bucket to stay under provider quotas. For transient 429s that slip through, attempts: 3 with exponential backoff retries automatically.
Q: When should I use Temporal instead of BullMQ for AI workloads?
A: Use Temporal for multi-step AI workflows with conditional retries, waits, or human approvals. Use BullMQ when the work is one LLM call plus retry plus result – far less to learn and operate.
Q: Can I use the same Redis instance for BullMQ and my application cache?
A: Yes, but use separate databases (db: 0 for cache, db: 1 for BullMQ), or separate Redis instances if throughput is high. Keep in mind that BullMQ expects maxmemory-policy set to noeviction so job state is never evicted under memory pressure – a setting that conflicts with typical cache configurations, which is the strongest argument for a separate instance.
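A minimal sketch of that split, assuming ioredis and BullMQ (hosts and names are illustrative):

```typescript
import Redis from "ioredis";
import { Queue } from "bullmq";

// Application cache on db 0
const cacheRedis = new Redis({ host: "localhost", port: 6379, db: 0 });

// BullMQ on db 1, on its own connection
const queueRedis = new Redis({
  host: "localhost",
  port: 6379,
  db: 1,
  maxRetriesPerRequest: null, // BullMQ requires this on connections its workers use
});

const llmQueue = new Queue("llm-jobs", { connection: queueRedis });
```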
Q: What is BullMQ used for in Node.js applications?
A: BullMQ is a Redis-backed job queue for Node.js. It runs background work like LLM completions, email, image processing, and scheduled jobs outside the request cycle, with retries, rate limiting, and concurrency built in.
Q: Is BullMQ production-ready for AI workloads?
A: Yes. BullMQ handles AI workloads under 10,000 jobs per minute in production, with native concurrency, retries, rate limiting, and idempotency. For multi-step workflows with human approvals, use Temporal instead.


