From 10 seconds to 3: refactoring synchronous Lambda APIs into async workflows
At some point, every team running bulk-data APIs on AWS Lambda hits the same wall: the function works fine for one record, struggles with a hundred, and starts timing out at a few thousand. You raise the timeout. It still times out. You bump the memory. Costs go up, latency stays bad. You wonder if Lambda was the right choice in the first place.
It usually is. The problem is rarely Lambda — it's the shape of the API. On a recent enterprise engagement, we took a synchronous bulk-ingestion endpoint from a P95 of over 10 seconds down to under 3 seconds, by changing how the work was structured rather than how it was processed. This post walks through the pattern, the tradeoffs, and the gotchas we hit shipping it in production.
The setup: when bulk gets bulky
The starting point looked like a thousand other backend systems: a Python Lambda behind API Gateway, accepting a JSON payload of records, writing them to a data warehouse (Redshift), and returning a summary.
import json

# The naive synchronous version
def handler(event, context):
    # API Gateway's proxy integration delivers the body as a JSON string
    records = json.loads(event["body"])["records"]  # could be 1, could be 50,000
    for record in records:
        validate(record)
        transform(record)
        write_to_warehouse(record)
    return {
        "statusCode": 200,
        "body": json.dumps({"status": "ok", "count": len(records)}),
    }
For small batches, this is fine. As volumes grew, two things happened: response times crept past 10 seconds, and downstream services that called this API started hitting their own timeouts. Some payloads were getting rejected midway through processing because the API Gateway 29-second hard limit kicked in.
Why the obvious fixes don't work
The instinct is to scale vertically:
- Raise the Lambda timeout — Lambda allows up to 15 minutes, but API Gateway caps at 29 seconds. So this only helps invocations that don't go through API Gateway.
- Bump memory / CPU — works to a point, but cost scales linearly and you hit diminishing returns fast on database-bound workloads.
- Parallelise inside the function — multiprocessing doesn't work the way you'd expect in Lambda (the sandbox lacks /dev/shm, so multiprocessing.Pool and Queue fail outright). Threads can help for I/O-bound work, but you're still capped by the function's lifetime.
- Add caching — useful for reads, irrelevant for write-heavy bulk ingestion.
None of these address the root cause: you're holding an HTTP connection open while doing arbitrarily long work. The fix isn't to make the work faster — it's to stop holding the connection open.
The pattern: separate the request from the work
The shape we want is the one used by every long-running API in the wild — from S3 multipart uploads to Stripe's webhook retries:
- Client POSTs the bulk payload
- API immediately returns a job_id and a status URL — typically in well under a second
- A worker Lambda processes the actual records in the background
- Client polls (or subscribes to) the status URL until the job completes (a client-side sketch follows the diagram below)
This is "request acceptance" decoupled from "work execution." Three Lambdas, three responsibilities:
┌─────────────┐ ┌──────────────┐ ┌──────────────┐
│ Submit API │ ──▶ │ Worker │ ──▶ │ Status API │
│ (returns │ │ (processes, │ │ (reads job │
│ job_id) │ │ updates │ │ state from │
│ │ │ state) │ │ S3) │
└─────────────┘ └──────────────┘ └──────────────┘
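From the client's side, the new contract is "submit, then poll." Here's a minimal sketch using the requests library; the POST /jobs path and the base URL are assumptions for illustration, not part of the API above:

import time

import requests

BASE_URL = "https://api.example.com"  # placeholder host

def submit_and_wait(records, poll_interval=2.0, timeout=600):
    # Submit returns immediately with 202 Accepted and a job_id
    resp = requests.post(f"{BASE_URL}/jobs", json={"records": records})
    resp.raise_for_status()
    status_url = BASE_URL + resp.json()["status_url"]

    # Poll until the job reaches a terminal state
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = requests.get(status_url).json()
        if state.get("status") in ("complete", "failed"):
            return state
        time.sleep(poll_interval)
    raise TimeoutError(f"job did not finish within {timeout}s")

Webhooks or WebSockets can replace the polling loop, but for most clients a simple poll against a cheap, cacheable status endpoint is hard to beat.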
Implementation: tracking jobs with S3
For the state store, we used S3 with a per-job key. DynamoDB is the more obvious choice, but S3 has a few practical advantages for this pattern:
- It's already there — most teams already have an S3 bucket they're writing artifacts to.
- Atomic writes per object — one PUT replaces the previous state.
- Free reads from CloudFront — if you're polling status from a frontend, you can serve it through a CDN with negligible cost.
- Versioning — built-in audit trail of every state transition, if you enable bucket versioning.
The submit handler becomes minimal:
import json
import os
import uuid

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Bucket and queue names come from the function's environment
STATE_BUCKET = os.environ["STATE_BUCKET"]
PAYLOAD_BUCKET = os.environ["PAYLOAD_BUCKET"]
WORKER_QUEUE_URL = os.environ["WORKER_QUEUE_URL"]

def submit_handler(event, context):
    job_id = str(uuid.uuid4())
    payload = json.loads(event["body"])

    # Persist initial state
    s3.put_object(
        Bucket=STATE_BUCKET,
        Key=f"jobs/{job_id}/state.json",
        Body=json.dumps({
            "status": "queued",
            "total": len(payload["records"]),
            "processed": 0,
        }),
        ContentType="application/json",
    )

    # Persist the payload (don't put large data in SQS)
    s3.put_object(
        Bucket=PAYLOAD_BUCKET,
        Key=f"jobs/{job_id}/payload.json",
        Body=json.dumps(payload),
    )

    # Trigger the worker via SQS
    sqs.send_message(
        QueueUrl=WORKER_QUEUE_URL,
        MessageBody=json.dumps({"job_id": job_id}),
    )

    return {
        "statusCode": 202,
        "body": json.dumps({
            "job_id": job_id,
            "status_url": f"/jobs/{job_id}",
        }),
    }
Note the 202 Accepted status — the right HTTP code for "I've taken your request, I'll do the work later." Returning 200 OK here is technically wrong; most clients won't care, but linters and API consumers that pay attention to status codes will appreciate the precision.
The worker
The worker Lambda is triggered by SQS, picks up the payload from S3, processes records, and updates state as it goes:
def worker_handler(event, context):
    for sqs_message in event["Records"]:
        body = json.loads(sqs_message["body"])
        job_id = body["job_id"]

        # load_payload / update_state are small helpers over S3
        # (sketched just below)
        payload = load_payload(job_id)
        update_state(job_id, status="running", processed=0)

        for i, record in enumerate(payload["records"]):
            try:
                validate(record)
                transform(record)
                write_to_warehouse(record)
            except Exception as e:
                update_state(job_id, status="failed", error=str(e), failed_at=i)
                raise  # let SQS retry

            # Periodic state updates — not every record
            if (i + 1) % 100 == 0:
                update_state(job_id, processed=i + 1)

        update_state(job_id, status="complete", processed=len(payload["records"]))
Gotcha: don't update state on every single record. S3 PUTs cost money and add latency. Update every N records (we used 100), and always on terminal states (complete, failed).
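For completeness, here's roughly what those helpers look like: thin wrappers over S3, assuming the same module-level s3 client and bucket names as the submit handler.

def load_payload(job_id):
    obj = s3.get_object(
        Bucket=PAYLOAD_BUCKET,
        Key=f"jobs/{job_id}/payload.json",
    )
    return json.loads(obj["Body"].read())

def update_state(job_id, **fields):
    # Read-modify-write of the whole state document. Last writer wins,
    # which is acceptable here because only one worker touches a given job.
    try:
        obj = s3.get_object(Bucket=STATE_BUCKET, Key=f"jobs/{job_id}/state.json")
        state = json.loads(obj["Body"].read())
    except s3.exceptions.NoSuchKey:
        state = {}
    state.update(fields)
    s3.put_object(
        Bucket=STATE_BUCKET,
        Key=f"jobs/{job_id}/state.json",
        Body=json.dumps(state),
        ContentType="application/json",
    )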
The status endpoint
The status endpoint is the simplest of the three — it's just a read from S3:
def status_handler(event, context):
    job_id = event["pathParameters"]["job_id"]
    try:
        obj = s3.get_object(
            Bucket=STATE_BUCKET,
            Key=f"jobs/{job_id}/state.json",
        )
        return {
            "statusCode": 200,
            # API Gateway expects a string body, so decode the bytes
            "body": obj["Body"].read().decode("utf-8"),
        }
    except s3.exceptions.NoSuchKey:
        return {"statusCode": 404, "body": '{"error": "job not found"}'}
What changed (the numbers)
For the engagement this pattern came from, the headline numbers were:
- P95 response time: ~10 seconds → under 3 seconds (~70% reduction)
- Throughput under load: ~492 records/sec sustained, validated with a custom load-testing framework simulating 50 concurrent clients
- API Gateway timeouts: eliminated entirely — the submit endpoint never holds a connection open long enough to time out
Equally important but less quoted: the system became predictable. Failures moved from "the request hung and we don't know what happened" to "this specific record failed at this step, here's the error in S3."
Gotchas, in the order we hit them
1. Cold starts on the worker
SQS-triggered Lambdas inherit Lambda's cold-start behavior. For light workloads, you'll see the first request take a second longer than steady state. For most bulk-ingestion use cases, this doesn't matter — the user already knows the work is async. But if you're chaining many small jobs, configure provisioned concurrency on the worker.
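That's a one-time configuration against a published version or alias. A sketch with boto3, where the function name and alias are placeholders:

import boto3

lambda_client = boto3.client("lambda")

# Keep five execution environments warm for the worker.
# Provisioned concurrency attaches to a version or alias, never $LATEST.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="bulk-worker",  # placeholder
    Qualifier="live",            # placeholder alias
    ProvisionedConcurrentExecutions=5,
)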
2. Idempotency
SQS guarantees at-least-once delivery, not exactly-once. Your worker has to handle the case where the same job_id arrives twice. We used the state file as the lock: if state is already "running" or "complete," skip.
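In code, that's a guard at the top of the worker, before any records are processed. A sketch, assuming a load_state helper that mirrors update_state:

def already_handled(job_id):
    # Skip duplicate SQS deliveries: another invocation owns this job
    state = load_state(job_id)  # hypothetical helper: reads state.json from S3
    return state.get("status") in ("running", "complete")

Note this is best-effort rather than a true lock: two deliveries landing close together can both read "queued." If you need stronger guarantees, a conditional write (S3's If-None-Match or a DynamoDB conditional put) is the usual upgrade.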
3. Payload size
Don't put the bulk payload in the SQS message. SQS has a 256 KB message limit, and you'll hit it surprisingly fast. Persist the payload to S3 in the submit handler; pass only the job_id through SQS.
4. State update granularity
We mentioned this above but it's worth its own callout. Updating state on every record turns a 10,000-record job into 10,000 S3 PUTs. Bucket every N records. Pick N based on how granular you want the progress UI to feel — 100 is a sane default.
5. Database connections
If your worker hits a relational database directly (Postgres, MySQL), use a connection layer that's appropriate for the runtime. We saw a ~40% execution-time improvement by switching from psycopg2 (persistent connection per invocation) to AWS's Redshift Data API (HTTP-based, no connection pool to manage). The right choice depends on your warehouse — but evaluate it.
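For reference, a Data API version of write_to_warehouse looks roughly like this. The cluster, database, and secret identifiers are placeholders, and in practice you'd batch records rather than insert them one at a time:

import json

import boto3

rsd = boto3.client("redshift-data")

def write_to_warehouse(record):
    # Sends the statement over HTTPS: no connection to open, no pool to manage.
    # execute_statement returns immediately with a statement Id that you can
    # check via describe_statement if you need confirmation.
    rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",   # placeholder
        Database="warehouse",                    # placeholder
        SecretArn="arn:aws:secretsmanager:...",  # placeholder; credentials live in Secrets Manager
        Sql="INSERT INTO events (id, payload) VALUES (:id, :payload)",
        Parameters=[
            {"name": "id", "value": str(record["id"])},
            {"name": "payload", "value": json.dumps(record)},
        ],
    )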
When NOT to use this pattern
Async-first APIs add complexity. Three Lambdas instead of one. State to manage. Polling logic on the client side. Don't reach for this pattern if:
- Your bulk operations consistently complete in well under 5 seconds
- The volume of bulk requests is low enough that the existing sync API isn't actually a bottleneck
- You're building an MVP and "fast enough" beats "architecturally pure"
The pattern earns its keep when bulk requests are unbounded in size, when downstream timeouts are biting you, or when you need to give users meaningful progress feedback on long-running work.
Wrapping up
Refactoring sync-to-async isn't always glamorous. It moves complexity rather than eliminating it — but it moves complexity to the right place: into the system rather than into the user's waiting state. For bulk-ingestion APIs at scale, it's almost always the right call.
The shortest path to a faster API isn't always faster code. Sometimes it's a different contract.
If you're building or refactoring a system like this and want a second pair of eyes — we do this kind of work as part of our engagements at Abhishree Labs. Book a 30-minute discovery call and we'll talk through it.
Got a Lambda problem you'd like a second opinion on?
If your APIs are slow, your costs are climbing, or your architecture is harder to maintain than it should be — let's talk.