Backend & APIs

Designing Resilient Webhook Queues for Meta WhatsApp Cloud API Integrations

Handling thousands of concurrent customer interactions requires solid event-driven middleware. Learn how to configure Node.js and PostgreSQL message queues with auto-retries.

Counselya Engineering · May 18, 2026 · 8 min read
Tags: WhatsApp API, Webhooks, Node.js, PostgreSQL, Event-Driven, CRM

When a real-estate developer in Patna deployed our WhatsApp CRM system, their weekend ad campaigns were generating 300–500 leads in a 6-hour window. Every single lead triggered a webhook event from Meta's Cloud API. Without a proper queue architecture, each of those events ran its database write, notification send, and CRM record creation inline, and during traffic bursts they all competed for the same resources at once. Three leads went missing. One broker was notified twice. Two follow-up messages were never sent. This article documents how we rebuilt their pipeline into a resilient, zero-leakage webhook queue system.

Why Raw Webhook Handlers Fail at Scale

A naive WhatsApp webhook handler processes events inline: receive request → write to database → send notification → respond with HTTP 200. This works fine for 10 events/minute. At 50 events/minute, you start seeing race conditions on database writes. At 200 events/minute, your server runs out of connection pool slots and Meta's webhook delivery starts receiving 500 errors — which triggers Meta's retry logic, sending the same event multiple times.

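  • Meta retries failed webhook deliveries with decreasing frequency (its WhatsApp Cloud API documentation describes retries continuing for up to 7 days), so a handler that errors under load receives the same event again and creates duplicate lead entries.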
  • Inline database writes block the HTTP response, increasing timeout risk.
  • Without idempotency keys, retried webhooks create phantom duplicate records in your CRM.
  • A single slow database query can cascade into a 30-second response time under concurrent load.
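The idempotency point above depends on the key being identical for every redelivery of the same event. A minimal sketch of one way to derive such a key follows. The helper name `deriveIdempotencyKey` and the trimmed `WebhookPayload` type are illustrative assumptions, not part of Meta's SDK; the sketch relies only on the `messages[].id` and `statuses[].id` / `statuses[].status` fields of the webhook payload.

```typescript
import { createHash } from 'node:crypto'

// Only the parts of Meta's webhook payload that this sketch reads.
interface WebhookPayload {
  entry?: Array<{
    changes?: Array<{
      value?: {
        messages?: Array<{ id?: string }>
        statuses?: Array<{ id?: string; status?: string }>
      }
    }>
  }>
}

// Hypothetical helper: returns the same key for every redelivery of one event.
// It never mixes in a timestamp, because a retry arrives at a different time.
function deriveIdempotencyKey(payload: WebhookPayload, rawBody: string): string {
  const value = payload.entry?.[0]?.changes?.[0]?.value
  const messageId = value?.messages?.[0]?.id
  const status = value?.statuses?.[0]

  let source: string
  if (messageId) {
    // Inbound message: the message ID identifies it.
    source = `message/${messageId}`
  } else if (status?.id) {
    // Status update: one message produces several statuses, so include the status.
    source = `status/${status.id}/${status.status ?? ''}`
  } else {
    // Anything else: fall back to the exact bytes Meta sent.
    source = `body/${rawBody}`
  }

  // Hashing gives a fixed-length hex key with no special characters.
  return createHash('sha256').update(source).digest('hex')
}
```

Hashing the source rather than using it directly keeps the key a fixed length and free of characters that a queue library might reject in a job ID.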

Architecture: The Three-Layer Queue Pattern

The solution is a three-layer architecture: an ingestion layer that acknowledges Meta's webhook instantly, a queue layer that persists and orders events reliably, and a worker layer that processes events independently with retry logic. This completely decouples the HTTP response from the business logic processing.

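Webhook ingestion layer (target: acknowledge in under 200ms)

// Layer 1: Webhook Ingestion (Express route)
// Goal: Respond to Meta in < 200ms, no matter what

import crypto from 'crypto'
import express from 'express'
import { pgQueue } from './queue'

const app = express()
type RawBodyRequest = express.Request & { rawBody?: Buffer }

// Keep the raw bytes: the signature covers the exact body Meta sent, and
// JSON.stringify(req.body) is not guaranteed to reproduce it byte for byte
const parseJson = express.json({
  verify: (req, _res, buf) => { (req as RawBodyRequest).rawBody = buf },
})

app.post('/webhook/whatsapp', parseJson, async (req: RawBodyRequest, res) => {
  // 1. Verify Meta signature FIRST and reject unauthorized requests
  const rawBody = req.rawBody ?? Buffer.alloc(0)
  const signature = Buffer.from(req.get('x-hub-signature-256') ?? '')
  const expectedSig = Buffer.from('sha256=' + crypto
    .createHmac('sha256', process.env.META_WEBHOOK_SECRET!)
    .update(rawBody)
    .digest('hex'))

  // Constant-time comparison; timingSafeEqual throws on unequal lengths, so check first
  if (signature.length !== expectedSig.length || !crypto.timingSafeEqual(signature, expectedSig)) {
    return res.status(403).json({ error: 'Invalid signature' })
  }

  // 2. Derive a stable event ID for idempotency: a retried delivery must map to
  //    the same ID, so hash the message ID (or the raw body) and leave timestamps out
  const messageId = req.body.entry?.[0]?.changes?.[0]?.value?.messages?.[0]?.id
  const eventId = crypto.createHash('sha256').update(messageId ?? rawBody).digest('hex')

  // 3. Push to queue IMMEDIATELY and do not process inline
  try {
    await pgQueue.add('whatsapp-event', {
      eventId,
      payload: req.body,
      receivedAt: new Date().toISOString(),
    }, {
      jobId: eventId,           // Idempotency: same eventId = ignored duplicate
      attempts: 5,              // Up to 5 attempts in total
      backoff: { type: 'exponential', delay: 2000 },
    })
  } catch (err) {
    // Enqueue failed: return a non-200 so Meta redelivers the event later
    return res.status(500).json({ error: 'Queue unavailable' })
  }

  // 4. Respond to Meta instantly; the queue handles the rest
  res.status(200).json({ status: 'queued' })
})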
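It helps to work out what `attempts: 5` with `backoff: { type: 'exponential', delay: 2000 }` means in wall-clock time. The sketch below assumes the formula BullMQ documents for its built-in exponential strategy, `2^(attempts - 1) * delay`, with no jitter configured; `retryDelaysMs` is an illustrative helper, not a BullMQ function.

```typescript
// Illustrative helper: the wait before each retry, assuming the
// delay * 2^(failedAttempts - 1) schedule and no jitter.
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = []
  // "attempts" counts the first try too, so there are attempts - 1 retries.
  for (let failed = 1; failed < attempts; failed++) {
    delays.push(baseDelayMs * 2 ** (failed - 1))
  }
  return delays
}

const delays = retryDelaysMs(5, 2000)
const totalMs = delays.reduce((sum, d) => sum + d, 0)

console.log(delays)  // [ 2000, 4000, 8000, 16000 ]
console.log(totalMs) // 30000
```

Under these assumptions the four retries span roughly 30 seconds. An upstream outage that lasts longer than that will exhaust the retries, so either raise `attempts` or `delay` for dependencies that can be down for minutes, or rely on the dead letter handling described later in this article.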

PostgreSQL as a Durable Queue: pgmq vs. BullMQ on Redis

For regional deployments where running Redis is an added operational overhead, PostgreSQL itself can serve as a highly durable message queue through `pgmq` (a Postgres extension). The alternative is BullMQ, which is backed by Redis rather than PostgreSQL. We prefer BullMQ with Redis for high-throughput scenarios (>1000 events/hour) because Redis keeps queue state in memory, which keeps enqueue latency very low. For medium volumes (<500 events/hour), a PostgreSQL-backed queue is simpler to operate and provides full ACID guarantees.
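To make the PostgreSQL option concrete, here is a minimal sketch of a table-backed queue. It deliberately uses a plain table with `FOR UPDATE SKIP LOCKED` rather than the `pgmq` extension's own functions, so it is a stand-in for the idea, not the `pgmq` API. The `webhook_queue` table, its columns, and the `Queryable` interface are assumptions made for illustration.

```typescript
// Anything with a pg-style query method (such as a `pg` client) should fit.
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<{ rows: any[] }>
}

interface QueuedEvent {
  id: number
  event_id: string
  payload: unknown
  attempts: number
}

// Assumed table:
//   webhook_queue(id BIGSERIAL, event_id TEXT UNIQUE, payload JSONB,
//                 attempts INT DEFAULT 0, status TEXT DEFAULT 'pending',
//                 run_at TIMESTAMPTZ DEFAULT NOW())

// The UNIQUE constraint on event_id makes a redelivered webhook a no-op.
const ENQUEUE_SQL = `
  INSERT INTO webhook_queue (event_id, payload)
  VALUES ($1, $2)
  ON CONFLICT (event_id) DO NOTHING`

// SKIP LOCKED lets several workers poll at once without blocking each other
// or claiming the same row.
const CLAIM_SQL = `
  UPDATE webhook_queue
  SET status = 'processing', attempts = attempts + 1
  WHERE id = (
    SELECT id FROM webhook_queue
    WHERE status = 'pending' AND run_at <= NOW()
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING id, event_id, payload, attempts`

async function enqueue(db: Queryable, eventId: string, payload: unknown): Promise<void> {
  await db.query(ENQUEUE_SQL, [eventId, JSON.stringify(payload)])
}

// Returns the claimed event, or null when nothing is ready.
async function claimNext(db: Queryable): Promise<QueuedEvent | null> {
  const { rows } = await db.query(CLAIM_SQL)
  return rows[0] ?? null
}
```

A worker would call `claimNext` in a polling loop, then mark the row as done or reset it to `pending` with a later `run_at` on failure. That bookkeeping is what `pgmq` or BullMQ handles for you.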

Queue worker with idempotency and auto-retry

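// Layer 3: Queue Worker (processes events independently)
import { Worker } from 'bullmq'
import { Redis } from 'ioredis'
import { Pool } from 'pg'
import { sendBrokerNotification, sendAutoReply } from './notifications' // assumed module path

// BullMQ workers require maxRetriesPerRequest: null on the ioredis connection
const connection = new Redis(process.env.REDIS_URL!, { maxRetriesPerRequest: null })
const db = new Pool({ connectionString: process.env.DATABASE_URL }) // assumes DATABASE_URL is set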

const worker = new Worker('whatsapp-event', async (job) => {
  const { eventId, payload } = job.data
  
  // Check idempotency — skip if already processed
  const alreadyProcessed = await db.query(
    'SELECT 1 FROM processed_events WHERE event_id = $1',
    [eventId]
  )
  if (alreadyProcessed.rows.length > 0) {
    console.log(`[Queue] Skipping duplicate event: ${eventId}`)
    return
  }
  
  // Extract message from Meta's nested payload
  const messages = payload.entry?.[0]?.changes?.[0]?.value?.messages || []
  
  for (const message of messages) {
    // 1. Upsert lead record in CRM
    await db.query(`
      INSERT INTO leads (whatsapp_id, phone, message, source, created_at)
      VALUES ($1, $2, $3, 'whatsapp', NOW())
      ON CONFLICT (whatsapp_id) DO UPDATE SET last_message = EXCLUDED.message
    `, [message.from, message.from, message.text?.body])
    
    // 2. Notify sales broker via templated WhatsApp reply
    await sendBrokerNotification(message.from, message.text?.body)
    
    // 3. Auto-reply to lead with acknowledgment template
    await sendAutoReply(message.from, 'lead_acknowledgment_v2')
  }
  
  // Mark as processed — prevent duplicate processing on retry
  await db.query(
    'INSERT INTO processed_events (event_id, processed_at) VALUES ($1, NOW())',
    [eventId]
  )
}, { connection, concurrency: 5 })

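worker.on('failed', (job, err) => {
  console.error(`[Queue] Job ${job?.id} failed:`, err.message)
  // 'failed' fires on every failed attempt; retries are exhausted only once
  // attemptsMade reaches the configured attempts
  if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {
    // Final failure: send the Slack/WhatsApp alert and store the job for review here
  }
})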
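The worker above reads only `entry[0].changes[0]`. Because `entry`, `changes`, and `messages` are all arrays in Meta's payload, a single webhook may carry more than one of each. A small sketch of a flattening helper follows; `extractMessages` and the trimmed types are illustrative names, not part of any SDK.

```typescript
// Only the fields this article's worker uses.
interface InboundMessage {
  id: string
  from: string
  text?: { body?: string }
}

interface WebhookPayload {
  entry?: Array<{
    changes?: Array<{
      value?: { messages?: InboundMessage[] }
    }>
  }>
}

// Collects inbound messages from every entry and every change, in order.
// Status-only webhooks (no `messages` array) yield an empty list.
function extractMessages(payload: WebhookPayload): InboundMessage[] {
  const found: InboundMessage[] = []
  for (const entry of payload.entry ?? []) {
    for (const change of entry.changes ?? []) {
      found.push(...(change.value?.messages ?? []))
    }
  }
  return found
}
```

In the worker, `const messages = extractMessages(payload)` would replace the single optional-chaining lookup. Each returned message also carries its own `id`, which makes it possible to de-duplicate per message instead of per webhook if that suits your CRM better.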

Monitoring: Dead Letter Queues & Alert Routing

Even with retries, some events will fail permanently: a lead's phone number may be invalid, or an upstream CRM API may stay down for longer than the retry window. These failed events should never silently disappear. We implement a Dead Letter Queue (DLQ) where permanently failed events are stored with their full error trace, and we route an alert directly to the team's WhatsApp group via the Meta Business API.

  • On final failure (in BullMQ, a `failed` event where `job.attemptsMade` has reached the configured `attempts`), move the job to a `failed-events` table for manual review.
  • Route DLQ alerts via Meta's /messages API to an internal WhatsApp group — instant visibility.
  • Store the full job payload + error stack in the DLQ — enables one-click replay without data loss.
  • Set up a daily cron job to report DLQ depth — if > 0, it means something in your pipeline needs attention.
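The checklist above can be sketched as two small functions: one that decides whether a failure is final, and one that builds the row to store. `FailedJobLike` mirrors the `id`, `data`, `attemptsMade`, and `opts.attempts` fields of a BullMQ job; the `DeadLetterRecord` shape and both function names are assumptions for illustration.

```typescript
// The subset of a BullMQ job that this sketch needs.
interface FailedJobLike {
  id?: string
  data: unknown
  attemptsMade: number
  opts: { attempts?: number }
}

// Hypothetical row shape for the failed-events table.
interface DeadLetterRecord {
  jobId: string | null
  payload: string          // full job data as JSON, kept for replay
  errorMessage: string
  errorStack: string | null
  attemptsMade: number
  failedAt: string         // ISO 8601 timestamp
}

// True once the job has used every configured attempt (default: 1).
function isFinalFailure(job: FailedJobLike): boolean {
  return job.attemptsMade >= (job.opts.attempts ?? 1)
}

function toDeadLetterRecord(job: FailedJobLike, err: Error, now: Date = new Date()): DeadLetterRecord {
  return {
    jobId: job.id ?? null,
    payload: JSON.stringify(job.data),
    errorMessage: err.message,
    errorStack: err.stack ?? null,
    attemptsMade: job.attemptsMade,
    failedAt: now.toISOString(),
  }
}

// Possible wiring inside the worker's 'failed' handler:
//   if (job && isFinalFailure(job)) {
//     const record = toDeadLetterRecord(job, err)
//     // insert `record` into failed-events, then send the alert
//   }
```

Storing the payload as it was enqueued is what makes replay possible: re-adding `JSON.parse(record.payload)` to the queue reproduces the original job.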

The real-estate developer's WhatsApp pipeline went from 3 missing leads per weekend campaign to zero — not a single event lost across 8 subsequent campaigns averaging 400 leads each. The architecture is not complex, but it requires discipline: never process webhook events inline, always verify signatures, always use idempotency keys, and always route failures to a dead letter queue. If your current WhatsApp CRM setup processes events inline, you are almost certainly losing leads during high-volume windows — you just cannot see it happening.
