How WarmySender handles load
What this page covers
WarmySender is a 4-pillar outreach platform — Cold Emailing, Email Warmup, LinkedIn Outreach, and Multichannel sequences. Across these pillars we send tens of thousands of emails and LinkedIn actions per day on behalf of customers, and we ramp gracefully past traffic spikes without missing sends, over-sending against caps, or putting customer accounts at risk. This page documents the architecture, the safety guarantees, and what you should expect during peak times.
The short version:
- Our durable job store is the source of truth. Every send is a tracked job. If anything goes wrong (cache hiccup, worker crash, deployment), the job is still there and we recover automatically.
- An in-memory cache layer is the hot path. It handles atomic cap reservations, queue dispatching, and singleton client connections, following the cache practices listed later on this page.
- Recurring heals are evergreen. Two 6-hour ticks sweep for any prospects who hit edge-case races (late accepts arriving after a wait_accept timeout, or post-accept follow-ups that didn't enqueue) and self-heal without operator intervention.
- Account safety always wins. Caps fail-CLOSED if the cache layer is unavailable. Circuit breakers EXPIRE-and-defer instead of throw-and-retry. The platform never bursts past LinkedIn or email-provider rate limits, even when the queue is hot.
Queue architecture
Every send (email or LinkedIn) is materialized as a tracked job with status pending when it's first scheduled. A scheduler tick reads pending jobs whose run_at has arrived and pushes them onto a high-throughput dispatch queue. Worker processes pull jobs from the queue, attempt the send via the appropriate provider (SMTP for email; our LinkedIn integration for LinkedIn), and write the result back to durable storage.
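The flow above can be sketched in a few lines. This is a minimal illustration, not the platform's code: the SendJob class, the in-memory dict and deque standing in for the durable store and dispatch queue, and the completed/failed result statuses are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Deque, Dict, Optional


@dataclass
class SendJob:
    # Hypothetical shape of a tracked job row.
    job_id: str
    channel: str                       # "email" or "linkedin"
    run_at: datetime
    status: str = "pending"
    processing_started_at: Optional[datetime] = None


def scheduler_tick(store: Dict[str, SendJob], queue: Deque[str], now: datetime) -> int:
    """Push every pending job whose run_at has arrived onto the dispatch queue."""
    already_queued = set(queue)
    pushed = 0
    for job in store.values():
        if job.status == "pending" and job.run_at <= now and job.job_id not in already_queued:
            queue.append(job.job_id)
            pushed += 1
    return pushed


def worker_step(
    store: Dict[str, SendJob],
    queue: Deque[str],
    send: Callable[[SendJob], bool],
    now: datetime,
) -> Optional[str]:
    """Pull one job, attempt the send, and write the result back to the store."""
    if not queue:
        return None
    job = store[queue.popleft()]
    job.status = "processing"
    job.processing_started_at = now
    # Result statuses are assumptions; the text only says the result is written back.
    job.status = "completed" if send(job) else "failed"
    return job.job_id
```

The store stays authoritative throughout: the queue only carries job IDs, so losing the queue loses no work.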
The two-layer design (durable store for source-of-truth + cache layer for fast dispatching) means we get the best of both worlds:
- Durability. Durable rows survive worker crashes, deploy restarts, and cache-layer outages. Even if the cache layer goes down for an hour, the work is still there and resumes when it comes back.
- Throughput. The cache layer lets us dispatch jobs at sub-millisecond latency. The queue manager handles retries, backoff, and concurrency limits. We don't have to scan the durable store for every dispatch.
- Recovery. If a worker dies mid-send, the job's processing_started_at timestamp lets a watchdog recover it (we flip stale processing rows back to pending after a configurable timeout, typically 15 minutes for email, longer for LinkedIn). Post-batch silent drops also auto-heal: a 5-second probe after every batch dispatch verifies every job ID still exists in the queue, and if any are missing, the corresponding durable rows flip back to pending so the next sync tick re-enqueues them.
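The two recovery paths in the last bullet can be sketched as follows. This is an illustration under stated assumptions: the JobRow class, the 45-minute LinkedIn timeout (the text only says "longer"), and the queued status used by the probe are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional, Set

# The email timeout comes from the text; the LinkedIn value is an assumed placeholder.
STALE_AFTER = {"email": timedelta(minutes=15), "linkedin": timedelta(minutes=45)}


@dataclass
class JobRow:
    job_id: str
    channel: str
    status: str
    processing_started_at: Optional[datetime] = None


def recover_stale(rows: Iterable[JobRow], now: datetime) -> List[str]:
    """Watchdog: flip processing rows older than the timeout back to pending."""
    recovered = []
    for row in rows:
        if row.status != "processing" or row.processing_started_at is None:
            continue
        if now - row.processing_started_at > STALE_AFTER[row.channel]:
            row.status = "pending"
            row.processing_started_at = None
            recovered.append(row.job_id)
    return recovered


def heal_silent_drops(
    rows: Dict[str, JobRow], dispatched_ids: Iterable[str], ids_in_queue: Set[str]
) -> List[str]:
    """Post-batch probe: any dispatched job missing from the queue goes back to pending.

    Only rows still in the (hypothetical) "queued" status are reset, so a job a
    worker has already picked up is left alone.
    """
    healed = []
    for job_id in dispatched_ids:
        row = rows[job_id]
        if job_id not in ids_in_queue and row.status == "queued":
            row.status = "pending"
            healed.append(job_id)
    return healed
```

Both functions are safe to run repeatedly: a row that has already been flipped back to pending no longer matches either condition.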
For a deeper dive on the Phase 1 (in-process) vs Phase 2 (durable-queue) auto-switch architecture, see the main documentation.
Cap enforcement
LinkedIn campaigns enforce per-action daily caps via an atomic counter script. The full design is documented in How invite, message, and InMail caps work together; here's the load-handling summary:
- Atomic increment with rollback. Each campaign + action_type + UTC date has a counter. Before any send, we run a script that atomically increments the counter and rolls back if the new count exceeds the cap. Two concurrent workers cannot both be allowed past the boundary — the script runs as one atomic op.
- Cold-start backfill from durable storage. If the counter is missing (cache eviction, fresh deploy), we count completed sends from the durable store for the matching action+day and seed the counter with TTL-to-midnight-UTC. Subsequent increments run on the warm counter.
- Fail-CLOSED on cache-layer outage. If the cache layer is unavailable, the cap helper returns circuit-open and the caller falls back to a durable-store count gate. Per CLAUDE.md "account safety always wins": banned-domain risk is greater than missed-send risk, so we err on the side of NOT sending during a cap-system outage.
- TTL-to-midnight-UTC. Every cap counter auto-expires at the day rollover, so we don't accumulate stale keys.
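The four behaviours above fit together roughly like this. The sketch uses an in-memory stand-in for the cache layer, with a lock playing the role of the atomic script; the CapCounters class, the allow_send helper, and the "reserved" and "cap-reached" return values are hypothetical names, while circuit-open comes from the text.

```python
import threading
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Tuple

CountFn = Callable[[str, str, str], int]  # (campaign_id, action_type, utc_day) -> completed sends


def seconds_until_utc_midnight(now: datetime) -> int:
    """TTL for a cap counter so that it expires at the UTC day rollover."""
    now = now.astimezone(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
    return max(1, int((next_midnight - now).total_seconds()))


class CapCounters:
    def __init__(self, count_completed: CountFn):
        self._lock = threading.Lock()      # stands in for the atomic script
        self._counts: Dict[Tuple[str, str, str], Tuple[int, datetime]] = {}
        self._count_completed = count_completed
        self.available = True              # flip to False to simulate a cache outage

    def reserve(self, campaign_id: str, action_type: str, cap: int, now: datetime) -> str:
        if not self.available:
            return "circuit-open"
        day = now.astimezone(timezone.utc).date().isoformat()
        key = (campaign_id, action_type, day)
        with self._lock:
            entry = self._counts.get(key)
            if entry is None or entry[1] <= now:
                # Cold start: seed the counter from completed sends in the durable store.
                count = self._count_completed(campaign_id, action_type, day)
                expires_at = now + timedelta(seconds=seconds_until_utc_midnight(now))
            else:
                count, expires_at = entry
            count += 1                     # increment ...
            if count > cap:
                count -= 1                 # ... and roll back if the cap is exceeded
                self._counts[key] = (count, expires_at)
                return "cap-reached"
            self._counts[key] = (count, expires_at)
            return "reserved"


def allow_send(counters: CapCounters, count_completed: CountFn,
               campaign_id: str, action_type: str, cap: int, now: datetime) -> bool:
    result = counters.reserve(campaign_id, action_type, cap, now)
    if result == "circuit-open":
        # Durable-store count gate: still blocks at or above the cap.
        day = now.astimezone(timezone.utc).date().isoformat()
        return count_completed(campaign_id, action_type, day) < cap
    return result == "reserved"
```

Because the increment and the rollback happen inside one critical section, two concurrent callers can never both receive "reserved" for the last remaining slot. The fallback gate only reads a count, so it gives up that guarantee while the cache is down.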
For email campaigns, daily limits are enforced at the mailbox-pacer layer with a similar atomic-counter pattern keyed on mailbox + UTC date. The fail-CLOSED behavior is identical: if the pacer can't talk to the cache layer, we defer the send rather than burst.
Recurring heals (the evergreen sweepers)
Some race conditions cannot be fully eliminated at write time — LinkedIn webhook delivery skew, post-accept pipeline transients, timing-window edge cases at the boundary between wait_accept timeout and accept arrival. Rather than hand-waving "it's eventually consistent," we run two recurring heal sweeps every 6 hours that find any prospect stuck in one of these edge cases and self-heal:
- Late-accepts heal. Sweeps for prospects whose accept webhook arrived AFTER the matcher had already excluded them (per late-accept SAFE design). For each, stamps late_accept_observed_at on a separate column that no forward-mover reads, and (if the campaign has a configured post-accept step) enqueues ONE follow-up message via the existing campaign_send_jobs pipeline so caps/ramps/cooldowns are still enforced. The user-visible result: the dashboard shows the late accept and the follow-up fires at the next valid sending slot.
- Stuck post-accepts heal. Sweeps for prospects with linkedin_accepted_at populated whose post-acceptance follow-up never enqueued (transient pipeline failure, expired job that wasn't replaced). Dispatches one follow-up per row via the canonical pipeline insert path.
Both sweeps:
- Make zero new LinkedIn API calls directly — they only stamp internal sentinels and enqueue follow-ups via the existing send pipeline.
- Are idempotent: each row has a WHERE-NULL guard on its idempotency column (late_accept_observed_at IS NULL / post_accept_stuck_followup_sent_at IS NULL), so two ticks back-to-back never double-process.
- Are capped at 200 rows per tick as defense-in-depth against runaway cohort growth.
- Write a per-tick audit row to linkedin_events with ll_ref=LL#329 so ops can grep and verify.
- Stagger their first ticks 30 minutes apart so they never both fire on the same wall-clock minute.
- Can be disabled via LINKEDIN_RECURRING_HEALS_ENABLED=false as an off-switch (default ON).
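A minimal sketch of one late-accepts tick, using SQLite in place of the real durable store. The prospects table, the late_accept_pending flag, and every column layout are assumptions made for the example; late_accept_observed_at, campaign_send_jobs, linkedin_events, ll_ref=LL#329, the 200-row cap, and the environment variable come from the text. The check for a configured post-accept step is omitted for brevity.

```python
import sqlite3
from typing import Mapping

MAX_ROWS_PER_TICK = 200


def heals_enabled(env: Mapping[str, str]) -> bool:
    """Off-switch: enabled unless the variable is explicitly set to false."""
    return env.get("LINKEDIN_RECURRING_HEALS_ENABLED", "true").strip().lower() != "false"


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical, simplified schema for illustration only.
    conn.executescript("""
        CREATE TABLE prospects (id INTEGER PRIMARY KEY,
                                late_accept_pending INTEGER,
                                late_accept_observed_at TEXT);
        CREATE TABLE campaign_send_jobs (id INTEGER PRIMARY KEY,
                                         prospect_id INTEGER, status TEXT);
        CREATE TABLE linkedin_events (id INTEGER PRIMARY KEY,
                                      kind TEXT, ll_ref TEXT, rows_healed INTEGER);
    """)


def late_accepts_heal_tick(conn: sqlite3.Connection, now_iso: str) -> int:
    rows = conn.execute(
        "SELECT id FROM prospects "
        "WHERE late_accept_pending = 1 AND late_accept_observed_at IS NULL "
        "ORDER BY id LIMIT ?",
        (MAX_ROWS_PER_TICK,),
    ).fetchall()
    healed = 0
    for (prospect_id,) in rows:
        # WHERE-NULL guard: only the first tick to reach a row can stamp it.
        cur = conn.execute(
            "UPDATE prospects SET late_accept_observed_at = ? "
            "WHERE id = ? AND late_accept_observed_at IS NULL",
            (now_iso, prospect_id),
        )
        if cur.rowcount == 1:
            # Enqueue through the normal pipeline so caps, ramps, and cooldowns apply.
            conn.execute(
                "INSERT INTO campaign_send_jobs (prospect_id, status) VALUES (?, 'pending')",
                (prospect_id,),
            )
            healed += 1
    # Per-tick audit row, written even when nothing was healed.
    conn.execute(
        "INSERT INTO linkedin_events (kind, ll_ref, rows_healed) "
        "VALUES ('late_accepts_heal', 'LL#329', ?)",
        (healed,),
    )
    conn.commit()
    return healed
```

Running the tick twice in a row is harmless: the second run finds no rows with a NULL sentinel, enqueues nothing, and only writes its audit row.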
Why 6 hours and not faster? Because the cohorts these sweeps target are small (typically 0–20 rows platform-wide at any given time after a deploy) and the underlying live engine is already self-healing for the common case. Six hours is a sweet spot between "fast enough that customers don't notice the gap" and "infrequent enough that we never thrash the database."
Cache-layer best practices we follow
We follow industry-standard best practices for high-throughput cache use across every cache-touching code path:
- One singleton connection per process. We never create a new cache client per request, per tick, or per worker — that would burn connection slots quickly. The shared client is set up at module load with bounded retries, TCP keepalive to prevent idle eviction, and offline-queue disabled (reject commands when disconnected rather than silently buffer).
- Bounded retry strategy. The retry function never permanently kills the connection; it always retries with a capped, linearly increasing backoff (2s, 4s, 6s, ... cap 30s).
- TTL on every key. Cap counters expire at midnight UTC. Lock keys have explicit TTLs (5–30 minutes depending on the lock). Metric counters expire after 7 days. We never accumulate unbounded keys.
- Atomic set-if-not-exists. Distributed locks use single-command atomic set-with-TTL primitives — no race where the value is set but the TTL is missed.
- No full-keyspace scans in production hot path. Every cache read is by exact key, by atomic increment/decrement, or by atomic script.
- Pipelining for batch operations. Where appropriate (e.g., enqueueing many jobs at once), we pipeline commands so we get one round-trip per batch instead of one per command.
- Circuit breakers fail-CLOSED. When the cache layer is unavailable, every cap-touching code path returns "circuit-open" and falls back to a durable-store-only gate that still blocks at-or-above the cap. Per CLAUDE.md "account safety always wins."
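Two of the practices above are easy to show in isolation: the bounded retry delay and the set-if-not-exists lock with a TTL. The sketch below is a single-process stand-in, and the names retry_delay_seconds and TtlLocks are hypothetical. The page does not name the cache product; Redis's SET with the NX and EX options is one well-known example of the single-command primitive described.

```python
import time
from typing import Callable, Dict, Tuple


def retry_delay_seconds(attempt: int) -> int:
    """Delay before reconnect attempt N (1-based): 2s, 4s, 6s, ... capped at 30s.

    It always returns a delay, so the connection is never given up on.
    """
    return min(2 * attempt, 30)


class TtlLocks:
    """In-memory stand-in for a set-if-not-exists lock that always carries a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._locks: Dict[str, Tuple[str, float]] = {}   # key -> (owner, expires_at)

    def acquire(self, key: str, owner: str, ttl_seconds: float) -> bool:
        now = self._clock()
        held = self._locks.get(key)
        if held is not None and held[1] > now:
            return False                   # lock held and not yet expired
        # Value and expiry are written together, so a lock can never exist without a TTL.
        self._locks[key] = (owner, now + ttl_seconds)
        return True

    def release(self, key: str, owner: str) -> bool:
        held = self._locks.get(key)
        if held is not None and held[0] == owner:
            del self._locks[key]
            return True
        return False                       # only the current owner may release
```

A recurring tick would call acquire with its own key before doing any work and skip the run when it returns False; if the holder crashes, the TTL frees the lock without operator action.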
Load capacity limits
The platform's current sustained throughput envelope (single deployment region):
- Email sending — capped at the customer's mailbox-level daily limits (typically 30–80/day per Gmail/Outlook mailbox under warmup). Aggregate platform throughput across customers comfortably handles 100,000+ sends per hour with sub-second pacing latency.
- LinkedIn sending — capped at LinkedIn-cited per-account limits (typically 80–100 invites/day for paid accounts). Aggregate throughput is bounded by the integration's own rate limits, which we honor with conservative client-side spacing.
- Webhook ingestion — Our LinkedIn integration delivers webhook events at up to ~10 events/second per workspace. Our handler queue absorbs bursts without dropping events; the inbox-sync queue catches anything that lands during a deploy or transient outage.
- Database writes — Our durable store comfortably handles the platform's current write rate (single-digit thousands of writes per minute at peak). The schema is indexed on every hot-path query so reads stay fast even on multi-million-row tables.
If you ever feel like sends are slower than expected, check the campaign-not-sending diagnosis page first — most "slow" reports trace back to the campaign's configured cap, sending window, or per-account ramp ceiling, not to platform-wide load. We monitor queue depth + processing latency + cap-block rate per workspace, and we'd see the issue before you do if it were systemic.
What "deferred" status means
If you see a campaign or prospect in deferred status, it means the engine tried to send but a safety gate said "not now." Common reasons:
- Daily cap reached. The campaign's per-action cap (invites, messages, or InMails) hit its limit for the UTC day. Sends defer to tomorrow's first sending-window slot. This is the most common deferred reason and is exactly what we want — better deferred than burst.
- Account ramp ceiling. The LinkedIn account is in its first weeks of activity, so the ramp schedule limits daily volume to 10–40/day (gradually increasing). Sends defer to the next day.
- Sending window closed. The campaign's configured sending window (e.g., 9:00–17:00 IST) is closed at the moment the engine tried. Sends queue for the next valid window slot.
- Per-prospect cooldown. The same prospect was sent something within the last N hours; the spacing rule prevents back-to-back sends.
- Mailbox/account paused. The mailbox or LinkedIn account is in paused, cooldown, or disconnected state. Resume the account or wait for cooldown to lift.
Deferred sends are never lost. They remain in the durable job store with status='pending' and run_at set to the next valid slot. The next scheduler tick picks them up and dispatches when the gate clears. If you want to see what's blocking, the campaign detail page has a "Why deferred?" section that lists the active gates per pending send.
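As an illustration of how run_at could be moved to the next valid slot, here is a small sketch for a same-day sending window such as 9:00 to 17:00 IST. The function names are hypothetical, the window is assumed not to cross midnight, and reading "tomorrow's first sending-window slot" as the first window opening after the UTC day rollover is an interpretation, not something the page states.

```python
from datetime import datetime, time, timedelta, timezone

# Fixed-offset zone for the example window; IST has no daylight saving time.
IST = timezone(timedelta(hours=5, minutes=30))


def next_window_slot(now: datetime, start: time, end: time, tz: timezone) -> datetime:
    """Return `now` if the window is open, otherwise the next opening time (as UTC)."""
    local = now.astimezone(tz)
    open_today = local.replace(hour=start.hour, minute=start.minute, second=0, microsecond=0)
    close_today = local.replace(hour=end.hour, minute=end.minute, second=0, microsecond=0)
    if local < open_today:
        slot = open_today                      # window opens later today
    elif local < close_today:
        slot = local                           # window is open right now
    else:
        slot = open_today + timedelta(days=1)  # closed for today, use tomorrow's opening
    return slot.astimezone(timezone.utc)


def first_slot_after_cap_reset(now: datetime, start: time, end: time, tz: timezone) -> datetime:
    """For a daily-cap deferral: first window slot once the UTC day has rolled over."""
    utc_now = now.astimezone(timezone.utc)
    next_utc_midnight = (utc_now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return next_window_slot(next_utc_midnight, start, end, tz)
```

A deferred job would keep status pending and have run_at set to the returned value, so the ordinary scheduler tick picks it up with no special handling.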
Incident response posture
If something goes wrong (cache-layer outage, LinkedIn integration downtime, durable-store latency spike), our incident response is built around three principles:
- Account safety first. Every safety gate fails-CLOSED. We'd rather defer 10,000 sends for an hour than burst 10 sends past a cap and lose customer accounts. Customers can always wait for sends; a banned LinkedIn account is unrecoverable.
- The durable job store is the source of truth. If the cache layer is down, we still have every pending job. Recovery is "wait for the cache to come back, then resume." No data loss, no double-sending.
- Idempotent recovery. Every recovery script we run is idempotent. Re-running it after a partial recovery does the right thing — it won't re-process rows that already healed.
Post-incident, we publish what happened, what the customer impact was, and what we changed to prevent recurrence — see the missed-accept, InMail count correction, and other troubleshooting pages for examples of past incidents and how they shaped the platform.
Common questions
Will my campaign keep sending if the cache layer goes down?
Email campaigns: yes, with degraded-mode pacing. The mailbox pacer falls back to a durable-store-only gate that still enforces the daily cap but loses strict atomicity under high concurrency. With single-worker deployments (typical), this is functionally equivalent to the cache fast path. LinkedIn campaigns: cap enforcement returns circuit-open, and the caller falls back to the durable-store count gate, which still blocks at or above the cap. Sends continue, with slightly higher per-send latency. When the cache recovers (typically within minutes), the cap counters cold-start-backfill from the durable store on the next miss, with no manual intervention.
How does the platform recover from a worker crash mid-send?
Every job in processing status has a processing_started_at timestamp. A watchdog scans for processing rows older than the configured timeout (15 minutes for email, longer for LinkedIn) and flips them back to pending for re-dispatch. The send is idempotent at the provider boundary: we use an idempotency key for LinkedIn, and SMTP duplicates are rare and are detected by message-ID dedup. Post-batch silent drops also auto-heal at the queue layer.
What's the typical end-to-end latency from "scheduled" to "sent"?
For a job scheduled with run_at = NOW(): the scheduler tick runs every 60 seconds, picks up the job, dispatches to the queue, the worker pulls it within milliseconds, and the actual send happens within a few seconds (provider API latency for LinkedIn; SMTP handshake for email). End-to-end: 60–90 seconds typical, ~3 minutes p99. Recurring heal latency is bounded by the 6-hour tick cadence — see recurring heals above for why we picked that cadence.
How do I know if my campaign is hitting the cap vs. sending normally?
The campaign detail page shows daily counts vs. cap for each action type (invites, messages, InMails). If the cap is hit, the badge color shifts and the "Why deferred?" section lists "daily cap" as the active gate. The platform also logs cap-block events that admin users can grep from the deployment console.
Does the recurring heal scheduler cost extra LinkedIn API quota?
No. The heal sweeps make zero new LinkedIn API calls — they only stamp internal sentinels and enqueue follow-ups via the existing send pipeline. The follow-ups themselves count against the campaign's daily message cap (one slot per follow-up), not invite cap. If the cap is exhausted, the follow-up defers to the next day.
What happens if both ticks (late-accepts + stuck-post-accepts) try to fire at the same time?
They don't, by design — the second tick is staggered 30 minutes after the first. If somehow both did fire, each holds its own distributed lock with a 30-minute TTL, so a second invocation while one is mid-flight returns "lock held" and skips. Multi-pod deployments use the same shared lock, so we never double-heal across pods. If the cache layer is unavailable, each tick falls back to a process-local lock (we may double-heal across pods during a cache outage, but the underlying helper is idempotent at the row level, so the worst case is a few duplicate audit rows — no double-sends).
Is the platform multi-region today?
No, currently single-region. Multi-region adds significant complexity (cap-counter consensus, webhook routing, replication latency) and we haven't seen the throughput need to justify it. If your campaign is highly latency-sensitive (e.g., regulatory windows in Europe), let us know and we'll prioritize accordingly.
How do I reach the team if I see a load issue?
Email hello@warmysender.com with the campaign name, time-window, and what behavior you're seeing (deferred status, slow sends, missed webhook). We'll dig into the queue depth and per-account ramp state for that workspace. For real-time issues, the Support page has the on-call rotation contact.
Related guides
- How invite, message, and InMail caps work together — Cap split semantics + per-account limits + worked examples
- LinkedIn rate limits — Per-account daily and weekly limits, ramp schedule
- What is a "late accept" — The SAFE-design tolerance + recurring heal sweeper
- If you suspect a missed LinkedIn accept — Webhook delivery + polling fallback
- Why isn't my LinkedIn campaign sending? — Common diagnoses for "campaign appears stuck"
- LinkedIn campaign documentation — Schedule, sending windows, ramp, acceptance lag, disconnect flow
- Full documentation — All 90+ guides
- Support — How to get in touch
Load handling principles apply across all four pillars; specifics differ per provider but the architecture invariants (durable source-of-truth, fast cache hot path, fail-CLOSED safety gates, idempotent recovery) are universal.