Scaling Email Infrastructure: How We Handle 1M+ Emails Monthly
The technical architecture, rate limiting strategies, and infrastructure decisions required to scale email sending from 10K to 1M+ emails per month reliably.
Sending 1,000 cold emails per month is straightforward - connect a mailbox, use any cold email tool, and you’re done. Scaling to 100,000+ emails monthly introduces entirely different technical challenges: rate limits, IP reputation management, queue architecture, and failure recovery.
WarmySender processes 1.2M+ emails monthly across warmup, campaigns, and LinkedIn automation. Building infrastructure that handles this volume reliably required solving dozens of scaling problems.
This deep-dive shares the technical architecture, rate limiting strategies, and hard-learned lessons from scaling email infrastructure 100x over 3 years.
The Three Scaling Bottlenecks
Most email infrastructure hits breaking points at predictable volumes:
Bottleneck 1: SMTP Connection Limits (10-20K emails/month)
SMTP servers limit concurrent connections per host. Gmail allows 15-20 concurrent connections per sending server. Exceed this, and you get 421 4.7.0 Too Many Connections errors.
Bottleneck 2: Provider Rate Limits (50-100K emails/month)
ISPs rate-limit per sending domain. Gmail’s initial limit is ~500 emails/day for new domains, scaling to 2,000-10,000/day for established senders. Cross this threshold prematurely, and soft bounces cascade.
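The ramp described above can be expressed as a simple lookup. This is an illustrative heuristic built from the figures in this article, not an official Gmail policy; the age thresholds are assumptions:

```typescript
// Rough sending-limit ramp for a Gmail domain. The ~500/day starting
// point and 2,000-10,000/day ceiling come from the article; the age
// cutoffs (30/90 days) are illustrative assumptions.
function gmailDailyLimit(domainAgeDays: number): number {
  if (domainAgeDays < 30) return 500   // new domain: start conservative
  if (domainAgeDays < 90) return 2_000 // established: moderate volume
  return 10_000                        // mature, warmed-up domain
}
```

The key point is that the limit is a function of sender history, so a scheduler should look it up per domain rather than hardcode one ceiling.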
Bottleneck 3: Queue Architecture (100K+ emails/month)
Basic queue systems (database polling) fail at 100K+ monthly volume. Database locks, orphaned jobs, and thundering herd problems emerge. You need distributed queue architecture (Redis + BullMQ).
WarmySender solved all three. Here’s how.
Architecture Overview: Two-Phase Queue System
WarmySender runs two queue architectures in parallel:
Phase 1: PostgreSQL-Based Scheduler (for predictability)
- Handles ~30-50K emails/month per workspace
- Job scheduling via PostgreSQL with cron-like ticks
- Direct SMTP sending from scheduler process
- Suitable for 90% of users
Phase 2: Redis Queue with BullMQ (for scale)
- Handles 100K+ emails/month per workspace
- Distributed job processing across multiple workers
- Advanced retry logic, delayed jobs, priority queues
- Automatically activated when volume exceeds Phase 1 capacity
Why two phases? Phase 1 is simpler, more predictable, and has lower infrastructure costs. Phase 2 is necessary for scale but adds complexity (Redis costs, worker management). Auto-switching gives small users simplicity and large users scale.
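The auto-switching decision can be sketched as a single threshold check. This is a minimal sketch, assuming the 50K/month cutoff mentioned later in this article; the function and type names are illustrative, not WarmySender's actual code:

```typescript
type Phase = 'postgres' | 'redis'

// Assumed cutoff: workspaces above ~50K emails/month move to Phase 2.
const PHASE_2_THRESHOLD = 50_000

function selectPhase(monthlySendVolume: number): Phase {
  // Small workspaces stay on the simpler PostgreSQL scheduler;
  // high-volume workspaces route to the Redis/BullMQ pipeline.
  return monthlySendVolume >= PHASE_2_THRESHOLD ? 'redis' : 'postgres'
}
```

Keeping the decision in one pure function makes the cutover testable and easy to tune.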
Phase 1: PostgreSQL Scheduler Architecture
Components:
- Job Tables (campaign_send_jobs, warmup_jobs)
  - Columns: id, status, run_at, mailbox_id, prospect_id, attempts
  - Indexed on status + run_at for fast queries
- Scheduler Process (runs every 60 seconds)
  - Query: SELECT * FROM jobs WHERE status='pending' AND run_at <= NOW() LIMIT 500
  - Group jobs by mailbox (rate limiting)
  - Execute up to 400 jobs per minute across all mailboxes
- SMTP Connection Pool
  - Maintain 5 connections per SMTP host
  - Connection pooler reuses connections across jobs
  - Timeout: 60 seconds per connection
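The "group jobs by mailbox" step in the scheduler can be sketched as a plain grouping function. The job shape here is illustrative (field names are assumptions), but the logic is the standard bucket-by-key pattern:

```typescript
// Minimal sketch of grouping fetched pending jobs by mailbox so
// per-mailbox rate limits can be applied before sending.
interface PendingJob {
  id: string
  mailboxId: string
  runAt: Date
}

function groupByMailbox(jobs: PendingJob[]): Map<string, PendingJob[]> {
  const groups = new Map<string, PendingJob[]>()
  for (const job of jobs) {
    const bucket = groups.get(job.mailboxId) ?? []
    bucket.push(job)
    groups.set(job.mailboxId, bucket)
  }
  return groups
}
```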
Rate limiting (Phase 1):
// Per-mailbox rate limit
const mailboxLimit = 40 // emails per day

// Per-provider rate limit (all mailboxes on the same provider)
const providerLimits = {
  gmail: 150,   // emails per minute
  outlook: 100, // emails per minute
  other: 30     // emails per minute
}

// Per-host SMTP connection limit
const smtpHostLimit = 5 // concurrent connections
Failure handling:
- Soft bounce (4xx SMTP code): Retry 3x with exponential backoff (5 min, 30 min, 2 hours)
- Hard bounce (5xx code): Mark failed immediately, don’t retry
- Timeout (60s): Retry 3x
- After 3 failed attempts: Mark permanently failed
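The retry schedule above (5 min, 30 min, 2 hours, then permanent failure) can be captured in a small lookup. A minimal sketch, with the attempt counter defined as the number of failures so far:

```typescript
// Soft-bounce retry delays matching the schedule described above.
const RETRY_DELAYS_MS = [
  5 * 60 * 1000,      // after 1st failure → retry in 5 minutes
  30 * 60 * 1000,     // after 2nd failure → retry in 30 minutes
  2 * 60 * 60 * 1000, // after 3rd failure → retry in 2 hours
]

// Returns the delay before the next retry, or null once the job
// should be marked permanently failed.
function nextRetryDelayMs(failedAttempts: number): number | null {
  if (failedAttempts < 1 || failedAttempts > RETRY_DELAYS_MS.length) {
    return null
  }
  return RETRY_DELAYS_MS[failedAttempts - 1]
}
```

An explicit table beats computed exponential backoff here because the delays were tuned by hand and do not follow a clean power series.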
Performance characteristics:
- Throughput: 400 emails/minute = 24,000/hour = 576,000/day (theoretical max)
- Actual: 30-50K/day (rate limits + safety margins)
- Latency: Jobs execute within 1-2 minutes of run_at time
Phase 2: Redis Queue Architecture
When a workspace exceeds 50K emails/month, the system auto-switches to Phase 2.
Components:
- BullMQ Queues (Redis-backed)
  - campaign-queue: Campaign email jobs
  - warmup-queue: Warmup email jobs
  - linkedin-queue: LinkedIn automation jobs
- Queue Workers (separate processes)
  - 3-5 workers per queue
  - Each worker processes jobs concurrently (5 jobs at a time)
  - Workers auto-scale based on queue depth
- Sync Worker (bridges PostgreSQL → Redis)
  - Polls PostgreSQL every 30 seconds for new pending jobs
  - Enqueues to Redis only for workspaces in Phase 2
  - Prevents duplicate enqueueing via status checks
Why sync worker? PostgreSQL remains source of truth for job state. Redis is execution layer. Sync worker keeps them aligned.
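The sync worker's dedupe logic reduces to "enqueue each pending row exactly once, then flip its status." A hedged sketch follows; the QueueLike interface stands in for BullMQ, and the status values and job name are assumptions, not WarmySender's actual schema:

```typescript
// Stand-in for a BullMQ queue; the real worker would call queue.add().
interface QueueLike {
  add(name: string, data: { jobId: string }): void
}

interface DbJob {
  id: string
  status: 'pending' | 'enqueued' | 'processing' | 'completed' | 'failed'
}

// Enqueue pending jobs into Redis exactly once, using the status column
// in PostgreSQL (the source of truth) as the duplicate-prevention check.
function syncPendingJobs(jobs: DbJob[], queue: QueueLike): number {
  let enqueued = 0
  for (const job of jobs) {
    if (job.status !== 'pending') continue // already enqueued or finished
    queue.add('send-email', { jobId: job.id })
    job.status = 'enqueued' // persisted via an UPDATE in the real system
    enqueued++
  }
  return enqueued
}
```

Because the status flip happens in PostgreSQL, a crashed sync worker can simply re-run the poll without double-enqueueing.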
Rate limiting (Phase 2):
Uses token bucket algorithm per mailbox:
class MailboxRateLimiter {
  tokens: Map<string, number> = new Map() // keyed by mailboxId
  maxTokens = 40 // daily limit per mailbox
  refillRate = 40 / (24 * 60) // tokens per minute

  async consumeToken(mailboxId: string): Promise<boolean> {
    const current = this.tokens.get(mailboxId) ?? this.maxTokens
    if (current >= 1) {
      this.tokens.set(mailboxId, current - 1)
      return true // proceed with send
    }
    return false // rate limited, delay job
  }

  refillTokens() {
    // Runs every minute
    for (const [mailboxId, tokens] of this.tokens) {
      this.tokens.set(
        mailboxId,
        Math.min(this.maxTokens, tokens + this.refillRate)
      )
    }
  }
}
Fairness algorithm: Prevents single workspace from consuming all worker capacity:
// Per-workspace concurrent job limit
const FAIRNESS_CAP = 50
if (activeJobsForWorkspace >= FAIRNESS_CAP) {
await job.moveToDelayed(Date.now() + 60000) // delay 1 minute
throw new Error('FAIRNESS_CAP_EXCEEDED')
}
Performance characteristics:
- Throughput: 1,000+ emails/minute = 60,000/hour = 1.4M/day (theoretical)
- Actual: 150-300K/day (rate limits + fairness caps)
- Latency: Jobs execute within 30-60 seconds of run_at time
SMTP Connection Pooling: Solving “Too Many Connections”
The early architecture created a new SMTP connection for each email. At 100 emails/minute, this overwhelmed SMTP servers.
Problem: Gmail limits you to 15-20 concurrent connections from a single IP.
Solution: Connection pooling with per-host limits:
class SMTPConnectionPool {
  pools: Map<string, Connection[]> = new Map() // keyed by SMTP host
  limits: Map<string, number> = new Map()

  async getConnection(mailbox: Mailbox): Promise<Connection> {
    const host = mailbox.smtpHost
    const limit = this.getHostLimit(host)
    const pool = this.pools.get(host) || []

    // Reuse an existing connection if available
    const available = pool.find(c => !c.inUse)
    if (available) {
      available.inUse = true
      return available
    }

    // Create a new connection if under the per-host limit
    if (pool.filter(c => c.inUse).length < limit) {
      const newConn = await this.createConnection(mailbox)
      newConn.inUse = true
      pool.push(newConn)
      this.pools.set(host, pool)
      return newConn
    }

    // Otherwise wait for a connection to become available
    return this.waitForConnection(host)
  }

  getHostLimit(host: string): number {
    // Gmail/Google Workspace
    if (host.includes('gmail.com') || host.includes('googlemail.com')) {
      return 15
    }
    // Outlook/Office365
    if (host.includes('outlook') || host.includes('office365')) {
      return 10
    }
    // Default for other providers
    return 5
  }
}
Result: Reduced 421 errors by 95%. Throughput increased from 150/min to 400/min.
Provider-Level Rate Limiting: Avoiding Thundering Herds
Individual mailbox limits (40/day) prevent per-account issues. But when you have 300+ mailboxes on Gmail, even if each stays under 40/day, the aggregate can trigger provider-level rate limits.
Problem: 300 Gmail mailboxes × 40/day = 12,000 Gmail emails/day. Gmail’s shared infrastructure sees all traffic from your sending IPs and may throttle.
Solution: Dynamic provider-level limits that scale with mailbox count:
function getProviderMinuteLimit(
  provider: string,
  mailboxCount: number
): number {
  const baseLimits: Record<string, number> = {
    gmail: 150, // emails per minute across ALL Gmail mailboxes
    outlook: 100,
    yahoo: 50,
    other: 30
  }
  const baseLimit = baseLimits[provider] ?? baseLimits.other

  // Scale the limit based on mailbox count
  const scaleFactor = Math.ceil(mailboxCount / 150)
  return baseLimit * scaleFactor
}

// Example:
// 50 Gmail mailboxes: 150/min (no scaling needed)
// 300 Gmail mailboxes: 300/min (2x scale factor)
// 600 Gmail mailboxes: 600/min (4x scale factor)
Jitter injection prevents synchronized retries (thundering herd):
// When rate limit hit, delay randomly
const delayMs = Math.random() * (5 * 60 * 1000) // 0-5 minutes
await job.moveToDelayed(Date.now() + delayMs)
Result: Eliminated rate limit cascades. Previously, 371 Hostinger mailboxes would hit limits simultaneously and retry together 60 seconds later, creating an infinite loop. Jitter spreads retries over 5 minutes.
Timeout Management: Preventing Scheduler Hangs
SMTP operations can hang indefinitely if the remote server stops responding. Without timeouts, the scheduler process deadlocks.
Three-layer timeout architecture:
Layer 1: SMTP Connection Timeout (60s)
const transporter = nodemailer.createTransport({
  host: mailbox.smtpHost,
  port: mailbox.smtpPort,
  connectionTimeout: 60000, // 60s
  greetingTimeout: 30000,   // 30s
  socketTimeout: 60000      // 60s
})
Layer 2: Per-Mailbox Timeout (2 minutes)
// Wrap the entire send operation
async function sendWithTimeout(mailbox, message) {
  const timeoutPromise = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('MAILBOX_TIMEOUT')), 120000)
  )
  const sendPromise = sendEmail(mailbox, message)
  return Promise.race([sendPromise, timeoutPromise])
}
Layer 3: Task-Level Timeout (5 minutes)
// The entire scheduler tick must complete in 5 minutes
async function schedulerTick() {
  const tickTimeout = setTimeout(() => {
    logger.error('Scheduler tick exceeded 5 minutes')
    process.exit(1) // Force restart
  }, 300000)
  try {
    await executePendingJobs()
  } finally {
    clearTimeout(tickTimeout)
  }
}
Result: Eliminated scheduler hangs completely. Previously, a single unresponsive SMTP server could freeze the entire scheduler for hours.
Recovery Systems: Handling Failures Gracefully
At scale, failures are constant. Infrastructure must recover automatically.
1. Stuck Job Recovery
Jobs can get stuck in "processing" status if a worker crashes mid-execution.
Recovery logic (runs every 5 minutes):
async function recoverStuckJobs() {
  const cutoff = new Date(Date.now() - 10 * 60 * 1000) // 10 min ago
  const stuckJobs = await db
    .select()
    .from(campaignSendJobs)
    .where(
      and(
        eq(campaignSendJobs.status, 'processing'),
        lt(campaignSendJobs.runAt, cutoff)
      )
    )

  for (const job of stuckJobs) {
    if (job.attempts >= 3) {
      // Permanent failure after 3 attempts
      await db.update(campaignSendJobs)
        .set({ status: 'failed' })
        .where(eq(campaignSendJobs.id, job.id))
    } else {
      // Retry in 5 minutes
      await db.update(campaignSendJobs)
        .set({
          status: 'pending',
          runAt: new Date(Date.now() + 5 * 60 * 1000),
          attempts: job.attempts + 1
        })
        .where(eq(campaignSendJobs.id, job.id))
    }
  }
}
2. Mailbox Health Checks
Mailboxes can become unusable (password changed, OAuth token expired) mid-campaign.
Health check (runs before each send attempt):
async function ensureMailboxUsable(mailboxId: string): Promise<boolean> {
  const mailbox = await getMailbox(mailboxId)

  // Check status
  if (mailbox.status !== 'connected') {
    logger.warn(`Mailbox ${mailboxId} status: ${mailbox.status}`)
    return false
  }

  // Check OAuth token expiry (for OAuth mailboxes)
  if (mailbox.authType === 'oauth' && mailbox.accessTokenExpiresAt) {
    if (new Date(mailbox.accessTokenExpiresAt) < new Date()) {
      await refreshOAuthToken(mailbox)
    }
  }
  return true
}
3. Circuit Breaker for Bad SMTP Hosts
Some SMTP servers become unresponsive (network issues, DDoS, maintenance). Continuing to send wastes worker capacity.
Circuit breaker pattern:
const HOST_CIRCUIT_BREAKER = new Map<string, {
  failures: number
  openedAt: Date | null
}>() // keyed by SMTP host

async function sendWithCircuitBreaker(mailbox, message) {
  const host = mailbox.smtpHost
  const breaker = HOST_CIRCUIT_BREAKER.get(host) || {
    failures: 0,
    openedAt: null
  }
  // Store the breaker up front so state changes persist even
  // when this function exits by throwing
  HOST_CIRCUIT_BREAKER.set(host, breaker)

  // If circuit open, skip send
  if (breaker.openedAt) {
    const minutesSinceOpen =
      (Date.now() - breaker.openedAt.getTime()) / 60000
    if (minutesSinceOpen < 5) {
      throw new Error('CIRCUIT_BREAKER_OPEN')
    }
    // Half-open: try again after 5 minutes
    breaker.openedAt = null
    breaker.failures = 0
  }

  try {
    await sendEmail(mailbox, message)
    breaker.failures = 0 // Reset on success
  } catch (error) {
    breaker.failures++
    if (breaker.failures >= 5) {
      breaker.openedAt = new Date()
      logger.warn(`Circuit breaker OPEN for ${host}`)
    }
    throw error
  }
}
Result: Reduced wasted timeout attempts by 80%. Bad hosts detected within 5 minutes instead of consuming worker capacity for hours.
Data Retention: Managing 1M+ Job Records
At 1M+ emails monthly, job tables grow to millions of rows. Queries slow without aggressive cleanup.
Retention policy:
- Keep completed jobs for 7 days
- Keep failed jobs for 30 days (for debugging)
- Delete expired jobs immediately (rescheduled, no longer relevant)
Cleanup job (runs every 5 minutes):
async function cleanupOldJobs() {
  // Delete completed jobs older than 7 days
  await db.delete(campaignSendJobs)
    .where(
      and(
        eq(campaignSendJobs.status, 'completed'),
        lt(
          campaignSendJobs.scheduledDayUtc,
          new Date(Date.now() - 7 * 24 * 60 * 60 * 1000)
        )
      )
    )

  // Delete failed jobs older than 30 days
  await db.delete(campaignSendJobs)
    .where(
      and(
        eq(campaignSendJobs.status, 'failed'),
        lt(
          campaignSendJobs.scheduledDayUtc,
          new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)
        )
      )
    )

  // Delete expired jobs immediately
  await db.delete(campaignSendJobs)
    .where(eq(campaignSendJobs.status, 'expired'))
}
Indexing strategy:
-- Fast cleanup queries
CREATE INDEX idx_jobs_cleanup
ON campaign_send_jobs(status, scheduled_day_utc);
-- Fast scheduler queries
CREATE INDEX idx_jobs_pending
ON campaign_send_jobs(status, run_at)
WHERE status = 'pending';
Result: Cleanup deletes 50K-100K rows per day in <2 seconds. Scheduler queries stay under 100ms even with 5M+ total job records.
Monitoring and Alerting
Infrastructure at scale requires real-time visibility.
Key metrics tracked:
- Queue depth (pending jobs per queue)
- Worker throughput (jobs/minute per worker)
- Error rates (by error type: bounce, timeout, auth failure)
- SMTP latency (p50, p95, p99)
- Rate limit hits (by provider)
Alerts configured:
- Queue depth > 10,000 for 10+ minutes (capacity issue)
- Error rate > 5% for 5+ minutes (systematic failure)
- Worker throughput drops by 50% (worker crash)
- SMTP p99 latency > 30 seconds (network issue)
Dashboard: Real-time Grafana dashboard shows all metrics with 1-minute resolution.
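The queue-depth alert rule above ("depth > 10,000 for 10+ minutes") can be sketched as a window check over metric samples. The thresholds mirror the article; the sample shape and function name are illustrative:

```typescript
// One queue-depth observation from the metrics pipeline.
interface DepthSample {
  timestampMs: number
  depth: number
}

// Fires only when every sample in the trailing window exceeds the
// threshold, so a brief spike does not page anyone.
function queueDepthAlertFiring(
  samples: DepthSample[],
  nowMs: number,
  threshold = 10_000,
  windowMs = 10 * 60 * 1000
): boolean {
  const window = samples.filter(s => nowMs - s.timestampMs <= windowMs)
  return window.length > 0 && window.every(s => s.depth > threshold)
}
```

In practice this kind of "for duration" condition is usually expressed directly in the alerting system (e.g. a Grafana alert rule), but the semantics are the same.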
Lessons Learned: What We’d Do Differently
1. Start with Phase 2 architecture
We built Phase 1 first, thinking it would scale further than it did. Migrating to Phase 2 took 6 months. If starting today, we’d use Redis from day one.
2. Circuit breakers from the beginning
We added circuit breakers only after an outage at the 6-month mark, caused by 3 unresponsive SMTP hosts. They should have been in the initial architecture.
3. More aggressive data retention
Our first retention policy kept jobs for 90 days. This caused table bloat (20M+ rows). Reducing to 7-30 days improved query speed 10x.
Conclusion: Infrastructure Before Features
Most cold email tools focus on features (AI personalization, A/B testing, analytics). Few invest in infrastructure that scales reliably.
WarmySender’s infrastructure handles 1M+ monthly emails because we prioritized:
- Two-phase queue architecture (simplicity for small users, scale for large)
- Multi-layer rate limiting (mailbox, provider, SMTP host)
- Comprehensive timeout management (SMTP, mailbox, task-level)
- Automatic recovery systems (stuck jobs, bad mailboxes, circuit breakers)
- Aggressive data retention (keep tables lean)
The result? 99.8% uptime over the past 12 months, even while processing 40M+ emails.
Scaling your own email infrastructure? The patterns above work for any system sending 100K+ emails monthly. Start with rate limiting and timeouts - those two prevent 80% of scaling failures.
Want infrastructure that scales without the engineering effort? Try WarmySender.com - all the scale, none of the operational overhead.