# Scaling Email Infrastructure: How We Handle 1M+ Emails Monthly
Sending 1,000 cold emails per month is straightforward: connect a mailbox, use any cold email tool, and you're done. Scaling to 100,000+ emails monthly introduces entirely different technical challenges: rate limits, IP reputation management, queue architecture, and failure recovery.
WarmySender processes 1.2M+ emails monthly across warmup, campaigns, and LinkedIn automation. Building infrastructure that handles this volume reliably required solving dozens of scaling problems.
This deep-dive shares the technical architecture, rate limiting strategies, and hard-learned lessons from scaling email infrastructure 100x over 3 years.
## The Three Scaling Bottlenecks
Most email infrastructure hits breaking points at predictable volumes:
**Bottleneck 1: SMTP Connection Limits (10-20K emails/month)**
SMTP servers limit concurrent connections per host. Gmail allows 15-20 concurrent connections per sending server. Exceed this, and you get `421 4.7.0 Too Many Connections` errors.
**Bottleneck 2: Provider Rate Limits (50-100K emails/month)**
ISPs rate-limit per sending domain. Gmail's initial limit is ~500 emails/day for new domains, scaling to 2,000-10,000/day for established senders. Cross this threshold prematurely, and soft bounces cascade.
**Bottleneck 3: Queue Architecture (100K+ emails/month)**
Basic queue systems (database polling) fail at 100K+ monthly volume. Database locks, orphaned jobs, and thundering herd problems emerge. You need distributed queue architecture (Redis + BullMQ).
WarmySender solved all three. Here's how.
## Architecture Overview: Two-Phase Queue System
WarmySender runs two queue architectures in parallel:
**Phase 1: PostgreSQL-Based Scheduler** (for predictability)
- Handles ~30-50K emails/month per workspace
- Job scheduling via PostgreSQL with cron-like ticks
- Direct SMTP sending from scheduler process
- Suitable for 90% of users
**Phase 2: Redis Queue with BullMQ** (for scale)
- Handles 100K+ emails/month per workspace
- Distributed job processing across multiple workers
- Advanced retry logic, delayed jobs, priority queues
- Automatically activated when volume exceeds Phase 1 capacity
**Why two phases?** Phase 1 is simpler, more predictable, and has lower infrastructure costs. Phase 2 is necessary for scale but adds complexity (Redis costs, worker management). Auto-switching gives small users simplicity and large users scale.
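The switching rule itself is simple: route a workspace by its monthly volume against Phase 1's capacity. A minimal sketch (the function and constant names are illustrative, not WarmySender's actual code):

```typescript
// Hypothetical sketch of the auto-switching decision. The 50K threshold
// reflects the Phase 1 capacity described above.
type QueuePhase = 'postgres-scheduler' | 'redis-bullmq'

const PHASE1_MONTHLY_CAPACITY = 50_000

function selectQueuePhase(monthlyEmailVolume: number): QueuePhase {
  return monthlyEmailVolume > PHASE1_MONTHLY_CAPACITY
    ? 'redis-bullmq'
    : 'postgres-scheduler'
}
```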
## Phase 1: PostgreSQL Scheduler Architecture
**Components:**
1. **Job Tables** (`campaign_send_jobs`, `warmup_jobs`)
- Columns: `id`, `status`, `run_at`, `mailbox_id`, `prospect_id`, `attempts`
- Indexed on `status + run_at` for fast queries
2. **Scheduler Process** (runs every 60 seconds)
- Query: `SELECT * FROM jobs WHERE status='pending' AND run_at <= NOW() LIMIT 500`
- Group jobs by mailbox (rate limiting)
- Execute up to 400 jobs per minute across all mailboxes
3. **SMTP Connection Pool**
- Maintain 5 connections per SMTP host
- Connection pooler reuses connections across jobs
- Timeout: 60 seconds per connection
**Rate limiting (Phase 1)**:
```typescript
// Per-mailbox rate limit: 40 emails per day
const MAILBOX_DAILY_LIMIT = 40

// Per-provider rate limits (emails/minute across all mailboxes on a provider)
const PROVIDER_MINUTE_LIMITS = {
  gmail: 150,
  outlook: 100,
  other: 30
}

// Per-host SMTP connection limit: 5 concurrent connections
const SMTP_HOST_CONNECTION_LIMIT = 5
```
**Failure handling**:
- Soft bounce (4xx SMTP code): Retry 3x with exponential backoff (5 min, 30 min, 2 hours)
- Hard bounce (5xx code): Mark failed immediately, don't retry
- Timeout (60s): Retry 3x
- After 3 failed attempts: Mark permanently failed
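The bounce-handling rules above reduce to a small decision function. A sketch under the stated policy (names are illustrative; timeouts follow the same retry schedule as soft bounces):

```typescript
// Backoff schedule from the policy above: 5 min, 30 min, 2 hours
const RETRY_DELAYS_MS = [5 * 60_000, 30 * 60_000, 2 * 60 * 60_000]

type RetryDecision =
  | { action: 'retry'; delayMs: number }
  | { action: 'fail' }

function decideRetry(smtpCode: number, attempts: number): RetryDecision {
  // 4xx replies are transient (soft bounce); 5xx are permanent (hard bounce)
  const isSoftBounce = smtpCode >= 400 && smtpCode < 500
  if (!isSoftBounce) return { action: 'fail' } // hard bounce: never retry
  if (attempts >= RETRY_DELAYS_MS.length) return { action: 'fail' }
  return { action: 'retry', delayMs: RETRY_DELAYS_MS[attempts] }
}
```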
**Performance characteristics**:
- Throughput: 400 emails/minute = 24,000/hour = 576,000/day (theoretical max)
- Actual: 30-50K/day (rate limits + safety margins)
- Latency: Jobs execute within 1-2 minutes of `run_at` time
## Phase 2: Redis Queue Architecture
When workspaces exceed 50K emails/month, system auto-switches to Phase 2.
**Components:**
1. **BullMQ Queues** (Redis-backed)
- `campaign-queue`: Campaign email jobs
- `warmup-queue`: Warmup email jobs
- `linkedin-queue`: LinkedIn automation jobs
2. **Queue Workers** (separate processes)
- 3-5 workers per queue
- Each worker processes jobs concurrently (5 jobs at a time)
- Workers auto-scale based on queue depth
3. **Sync Worker** (bridges PostgreSQL → Redis)
- Polls PostgreSQL every 30 seconds for new pending jobs
- Enqueues to Redis only for workspaces in Phase 2
- Prevents duplicate enqueueing via status checks
**Why sync worker?** PostgreSQL remains source of truth for job state. Redis is execution layer. Sync worker keeps them aligned.
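The sync worker's dedup step can be sketched as a pure filter over the polled rows. The `enqueuedToRedis` flag and function name here are illustrative; in BullMQ, passing the PostgreSQL job id as the `jobId` option also makes the enqueue idempotent, since a second `queue.add()` with the same `jobId` is ignored:

```typescript
// Hypothetical row shape mirroring the PostgreSQL job tables
interface JobRow {
  id: string
  status: string
  enqueuedToRedis: boolean
}

// Only pending rows not yet mirrored to Redis get enqueued;
// rows already claimed or completed are skipped.
function rowsToEnqueue(rows: JobRow[]): JobRow[] {
  return rows.filter(r => r.status === 'pending' && !r.enqueuedToRedis)
}
```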
**Rate limiting (Phase 2)**:
Uses token bucket algorithm per mailbox:
```typescript
class MailboxRateLimiter {
  private tokens: Map<string, number> = new Map()
  private readonly maxTokens = 40 // daily limit per mailbox
  private readonly refillRate = 40 / (24 * 60) // tokens per minute

  async consumeToken(mailboxId: string): Promise<boolean> {
    const current = this.tokens.get(mailboxId) ?? this.maxTokens
    if (current >= 1) {
      this.tokens.set(mailboxId, current - 1)
      return true // proceed with send
    }
    return false // rate limited, delay job
  }

  refillTokens() {
    // Runs every minute
    for (const [mailboxId, tokens] of this.tokens) {
      this.tokens.set(
        mailboxId,
        Math.min(this.maxTokens, tokens + this.refillRate)
      )
    }
  }
}
```
**Fairness algorithm**: Prevents single workspace from consuming all worker capacity:
```typescript
// Per-workspace concurrent job limit
const FAIRNESS_CAP = 50
if (activeJobsForWorkspace >= FAIRNESS_CAP) {
await job.moveToDelayed(Date.now() + 60000) // delay 1 minute
throw new Error('FAIRNESS_CAP_EXCEEDED')
}
```
**Performance characteristics**:
- Throughput: 1,000+ emails/minute = 60,000/hour = 1.4M/day (theoretical)
- Actual: 150-300K/day (rate limits + fairness caps)
- Latency: Jobs execute within 30-60 seconds of `run_at` time
## SMTP Connection Pooling: Solving "Too Many Connections"
Early architecture created new SMTP connection for each email. At 100 emails/minute, this overwhelmed SMTP servers.
**Problem**: Gmail limits to 15-20 concurrent connections from single IP.
**Solution**: Connection pooling with per-host limits:
```typescript
interface PooledConnection {
  inUse: boolean
  // ...underlying SMTP transport
}

class SMTPConnectionPool {
  pools: Map<string, PooledConnection[]> = new Map()

  async getConnection(mailbox: Mailbox): Promise<PooledConnection> {
    const host = mailbox.smtpHost
    const limit = this.getHostLimit(host)
    const pool = this.pools.get(host) || []
    // Reuse an existing idle connection if available
    const available = pool.find(c => !c.inUse)
    if (available) {
      available.inUse = true
      return available
    }
    // Create a new connection if under the per-host limit
    if (pool.filter(c => c.inUse).length < limit) {
      const newConn = await this.createConnection(mailbox)
      newConn.inUse = true
      pool.push(newConn)
      this.pools.set(host, pool)
      return newConn
    }
    // Otherwise wait for a connection to become available
    return this.waitForConnection(host)
  }

  getHostLimit(host: string): number {
    // Gmail / Google Workspace
    if (host.includes('gmail.com') || host.includes('googlemail.com')) {
      return 15
    }
    // Outlook / Office 365
    if (host.includes('outlook') || host.includes('office365')) {
      return 10
    }
    // Default for other providers
    return 5
  }
}
```
**Result**: Reduced 421 errors by 95%. Throughput increased from 150/min to 400/min.
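The `waitForConnection` call in the pool above isn't shown. One possible shape is a poll-until-free loop with a hard deadline, so a saturated host can't block a worker indefinitely (a generic sketch; names and timings are assumptions, not the production implementation):

```typescript
// Poll tryAcquire until it yields a resource or the deadline passes.
// tryAcquire would, e.g., scan the host's pool for an idle connection.
async function waitFor<T>(
  tryAcquire: () => T | undefined,
  timeoutMs = 60_000,
  pollMs = 250
): Promise<T> {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    const resource = tryAcquire()
    if (resource !== undefined) return resource
    await new Promise(resolve => setTimeout(resolve, pollMs))
  }
  throw new Error('POOL_WAIT_TIMEOUT')
}
```

A production version would likely use an event or promise queue instead of polling, but the deadline is the important part: it bounds how long a send job can wait on a busy host.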
## Provider-Level Rate Limiting: Avoiding Thundering Herds
Individual mailbox limits (40/day) prevent per-account issues. But when you have 300+ mailboxes on Gmail, even if each stays under 40/day, the aggregate can trigger provider-level rate limits.
**Problem**: 300 Gmail mailboxes × 40/day = 12,000 Gmail emails/day. Gmail's shared infrastructure sees all traffic from your sending IPs and may throttle.
**Solution**: Dynamic provider-level limits that scale with mailbox count:
```typescript
function getProviderMinuteLimit(
  provider: string,
  mailboxCount: number
): number {
  const baseLimits: Record<string, number> = {
    gmail: 150, // emails per minute across ALL Gmail mailboxes
    outlook: 100,
    yahoo: 50,
    other: 30
  }
  const baseLimit = baseLimits[provider] ?? baseLimits.other
  // Scale the limit with mailbox count: one extra "unit" per 150 mailboxes
  const scaleFactor = Math.ceil(mailboxCount / 150)
  return baseLimit * scaleFactor
}
// Example:
// 50 Gmail mailboxes: 150/min (no scaling needed)
// 300 Gmail mailboxes: 300/min (2x scale factor)
// 600 Gmail mailboxes: 600/min (4x scale factor)
```
**Jitter injection** prevents synchronized retries (thundering herd):
```typescript
// When rate limit hit, delay randomly
const delayMs = Math.random() * (5 * 60 * 1000) // 0-5 minutes
await job.moveToDelayed(Date.now() + delayMs)
```
**Result**: Eliminated rate limit cascades. Previously, 371 Hostinger mailboxes would hit limits simultaneously and retry together 60 seconds later, creating an infinite retry loop. Jitter spreads those retries across a 5-minute window.
## Timeout Management: Preventing Scheduler Hangs
SMTP operations can hang indefinitely if remote server stops responding. Without timeouts, scheduler process deadlocks.
**Three-layer timeout architecture:**
**Layer 1: SMTP Connection Timeout (60s)**
```typescript
const transporter = nodemailer.createTransport({
host: mailbox.smtpHost,
port: mailbox.smtpPort,
connectionTimeout: 60000, // 60s
greetingTimeout: 30000, // 30s
socketTimeout: 60000 // 60s
})
```
**Layer 2: Per-Mailbox Timeout (2 minutes)**
```typescript
// Wrap the entire send operation in a hard 2-minute deadline
async function sendWithTimeout(mailbox: Mailbox, message: Message) {
  let timer: NodeJS.Timeout | undefined
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('MAILBOX_TIMEOUT')), 120000)
  })
  try {
    return await Promise.race([sendEmail(mailbox, message), timeoutPromise])
  } finally {
    clearTimeout(timer) // don't leave the timer pending after the race
  }
}
```
**Layer 3: Task-Level Timeout (5 minutes)**
```typescript
// Entire scheduler tick must complete in 5 minutes
async function schedulerTick() {
const tickTimeout = setTimeout(() => {
logger.error('Scheduler tick exceeded 5 minutes')
process.exit(1) // Force restart
}, 300000)
try {
await executePendingJobs()
} finally {
clearTimeout(tickTimeout)
}
}
```
**Result**: Eliminated scheduler hangs completely. Previously, 1 unresponsive SMTP server could freeze entire scheduler for hours.
## Recovery Systems: Handling Failures Gracefully
At scale, failures are constant. Infrastructure must recover automatically.
### 1. Stuck Job Recovery
Jobs can get stuck in "processing" status if worker crashes mid-execution.
**Recovery logic** (runs every 5 minutes):
```typescript
async function recoverStuckJobs() {
const cutoff = new Date(Date.now() - 10 * 60 * 1000) // 10 min ago
const stuckJobs = await db
.select()
.from(campaignSendJobs)
.where(
and(
eq(campaignSendJobs.status, 'processing'),
lt(campaignSendJobs.runAt, cutoff)
)
)
for (const job of stuckJobs) {
if (job.attempts >= 3) {
// Permanent failure after 3 attempts
await db.update(campaignSendJobs)
.set({ status: 'failed' })
.where(eq(campaignSendJobs.id, job.id))
} else {
// Retry
await db.update(campaignSendJobs)
.set({
status: 'pending',
runAt: new Date(Date.now() + 5 * 60 * 1000),
attempts: job.attempts + 1
})
.where(eq(campaignSendJobs.id, job.id))
}
}
}
```
### 2. Mailbox Health Checks
Mailboxes can become unusable (password changed, OAuth token expired) mid-campaign.
**Health check** (runs before each send attempt):
```typescript
async function ensureMailboxUsable(mailboxId: string): Promise<boolean> {
const mailbox = await getMailbox(mailboxId)
// Check status
if (mailbox.status !== 'connected') {
logger.warn(`Mailbox ${mailboxId} status: ${mailbox.status}`)
return false
}
// Check OAuth token expiry (for OAuth mailboxes)
if (mailbox.authType === 'oauth' && mailbox.accessTokenExpiresAt) {
if (new Date(mailbox.accessTokenExpiresAt) < new Date()) {
await refreshOAuthToken(mailbox)
}
}
return true
}
```
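The `refreshOAuthToken` call above isn't shown. For Google-hosted mailboxes, a refresh is a single POST to Google's OAuth 2.0 token endpoint; a hedged sketch (the `OAuthMailbox` shape is an assumption, and error handling is minimal):

```typescript
// Hypothetical mailbox credential shape; field names are illustrative.
interface OAuthMailbox {
  refreshToken: string
  clientId: string
  clientSecret: string
}

// Exchange a long-lived refresh token for a fresh access token
// (endpoint and parameters per Google's OAuth 2.0 documentation).
async function refreshOAuthToken(mailbox: OAuthMailbox) {
  const res = await fetch('https://oauth2.googleapis.com/token', {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams({
      grant_type: 'refresh_token',
      refresh_token: mailbox.refreshToken,
      client_id: mailbox.clientId,
      client_secret: mailbox.clientSecret
    })
  })
  if (!res.ok) throw new Error(`TOKEN_REFRESH_FAILED: ${res.status}`)
  return res.json() // { access_token, expires_in, ... }
}
```

A real implementation would also persist the new token and expiry back to the mailbox record, and mark the mailbox `disconnected` if the refresh itself fails.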
### 3. Circuit Breaker for Bad SMTP Hosts
Some SMTP servers become unresponsive (network issues, DDoS, maintenance). Continuing to send wastes worker capacity.
**Circuit breaker pattern**:
```typescript
const HOST_CIRCUIT_BREAKER = new Map<
  string,
  { failures: number; openedAt: Date | null }
>()

async function sendWithCircuitBreaker(mailbox: Mailbox, message: Message) {
  const host = mailbox.smtpHost
  let breaker = HOST_CIRCUIT_BREAKER.get(host)
  if (!breaker) {
    breaker = { failures: 0, openedAt: null }
    HOST_CIRCUIT_BREAKER.set(host, breaker) // register before any early throw
  }
// If circuit open, skip send
if (breaker.openedAt) {
const minutesSinceOpen =
(Date.now() - breaker.openedAt.getTime()) / 60000
if (minutesSinceOpen < 5) {
throw new Error('CIRCUIT_BREAKER_OPEN')
} else {
// Try again after 5 minutes
breaker.openedAt = null
breaker.failures = 0
}
}
try {
await sendEmail(mailbox, message)
breaker.failures = 0 // Reset on success
} catch (error) {
breaker.failures++
if (breaker.failures >= 5) {
breaker.openedAt = new Date()
logger.warn(`Circuit breaker OPEN for ${host}`)
}
throw error
}
HOST_CIRCUIT_BREAKER.set(host, breaker)
}
```
**Result**: Reduced wasted timeout attempts by 80%. Bad hosts detected within 5 minutes instead of consuming worker capacity for hours.
## Data Retention: Managing 1M+ Job Records
At 1M+ emails monthly, job tables grow to millions of rows. Queries slow without aggressive cleanup.
**Retention policy**:
- Keep `completed` jobs for 7 days
- Keep `failed` jobs for 30 days (for debugging)
- Delete immediately: `expired` jobs (rescheduled, no longer relevant)
**Cleanup job** (runs every 5 minutes):
```typescript
async function cleanupOldJobs() {
// Delete completed jobs older than 7 days
await db.delete(campaignSendJobs)
.where(
and(
eq(campaignSendJobs.status, 'completed'),
lt(
campaignSendJobs.scheduledDayUtc,
new Date(Date.now() - 7 * 24 * 60 * 60 * 1000)
)
)
)
// Delete failed jobs older than 30 days
await db.delete(campaignSendJobs)
.where(
and(
eq(campaignSendJobs.status, 'failed'),
lt(
campaignSendJobs.scheduledDayUtc,
new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)
)
)
)
// Delete expired jobs immediately
await db.delete(campaignSendJobs)
.where(eq(campaignSendJobs.status, 'expired'))
}
```
**Indexing strategy**:
```sql
-- Fast cleanup queries
CREATE INDEX idx_jobs_cleanup
ON campaign_send_jobs(status, scheduled_day_utc);
-- Fast scheduler queries
CREATE INDEX idx_jobs_pending
ON campaign_send_jobs(status, run_at)
WHERE status = 'pending';
```
**Result**: Cleanup deletes 50K-100K rows per day in <2 seconds. Scheduler queries stay under 100ms even with 5M+ total job records.
## Monitoring and Alerting
Infrastructure at scale requires real-time visibility.
**Key metrics tracked**:
- Queue depth (pending jobs per queue)
- Worker throughput (jobs/minute per worker)
- Error rates (by error type: bounce, timeout, auth failure)
- SMTP latency (p50, p95, p99)
- Rate limit hits (by provider)
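The latency percentiles above (p50, p95, p99) can be computed from raw SMTP send durations with the nearest-rank method. A minimal sketch (the helper name is illustrative; a production system would use its metrics library's histogram instead):

```typescript
// Nearest-rank percentile: sort the samples, take the value at
// ceil(p% * n), 1-indexed. Deterministic and simple, at the cost of
// re-sorting per call.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples')
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, rank - 1)]
}
```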
**Alerts configured**:
- Queue depth > 10,000 for 10+ minutes (capacity issue)
- Error rate > 5% for 5+ minutes (systematic failure)
- Worker throughput drops by 50% (worker crash)
- SMTP p99 latency > 30 seconds (network issue)
**Dashboard**: Real-time Grafana dashboard shows all metrics with 1-minute resolution.
## Lessons Learned: What We'd Do Differently
**1. Start with Phase 2 architecture**
We built Phase 1 first, thinking it would scale longer. It didn't. Migrating to Phase 2 took 6 months. If starting today, we'd use Redis from day one.
**2. Circuit breakers from the beginning**
We added circuit breakers only after an outage, six months in, caused by 3 unresponsive SMTP hosts. They should have been part of the initial architecture.
**3. More aggressive data retention**
Our first retention policy kept jobs for 90 days. This caused table bloat (20M+ rows). Reducing to 7-30 days improved query speed 10x.
## Conclusion: Infrastructure Before Features
Most cold email tools focus on features (AI personalization, A/B testing, analytics). Few invest in infrastructure that scales reliably.
WarmySender's infrastructure handles 1M+ monthly emails because we prioritized:
- Two-phase queue architecture (simplicity for small users, scale for large)
- Multi-layer rate limiting (mailbox, provider, SMTP host)
- Comprehensive timeout management (SMTP, mailbox, task-level)
- Automatic recovery systems (stuck jobs, bad mailboxes, circuit breakers)
- Aggressive data retention (keep tables lean)
The result? 99.8% uptime over the past 12 months, even while processing 40M+ emails.
**Scaling your own email infrastructure?** The patterns above work for any system sending 100K+ emails monthly. Start with rate limiting and timeouts; those two prevent 80% of scaling failures.
**Want infrastructure that scales without the engineering effort?** Try [WarmySender.com](https://warmysender.com) - all the scale, none of the operational overhead.