
Scaling Email Infrastructure: How We Handle 1M+ Emails Monthly

The technical architecture, rate limiting strategies, and infrastructure decisions required to scale email sending from 10K to 1M+ emails per month reliably.

By Sarah Mitchell • February 5, 2026
Sending 1,000 cold emails per month is straightforward: connect a mailbox, use any cold email tool, and you're done. Scaling to 100,000+ emails monthly introduces entirely different technical challenges: rate limits, IP reputation management, queue architecture, and failure recovery.

WarmySender processes 1.2M+ emails monthly across warmup, campaigns, and LinkedIn automation. Building infrastructure that handles this volume reliably required solving dozens of scaling problems. This deep dive shares the technical architecture, rate limiting strategies, and hard-learned lessons from scaling email infrastructure 100x over 3 years.

## The Three Scaling Bottlenecks

Most email infrastructure hits breaking points at predictable volumes.

**Bottleneck 1: SMTP Connection Limits (10-20K emails/month)**

SMTP servers limit concurrent connections per host. Gmail allows 15-20 concurrent connections per sending server. Exceed this, and you get `421 4.7.0 Too Many Connections` errors.

**Bottleneck 2: Provider Rate Limits (50-100K emails/month)**

ISPs rate-limit per sending domain. Gmail's initial limit is ~500 emails/day for new domains, scaling to 2,000-10,000/day for established senders. Cross this threshold prematurely, and soft bounces cascade.

**Bottleneck 3: Queue Architecture (100K+ emails/month)**

Basic queue systems (database polling) fail at 100K+ monthly volume. Database locks, orphaned jobs, and thundering herd problems emerge. You need distributed queue architecture (Redis + BullMQ).

WarmySender solved all three. Here's how.
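As a rule of thumb, the thresholds above can be collapsed into a small helper. This is an illustrative sketch using only the volume ranges quoted in this section; the function name and boundaries are hypothetical simplifications:

```typescript
// Hypothetical helper: map monthly send volume to the bottleneck
// you are likely fighting next, per the thresholds above.
function nextBottleneck(monthlyVolume: number): string {
  if (monthlyVolume < 10_000) return 'none yet'
  if (monthlyVolume < 50_000) return 'smtp-connection-limits'
  if (monthlyVolume < 100_000) return 'provider-rate-limits'
  return 'queue-architecture'
}
```

The point of the boundaries is that each bottleneck demands a different fix (pooling, rate limiting, queue redesign), so knowing which one is next tells you what to build first.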
## Architecture Overview: Two-Phase Queue System

WarmySender runs two queue architectures in parallel:

**Phase 1: PostgreSQL-Based Scheduler** (for predictability)

- Handles ~30-50K emails/month per workspace
- Job scheduling via PostgreSQL with cron-like ticks
- Direct SMTP sending from the scheduler process
- Suitable for 90% of users

**Phase 2: Redis Queue with BullMQ** (for scale)

- Handles 100K+ emails/month per workspace
- Distributed job processing across multiple workers
- Advanced retry logic, delayed jobs, priority queues
- Automatically activated when volume exceeds Phase 1 capacity

**Why two phases?** Phase 1 is simpler, more predictable, and has lower infrastructure costs. Phase 2 is necessary for scale but adds complexity (Redis costs, worker management). Auto-switching gives small users simplicity and large users scale.

## Phase 1: PostgreSQL Scheduler Architecture

**Components:**

1. **Job Tables** (`campaign_send_jobs`, `warmup_jobs`)
   - Columns: `id`, `status`, `run_at`, `mailbox_id`, `prospect_id`, `attempts`
   - Indexed on `status + run_at` for fast queries
2. **Scheduler Process** (runs every 60 seconds)
   - Query: `SELECT * FROM jobs WHERE status='pending' AND run_at <= NOW() LIMIT 500`
   - Groups jobs by mailbox (for rate limiting)
   - Executes up to 400 jobs per minute across all mailboxes
3. **SMTP Connection Pool**
   - Maintains 5 connections per SMTP host
   - Connection pooler reuses connections across jobs
   - Timeout: 60 seconds per connection

**Rate limiting (Phase 1)**:

```typescript
// Per-mailbox rate limit (emails per day)
const mailboxLimit = 40

// Per-provider rate limit, emails per minute
// (shared across all mailboxes on the same provider)
const providerLimits = {
  gmail: 150,
  outlook: 100,
  other: 30
}

// Per-host concurrent SMTP connection limit
const smtpHostLimit = 5
```

**Failure handling**:

- Soft bounce (4xx SMTP code): Retry 3x with exponential backoff (5 min, 30 min, 2 hours)
- Hard bounce (5xx code): Mark failed immediately, don't retry
- Timeout (60s): Retry 3x
- After 3 failed attempts: Mark permanently failed

**Performance characteristics**:

- Throughput: 400 emails/minute = 24,000/hour = 576,000/day (theoretical max)
- Actual: 30-50K/day (rate limits + safety margins)
- Latency: Jobs execute within 1-2 minutes of `run_at`

## Phase 2: Redis Queue Architecture

When a workspace exceeds 50K emails/month, the system auto-switches to Phase 2.

**Components:**

1. **BullMQ Queues** (Redis-backed)
   - `campaign-queue`: Campaign email jobs
   - `warmup-queue`: Warmup email jobs
   - `linkedin-queue`: LinkedIn automation jobs
2. **Queue Workers** (separate processes)
   - 3-5 workers per queue
   - Each worker processes jobs concurrently (5 jobs at a time)
   - Workers auto-scale based on queue depth
3. **Sync Worker** (bridges PostgreSQL → Redis)
   - Polls PostgreSQL every 30 seconds for new pending jobs
   - Enqueues to Redis only for workspaces in Phase 2
   - Prevents duplicate enqueueing via status checks

**Why a sync worker?** PostgreSQL remains the source of truth for job state; Redis is the execution layer. The sync worker keeps them aligned.
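The duplicate-enqueue guard can be sketched as pure logic. This is an illustrative sketch, not WarmySender's actual code: an in-memory `Set` stands in for jobs already present in Redis (in production, passing a deterministic `jobId` to BullMQ's `queue.add()` provides the same idempotency), and all names are hypothetical:

```typescript
// Hypothetical shape of a pending job row pulled from PostgreSQL.
interface PendingJob { id: string; workspaceId: string; runAt: number }

class SyncWorker {
  // Stands in for "job ids already enqueued in Redis".
  private enqueued = new Set<string>()

  // Returns the jobs actually handed to the queue on this poll.
  // Jobs for Phase 1 workspaces are skipped (the PostgreSQL scheduler
  // owns them); jobs whose id was already enqueued are skipped, so
  // overlapping 30-second polls cannot double-send.
  sync(pending: PendingJob[], phase2Workspaces: Set<string>): PendingJob[] {
    const toEnqueue: PendingJob[] = []
    for (const job of pending) {
      if (!phase2Workspaces.has(job.workspaceId)) continue
      if (this.enqueued.has(job.id)) continue
      this.enqueued.add(job.id)
      toEnqueue.push(job)
    }
    return toEnqueue
  }
}
```

Because the guard keys on the job's primary key rather than its payload, a rescheduled job gets a new id and flows through normally, while a re-read of the same pending row is a no-op.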
**Rate limiting (Phase 2)**: Uses a token bucket algorithm per mailbox:

```typescript
class MailboxRateLimiter {
  tokens: Map<string, number> = new Map()
  maxTokens = 40              // daily limit per mailbox
  refillRate = 40 / (24 * 60) // tokens per minute

  async consumeToken(mailboxId: string): Promise<boolean> {
    const current = this.tokens.get(mailboxId) ?? this.maxTokens
    if (current >= 1) {
      this.tokens.set(mailboxId, current - 1)
      return true // proceed with send
    }
    return false // rate limited, delay job
  }

  refillTokens() {
    // Runs every minute
    for (const [mailboxId, tokens] of this.tokens) {
      this.tokens.set(
        mailboxId,
        Math.min(this.maxTokens, tokens + this.refillRate)
      )
    }
  }
}
```

**Fairness algorithm**: Prevents a single workspace from consuming all worker capacity:

```typescript
// Per-workspace concurrent job limit
const FAIRNESS_CAP = 50

if (activeJobsForWorkspace >= FAIRNESS_CAP) {
  await job.moveToDelayed(Date.now() + 60000) // delay 1 minute
  throw new Error('FAIRNESS_CAP_EXCEEDED')
}
```

**Performance characteristics**:

- Throughput: 1,000+ emails/minute = 60,000/hour = 1.4M/day (theoretical)
- Actual: 150-300K/day (rate limits + fairness caps)
- Latency: Jobs execute within 30-60 seconds of `run_at`

## SMTP Connection Pooling: Solving "Too Many Connections"

Our early architecture created a new SMTP connection for each email. At 100 emails/minute, this overwhelmed SMTP servers.

**Problem**: Gmail limits a single IP to 15-20 concurrent connections.
**Solution**: Connection pooling with per-host limits:

```typescript
// Mailbox and PooledConnection types are defined elsewhere.
class SMTPConnectionPool {
  pools: Map<string, PooledConnection[]> = new Map()
  limits: Map<string, number> = new Map()

  async getConnection(mailbox: Mailbox): Promise<PooledConnection> {
    const host = mailbox.smtpHost
    const limit = this.getHostLimit(host)
    let pool = this.pools.get(host) || []

    // Reuse an existing connection if one is free
    const available = pool.find(c => !c.inUse)
    if (available) {
      available.inUse = true
      return available
    }

    // Create a new connection if under the per-host limit
    if (pool.filter(c => c.inUse).length < limit) {
      const newConn = await this.createConnection(mailbox)
      newConn.inUse = true
      pool.push(newConn)
      this.pools.set(host, pool)
      return newConn
    }

    // Otherwise wait for a connection to become available
    return this.waitForConnection(host)
  }

  getHostLimit(host: string): number {
    // Gmail/Google Workspace
    if (host.includes('gmail.com') || host.includes('googlemail.com')) {
      return 15
    }
    // Outlook/Office365
    if (host.includes('outlook') || host.includes('office365')) {
      return 10
    }
    // Default for other providers
    return 5
  }
}
```

**Result**: Reduced 421 errors by 95%. Throughput increased from 150/min to 400/min.

## Provider-Level Rate Limiting: Avoiding Thundering Herds

Individual mailbox limits (40/day) prevent per-account issues. But when you have 300+ mailboxes on Gmail, even if each stays under 40/day, the aggregate can trigger provider-level rate limits.

**Problem**: 300 Gmail mailboxes × 40/day = 12,000 Gmail emails/day. Gmail's shared infrastructure sees all traffic from your sending IPs and may throttle.
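To make the aggregate concrete, here is a small hypothetical helper that sums per-provider daily volume across mailboxes; the per-provider total, not any single mailbox's cap, is what the receiving provider actually sees:

```typescript
// Hypothetical mailbox shape for illustration.
interface MailboxInfo { provider: string; dailyLimit: number }

// Sum the per-day send volume each provider receives across all mailboxes.
function aggregateDailyVolume(mailboxes: MailboxInfo[]): Map<string, number> {
  const totals = new Map<string, number>()
  for (const m of mailboxes) {
    totals.set(m.provider, (totals.get(m.provider) ?? 0) + m.dailyLimit)
  }
  return totals
}
```

Run against 300 Gmail mailboxes at 40/day, this returns 12,000/day for Gmail, which is why a second, provider-level limit is needed on top of the per-mailbox one.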
**Solution**: Dynamic provider-level limits that scale with mailbox count:

```typescript
function getProviderMinuteLimit(
  provider: string,
  mailboxCount: number
): number {
  const baseLimits = {
    gmail: 150, // emails per minute across ALL Gmail mailboxes
    outlook: 100,
    yahoo: 50,
    other: 30
  }
  const baseLimit = baseLimits[provider] || baseLimits.other

  // Scale the limit with mailbox count
  const scaleFactor = Math.ceil(mailboxCount / 150)
  return baseLimit * scaleFactor
}

// Example:
// 50 Gmail mailboxes:  150/min (no scaling needed)
// 300 Gmail mailboxes: 300/min (2x scale factor)
// 600 Gmail mailboxes: 600/min (4x scale factor)
```

**Jitter injection** prevents synchronized retries (thundering herd):

```typescript
// When a rate limit is hit, delay by a random amount
const delayMs = Math.random() * (5 * 60 * 1000) // 0-5 minutes
await job.moveToDelayed(Date.now() + delayMs)
```

**Result**: Eliminated rate limit cascades. Previously, 371 Hostinger mailboxes would hit limits simultaneously and retry together 60 seconds later, creating an infinite loop. Jitter spreads retries over 5 minutes.

## Timeout Management: Preventing Scheduler Hangs

SMTP operations can hang indefinitely if the remote server stops responding. Without timeouts, the scheduler process deadlocks.
**Three-layer timeout architecture:**

**Layer 1: SMTP Connection Timeout (60s)**

```typescript
const transporter = nodemailer.createTransport({
  host: mailbox.smtpHost,
  port: mailbox.smtpPort,
  connectionTimeout: 60000, // 60s
  greetingTimeout: 30000,   // 30s
  socketTimeout: 60000      // 60s
})
```

**Layer 2: Per-Mailbox Timeout (2 minutes)**

```typescript
// Wrap the entire send operation
async function sendWithTimeout(mailbox, message) {
  const timeoutPromise = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('MAILBOX_TIMEOUT')), 120000)
  )
  const sendPromise = sendEmail(mailbox, message)
  return Promise.race([sendPromise, timeoutPromise])
}
```

**Layer 3: Task-Level Timeout (5 minutes)**

```typescript
// The entire scheduler tick must complete in 5 minutes
async function schedulerTick() {
  const tickTimeout = setTimeout(() => {
    logger.error('Scheduler tick exceeded 5 minutes')
    process.exit(1) // Force restart
  }, 300000)

  try {
    await executePendingJobs()
  } finally {
    clearTimeout(tickTimeout)
  }
}
```

**Result**: Eliminated scheduler hangs completely. Previously, one unresponsive SMTP server could freeze the entire scheduler for hours.

## Recovery Systems: Handling Failures Gracefully

At scale, failures are constant. Infrastructure must recover automatically.

### 1. Stuck Job Recovery

Jobs can get stuck in "processing" status if a worker crashes mid-execution.
**Recovery logic** (runs every 5 minutes):

```typescript
async function recoverStuckJobs() {
  const cutoff = new Date(Date.now() - 10 * 60 * 1000) // 10 min ago

  const stuckJobs = await db
    .select()
    .from(campaignSendJobs)
    .where(
      and(
        eq(campaignSendJobs.status, 'processing'),
        lt(campaignSendJobs.runAt, cutoff)
      )
    )

  for (const job of stuckJobs) {
    if (job.attempts >= 3) {
      // Permanent failure after 3 attempts
      await db.update(campaignSendJobs)
        .set({ status: 'failed' })
        .where(eq(campaignSendJobs.id, job.id))
    } else {
      // Retry in 5 minutes
      await db.update(campaignSendJobs)
        .set({
          status: 'pending',
          runAt: new Date(Date.now() + 5 * 60 * 1000),
          attempts: job.attempts + 1
        })
        .where(eq(campaignSendJobs.id, job.id))
    }
  }
}
```

### 2. Mailbox Health Checks

Mailboxes can become unusable mid-campaign (password changed, OAuth token expired).

**Health check** (runs before each send attempt):

```typescript
async function ensureMailboxUsable(mailboxId: string): Promise<boolean> {
  const mailbox = await getMailbox(mailboxId)

  // Check status
  if (mailbox.status !== 'connected') {
    logger.warn(`Mailbox ${mailboxId} status: ${mailbox.status}`)
    return false
  }

  // Check OAuth token expiry (for OAuth mailboxes)
  if (mailbox.authType === 'oauth' && mailbox.accessTokenExpiresAt) {
    if (new Date(mailbox.accessTokenExpiresAt) < new Date()) {
      await refreshOAuthToken(mailbox)
    }
  }

  return true
}
```

### 3. Circuit Breaker for Bad SMTP Hosts

Some SMTP servers become unresponsive (network issues, DDoS, maintenance). Continuing to send to them wastes worker capacity.
**Circuit breaker pattern**:

```typescript
interface BreakerState {
  failures: number
  openedAt: Date | null
}

const HOST_CIRCUIT_BREAKER = new Map<string, BreakerState>()

async function sendWithCircuitBreaker(mailbox, message) {
  const host = mailbox.smtpHost
  const breaker = HOST_CIRCUIT_BREAKER.get(host) ||
    { failures: 0, openedAt: null }

  // If the circuit is open, skip the send
  if (breaker.openedAt) {
    const minutesSinceOpen =
      (Date.now() - breaker.openedAt.getTime()) / 60000
    if (minutesSinceOpen < 5) {
      throw new Error('CIRCUIT_BREAKER_OPEN')
    } else {
      // Half-open: try again after 5 minutes
      breaker.openedAt = null
      breaker.failures = 0
    }
  }

  try {
    await sendEmail(mailbox, message)
    breaker.failures = 0 // Reset on success
  } catch (error) {
    breaker.failures++
    if (breaker.failures >= 5) {
      breaker.openedAt = new Date()
      logger.warn(`Circuit breaker OPEN for ${host}`)
    }
    throw error
  } finally {
    // Persist breaker state even when the send throws
    HOST_CIRCUIT_BREAKER.set(host, breaker)
  }
}
```

**Result**: Reduced wasted timeout attempts by 80%. Bad hosts are detected within 5 minutes instead of consuming worker capacity for hours.

## Data Retention: Managing 1M+ Job Records

At 1M+ emails monthly, job tables grow to millions of rows. Queries slow down without aggressive cleanup.
**Retention policy**:

- Keep `completed` jobs for 7 days
- Keep `failed` jobs for 30 days (for debugging)
- Delete `expired` jobs immediately (rescheduled, no longer relevant)

**Cleanup job** (runs every 5 minutes):

```typescript
async function cleanupOldJobs() {
  // Delete completed jobs older than 7 days
  await db.delete(campaignSendJobs)
    .where(
      and(
        eq(campaignSendJobs.status, 'completed'),
        lt(
          campaignSendJobs.scheduledDayUtc,
          new Date(Date.now() - 7 * 24 * 60 * 60 * 1000)
        )
      )
    )

  // Delete failed jobs older than 30 days
  await db.delete(campaignSendJobs)
    .where(
      and(
        eq(campaignSendJobs.status, 'failed'),
        lt(
          campaignSendJobs.scheduledDayUtc,
          new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)
        )
      )
    )

  // Delete expired jobs immediately
  await db.delete(campaignSendJobs)
    .where(eq(campaignSendJobs.status, 'expired'))
}
```

**Indexing strategy**:

```sql
-- Fast cleanup queries
CREATE INDEX idx_jobs_cleanup
ON campaign_send_jobs(status, scheduled_day_utc);

-- Fast scheduler queries
CREATE INDEX idx_jobs_pending
ON campaign_send_jobs(status, run_at)
WHERE status = 'pending';
```

**Result**: Cleanup deletes 50K-100K rows per day in under 2 seconds. Scheduler queries stay under 100ms even with 5M+ total job records.

## Monitoring and Alerting

Infrastructure at scale requires real-time visibility.

**Key metrics tracked**:

- Queue depth (pending jobs per queue)
- Worker throughput (jobs/minute per worker)
- Error rates (by error type: bounce, timeout, auth failure)
- SMTP latency (p50, p95, p99)
- Rate limit hits (by provider)

**Alerts configured**:

- Queue depth > 10,000 for 10+ minutes (capacity issue)
- Error rate > 5% for 5+ minutes (systematic failure)
- Worker throughput drops by 50% (worker crash)
- SMTP p99 latency > 30 seconds (network issue)

**Dashboard**: A real-time Grafana dashboard shows all metrics at 1-minute resolution.

## Lessons Learned: What We'd Do Differently

**1. Start with the Phase 2 architecture**

We built Phase 1 first, thinking it would scale longer.
It didn't. Migrating to Phase 2 took 6 months. If we were starting today, we'd use Redis from day one.

**2. Circuit breakers from the beginning**

We added circuit breakers only after an outage, six months in, caused by 3 unresponsive SMTP hosts. They should have been in the initial architecture.

**3. More aggressive data retention**

Our first retention policy kept jobs for 90 days. This caused table bloat (20M+ rows). Reducing retention to 7-30 days improved query speed 10x.

## Conclusion: Infrastructure Before Features

Most cold email tools focus on features (AI personalization, A/B testing, analytics). Few invest in infrastructure that scales reliably.

WarmySender's infrastructure handles 1M+ monthly emails because we prioritized:

- Two-phase queue architecture (simplicity for small users, scale for large)
- Multi-layer rate limiting (mailbox, provider, SMTP host)
- Comprehensive timeout management (SMTP, mailbox, task-level)
- Automatic recovery systems (stuck jobs, bad mailboxes, circuit breakers)
- Aggressive data retention (keep tables lean)

The result? 99.8% uptime over the past 12 months, even while processing 40M+ emails.

**Scaling your own email infrastructure?** The patterns above work for any system sending 100K+ emails monthly. Start with rate limiting and timeouts: those two prevent 80% of scaling failures.

**Want infrastructure that scales without the engineering effort?** Try [WarmySender.com](https://warmysender.com) - all the scale, none of the operational overhead.