# Scaling Email Infrastructure: How We Handle 1M+ Emails Monthly
Sending 1,000 cold emails per month is straightforward: connect a mailbox, use any cold email tool, and you're done. Scaling to 100,000+ emails monthly introduces entirely different technical challenges: rate limits, IP reputation management, queue architecture, and failure recovery.
WarmySender processes 1.2M+ emails monthly across warmup, campaigns, and LinkedIn automation. Building infrastructure that handles this volume reliably required solving dozens of scaling problems.
This deep-dive shares the technical architecture, rate limiting strategies, and hard-learned lessons from scaling email infrastructure 100x over 3 years.
## The Three Scaling Bottlenecks
Most email infrastructure hits breaking points at predictable volumes:
**Bottleneck 1: SMTP Connection Limits (10-20K emails/month)**
SMTP servers limit concurrent connections per host. Gmail allows 15-20 concurrent connections per sending server. Exceed this, and you get `421 4.7.0 Too Many Connections` errors.
**Bottleneck 2: Provider Rate Limits (50-100K emails/month)**
ISPs rate-limit per sending domain. Gmail's initial limit is ~500 emails/day for new domains, scaling to 2,000-10,000/day for established senders. Cross this threshold prematurely, and soft bounces cascade.
**Bottleneck 3: Queue Architecture (100K+ emails/month)**
Basic queue systems (database polling) fail at 100K+ monthly volume. Database locks, orphaned jobs, and thundering herd problems emerge. You need distributed queue architecture (Redis + BullMQ).
WarmySender solved all three. Here's how.
## Architecture Overview: Two-Phase Queue System
WarmySender runs two queue architectures in parallel:
**Phase 1: PostgreSQL-Based Scheduler** (for predictability)
- Handles ~30-50K emails/month per workspace
- Job scheduling via PostgreSQL with cron-like ticks
- Direct SMTP sending from scheduler process
- Suitable for 90% of users
**Phase 2: Redis Queue with BullMQ** (for scale)
- Handles 100K+ emails/month per workspace
- Distributed job processing across multiple workers
- Advanced retry logic, delayed jobs, priority queues
- Automatically activated when volume exceeds Phase 1 capacity
**Why two phases?** Phase 1 is simpler, more predictable, and has lower infrastructure costs. Phase 2 is necessary for scale but adds complexity (Redis costs, worker management). Auto-switching gives small users simplicity and large users scale.
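The switching rule itself is simple: route a workspace by its monthly volume against Phase 1's capacity. A minimal sketch (the function and constant names are illustrative, not WarmySender's actual code):

```typescript
// Hypothetical sketch of the auto-switching decision. The 50K threshold
// reflects the Phase 1 capacity described above.
type QueuePhase = 'postgres-scheduler' | 'redis-bullmq'

const PHASE1_MONTHLY_CAPACITY = 50_000

function selectQueuePhase(monthlyEmailVolume: number): QueuePhase {
  return monthlyEmailVolume > PHASE1_MONTHLY_CAPACITY
    ? 'redis-bullmq'
    : 'postgres-scheduler'
}
```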
## Phase 1: PostgreSQL Scheduler Architecture
**Components:**
1. **Job Tables** (`campaign_send_jobs`, `warmup_jobs`)
- Columns: `id`, `status`, `run_at`, `mailbox_id`, `prospect_id`, `attempts`
- Indexed on `status + run_at` for fast queries
2. **Scheduler Process** (runs every 60 seconds)
- Query: `SELECT * FROM jobs WHERE status='pending' AND run_at <= NOW() LIMIT 500`
- Group jobs by mailbox (rate limiting)
- Execute up to 400 jobs per minute across all mailboxes
3. **SMTP Connection Pool**
- Maintain 5 connections per SMTP host
- Connection pooler reuses connections across jobs
- Timeout: 60 seconds per connection
**Rate limiting (Phase 1)**:
```typescript
// Per-mailbox rate limit: 40 emails per day
const MAILBOX_DAILY_LIMIT = 40

// Per-provider rate limits (emails/minute across all mailboxes on a provider)
const PROVIDER_MINUTE_LIMITS = {
  gmail: 150,
  outlook: 100,
  other: 30
}

// Per-host SMTP connection limit: 5 concurrent connections
const SMTP_HOST_CONNECTION_LIMIT = 5
```
**Failure handling**:
- Soft bounce (4xx SMTP code): Retry 3x with exponential backoff (5 min, 30 min, 2 hours)
- Hard bounce (5xx code): Mark failed immediately, don't retry
- Timeout (60s): Retry 3x
- After 3 failed attempts: Mark permanently failed
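The bounce-handling rules above reduce to a small decision function. A sketch under the stated policy (names are illustrative; timeouts follow the same retry schedule as soft bounces):

```typescript
// Backoff schedule from the policy above: 5 min, 30 min, 2 hours
const RETRY_DELAYS_MS = [5 * 60_000, 30 * 60_000, 2 * 60 * 60_000]

type RetryDecision =
  | { action: 'retry'; delayMs: number }
  | { action: 'fail' }

function decideRetry(smtpCode: number, attempts: number): RetryDecision {
  // 4xx replies are transient (soft bounce); 5xx are permanent (hard bounce)
  const isSoftBounce = smtpCode >= 400 && smtpCode < 500
  if (!isSoftBounce) return { action: 'fail' } // hard bounce: never retry
  if (attempts >= RETRY_DELAYS_MS.length) return { action: 'fail' }
  return { action: 'retry', delayMs: RETRY_DELAYS_MS[attempts] }
}
```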
**Performance characteristics**:
- Throughput: 400 emails/minute = 24,000/hour = 576,000/day (theoretical max)
- Actual: 30-50K/day (rate limits + safety margins)
- Latency: Jobs execute within 1-2 minutes of `run_at` time
## Phase 2: Redis Queue Architecture
When workspaces exceed 50K emails/month, system auto-switches to Phase 2.
**Components:**
1. **BullMQ Queues** (Redis-backed)
- `campaign-queue`: Campaign email jobs
- `warmup-queue`: Warmup email jobs
- `linkedin-queue`: LinkedIn automation jobs
2. **Queue Workers** (separate processes)
- 3-5 workers per queue
- Each worker processes jobs concurrently (5 jobs at a time)
- Workers auto-scale based on queue depth
3. **Sync Worker** (bridges PostgreSQL → Redis)
- Polls PostgreSQL every 30 seconds for new pending jobs
- Enqueues to Redis only for workspaces in Phase 2
- Prevents duplicate enqueueing via status checks
**Why sync worker?** PostgreSQL remains source of truth for job state. Redis is execution layer. Sync worker keeps them aligned.
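The sync worker's dedup step can be sketched as a pure filter over the polled rows. The `enqueuedToRedis` flag and function name here are illustrative; in BullMQ, passing the PostgreSQL job id as the `jobId` option also makes the enqueue idempotent, since a second `queue.add()` with the same `jobId` is ignored:

```typescript
// Hypothetical row shape mirroring the PostgreSQL job tables
interface JobRow {
  id: string
  status: string
  enqueuedToRedis: boolean
}

// Only pending rows not yet mirrored to Redis get enqueued;
// rows already claimed or completed are skipped.
function rowsToEnqueue(rows: JobRow[]): JobRow[] {
  return rows.filter(r => r.status === 'pending' && !r.enqueuedToRedis)
}
```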
**Rate limiting (Phase 2)**:
Uses token bucket algorithm per mailbox:
```typescript
class MailboxRateLimiter {
  private tokens: Map<string, number> = new Map()
  private readonly maxTokens = 40 // daily limit per mailbox
  private readonly refillRate = 40 / (24 * 60) // tokens per minute

  async consumeToken(mailboxId: string): Promise<boolean> {
    const current = this.tokens.get(mailboxId) ?? this.maxTokens
    if (current >= 1) {
      this.tokens.set(mailboxId, current - 1)
      return true // proceed with send
    }
    return false // rate limited, delay job
  }

  refillTokens() {
    // Runs every minute
    for (const [mailboxId, tokens] of this.tokens) {
      this.tokens.set(
        mailboxId,
        Math.min(this.maxTokens, tokens + this.refillRate)
      )
    }
  }
}
```
**Fairness algorithm**: Prevents single workspace from consuming all worker capacity:
```typescript
// Per-workspace concurrent job limit
const FAIRNESS_CAP = 50
if (activeJobsForWorkspace >= FAIRNESS_CAP) {
await job.moveToDelayed(Date.now() + 60000) // delay 1 minute
throw new Error('FAIRNESS_CAP_EXCEEDED')
}
```
**Performance characteristics**:
- Throughput: 1,000+ emails/minute = 60,000/hour = 1.4M/day (theoretical)
- Actual: 150-300K/day (rate limits + fairness caps)
- Latency: Jobs execute within 30-60 seconds of `run_at` time
## SMTP Connection Pooling: Solving "Too Many Connections"
Early architecture created new SMTP connection for each email. At 100 emails/minute, this overwhelmed SMTP servers.
**Problem**: Gmail limits to 15-20 concurrent connections from single IP.
**Solution**: Connection pooling with per-host limits:
```typescript
interface PooledConnection {
  inUse: boolean
  // ...underlying SMTP transport
}

class SMTPConnectionPool {
  pools: Map<string, PooledConnection[]> = new Map()

  async getConnection(mailbox: Mailbox): Promise<PooledConnection> {
    const host = mailbox.smtpHost
    const limit = this.getHostLimit(host)
    const pool = this.pools.get(host) || []
    // Reuse an existing idle connection if available
    const available = pool.find(c => !c.inUse)
    if (available) {
      available.inUse = true
      return available
    }
    // Create a new connection if under the per-host limit
    if (pool.filter(c => c.inUse).length < limit) {
      const newConn = await this.createConnection(mailbox)
      newConn.inUse = true
      pool.push(newConn)
      this.pools.set(host, pool)
      return newConn
    }
    // Otherwise wait for a connection to become available
    return this.waitForConnection(host)
  }

  getHostLimit(host: string): number {
    // Gmail / Google Workspace
    if (host.includes('gmail.com') || host.includes('googlemail.com')) {
      return 15
    }
    // Outlook / Office 365
    if (host.includes('outlook') || host.includes('office365')) {
      return 10
    }
    // Default for other providers
    return 5
  }
}
```
**Result**: Reduced 421 errors by 95%. Throughput increased from 150/min to 400/min.
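The `waitForConnection` call in the pool above isn't shown. One possible shape is a poll-until-free loop with a hard deadline, so a saturated host can't block a worker indefinitely (a generic sketch; names and timings are assumptions, not the production implementation):

```typescript
// Poll tryAcquire until it yields a resource or the deadline passes.
// tryAcquire would, e.g., scan the host's pool for an idle connection.
async function waitFor<T>(
  tryAcquire: () => T | undefined,
  timeoutMs = 60_000,
  pollMs = 250
): Promise<T> {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    const resource = tryAcquire()
    if (resource !== undefined) return resource
    await new Promise(resolve => setTimeout(resolve, pollMs))
  }
  throw new Error('POOL_WAIT_TIMEOUT')
}
```

A production version would likely use an event or promise queue instead of polling, but the deadline is the important part: it bounds how long a send job can wait on a busy host.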
## Provider-Level Rate Limiting: Avoiding Thundering Herds
Individual mailbox limits (40/day) prevent per-account issues. But when you have 300+ mailboxes on Gmail, even if each stays under 40/day, the aggregate can trigger provider-level rate limits.
**Problem**: 300 Gmail mailboxes × 40/day = 12,000 Gmail emails/day. Gmail's shared infrastructure sees all traffic from your sending IPs and may throttle.
**Solution**: Dynamic provider-level limits that scale with mailbox count:
```typescript
function getProviderMinuteLimit(
  provider: string,
  mailboxCount: number
): number {
  const baseLimits: Record<string, number> = {
    gmail: 150, // emails per minute across ALL Gmail mailboxes
    outlook: 100,
    yahoo: 50,
    other: 30
  }
  const baseLimit = baseLimits[provider] ?? baseLimits.other
  // Scale the limit with mailbox count: one extra "unit" per 150 mailboxes
  const scaleFactor = Math.ceil(mailboxCount / 150)
  return baseLimit * scaleFactor
}
// Example:
// 50 Gmail mailboxes: 150/min (no scaling needed)
// 300 Gmail mailboxes: 300/min (2x scale factor)
// 600 Gmail mailboxes: 600/min (4x scale factor)
```
**Jitter injection** prevents synchronized retries (thundering herd):
```typescript
// When rate limit hit, delay randomly
const delayMs = Math.random() * (5 * 60 * 1000) // 0-5 minutes
await job.moveToDelayed(Date.now() + delayMs)
```
**Result**: Eliminated rate limit cascades. Previously, 371 Hostinger mailboxes would hit limits simultaneously and retry together 60 seconds later, creating an infinite retry loop. Jitter spreads those retries across a 5-minute window.
## Timeout Management: Preventing Scheduler Hangs
SMTP operations can hang indefinitely if remote server stops responding. Without timeouts, scheduler process deadlocks.
**Three-layer timeout architecture:**
**Layer 1: SMTP Connection Timeout (60s)**
```typescript
const transporter = nodemailer.createTransport({
host: mailbox.smtpHost,
port: mailbox.smtpPort,
connectionTimeout: 60000, // 60s
greetingTimeout: 30000, // 30s
socketTimeout: 60000 // 60s
})
```
**Layer 2: Per-Mailbox Timeout (2 minutes)**
```typescript
// Wrap the entire send operation in a hard 2-minute deadline
async function sendWithTimeout(mailbox: Mailbox, message: Message) {
  let timer: NodeJS.Timeout | undefined
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('MAILBOX_TIMEOUT')), 120000)
  })
  try {
    return await Promise.race([sendEmail(mailbox, message), timeoutPromise])
  } finally {
    clearTimeout(timer) // don't leave the timer pending after the race
  }
}
```
**Layer 3: Task-Level Timeout (5 minutes)**
```typescript
// Entire scheduler tick must complete in 5 minutes
async function schedulerTick() {
const tickTimeout = setTimeout(() => {
logger.error('Scheduler tick exceeded 5 minutes')
process.exit(1) // Force restart
}, 300000)
try {
await executePendingJobs()
} finally {
clearTimeout(tickTimeout)
}
}
```
**Result**: Eliminated scheduler hangs completely. Previously, 1 unresponsive SMTP server could freeze entire scheduler for hours.
## Recovery Systems: Handling Failures Gracefully
At scale, failures are constant. Infrastructure must recover automatically.
### 1. Stuck Job Recovery
Jobs can get stuck in "processing" status if worker crashes mid-execution.
**Recovery logic** (runs every 5 minutes):
```typescript
async function recoverStuckJobs() {
const cutoff = new Date(Date.now() - 10 * 60 * 1000) // 10 min ago
const stuckJobs = await db
.select()
.from(campaignSendJobs)
.where(
and(
eq(campaignSendJobs.status, 'processing'),
lt(campaignSendJobs.runAt, cutoff)
)
)
for (const job of stuckJobs) {
if (job.attempts >= 3) {
// Permanent failure after 3 attempts
await db.update(campaignSendJobs)
.set({ status: 'failed' })
.where(eq(campaignSendJobs.id, job.id))
} else {
// Retry
await db.update(campaignSendJobs)
.set({
status: 'pending',
runAt: new Date(Date.now() + 5 * 60 * 1000),
attempts: job.attempts + 1
})
.where(eq(campaignSendJobs.id, job.id))
}
}
}
```
### 2. Mailbox Health Checks
Mailboxes can become unusable (password changed, OAuth token expired) mid-campaign.
**Health check** (runs before each send attempt):
```typescript
async function ensureMailboxUsable(mailboxId: string): Promise<boolean> {
const mailbox = await getMailbox(mailboxId)
// Check status
if (mailbox.status !== 'connected') {
logger.warn(`Mailbox ${mailboxId} status: ${mailbox.status}`)
return false
}
// Check OAuth token expiry (for OAuth mailboxes)
if (mailbox.authType === 'oauth' && mailbox.accessTokenExpiresAt) {
if (new Date(mailbox.accessTokenExpiresAt) < new Date()) {
await refreshOAuthToken(mailbox)
}
}
return true
}
```
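The `refreshOAuthToken` call above isn't shown. For Google-hosted mailboxes, a refresh is a single POST to Google's OAuth 2.0 token endpoint; a hedged sketch (the `OAuthMailbox` shape is an assumption, and error handling is minimal):

```typescript
// Hypothetical mailbox credential shape; field names are illustrative.
interface OAuthMailbox {
  refreshToken: string
  clientId: string
  clientSecret: string
}

// Exchange a long-lived refresh token for a fresh access token
// (endpoint and parameters per Google's OAuth 2.0 documentation).
async function refreshOAuthToken(mailbox: OAuthMailbox) {
  const res = await fetch('https://oauth2.googleapis.com/token', {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams({
      grant_type: 'refresh_token',
      refresh_token: mailbox.refreshToken,
      client_id: mailbox.clientId,
      client_secret: mailbox.clientSecret
    })
  })
  if (!res.ok) throw new Error(`TOKEN_REFRESH_FAILED: ${res.status}`)
  return res.json() // { access_token, expires_in, ... }
}
```

A real implementation would also persist the new token and expiry back to the mailbox record, and mark the mailbox `disconnected` if the refresh itself fails.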
### 3. Circuit Breaker for Bad SMTP Hosts
Some SMTP servers become unresponsive (network issues, DDoS, maintenance). Continuing to send wastes worker capacity.
**Circuit breaker pattern**:
```typescript
const HOST_CIRCUIT_BREAKER = new Map<
  string,
  { failures: number; openedAt: Date | null }
>()

async function sendWithCircuitBreaker(mailbox: Mailbox, message: Message) {
  const host = mailbox.smtpHost
  let breaker = HOST_CIRCUIT_BREAKER.get(host)
  if (!breaker) {
    breaker = { failures: 0, openedAt: null }
    HOST_CIRCUIT_BREAKER.set(host, breaker) // register before any early throw
  }
// If circuit open, skip send
if (breaker.openedAt) {
const minutesSinceOpen =
(Date.now() - breaker.openedAt.getTime()) / 60000
if (minutesSinceOpen < 5) {
throw new Error('CIRCUIT_BREAKER_OPEN')
} else {
// Try again after 5 minutes
breaker.openedAt = null
breaker.failures = 0
}
}
try {
await sendEmail(mailbox, message)
breaker.failures = 0 // Reset on success
} catch (error) {
breaker.failures++
if (breaker.failures >= 5) {
breaker.openedAt = new Date()
logger.warn(`Circuit breaker OPEN for ${host}`)
}
throw error
}
HOST_CIRCUIT_BREAKER.set(host, breaker)
}
```
**Result**: Reduced wasted timeout attempts by 80%. Bad hosts detected within 5 minutes instead of consuming worker capacity for hours.
## Data Retention: Managing 1M+ Job Records
At 1M+ emails monthly, job tables grow to millions of rows. Queries slow without aggressive cleanup.
**Retention policy**:
- Keep `completed` jobs for 7 days
- Keep `failed` jobs for 30 days (for debugging)
- Delete immediately: `expired` jobs (rescheduled, no longer relevant)
**Cleanup job** (runs every 5 minutes):
```typescript
async function cleanupOldJobs() {
// Delete completed jobs older than 7 days
await db.delete(campaignSendJobs)
.where(
and(
eq(campaignSendJobs.status, 'completed'),
lt(
campaignSendJobs.scheduledDayUtc,
new Date(Date.now() - 7 * 24 * 60 * 60 * 1000)
)
)
)
// Delete failed jobs older than 30 days
await db.delete(campaignSendJobs)
.where(
and(
eq(campaignSendJobs.status, 'failed'),
lt(
campaignSendJobs.scheduledDayUtc,
new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)
)
)
)
// Delete expired jobs immediately
await db.delete(campaignSendJobs)
.where(eq(campaignSendJobs.status, 'expired'))
}
```
**Indexing strategy**:
```sql
-- Fast cleanup queries
CREATE INDEX idx_jobs_cleanup
ON campaign_send_jobs(status, scheduled_day_utc);
-- Fast scheduler queries
CREATE INDEX idx_jobs_pending
ON campaign_send_jobs(status, run_at)
WHERE status = 'pending';
```
**Result**: Cleanup deletes 50K-100K rows per day in <2 seconds. Scheduler queries stay under 100ms even with 5M+ total job records.
## Monitoring and Alerting
Infrastructure at scale requires real-time visibility.
**Key metrics tracked**:
- Queue depth (pending jobs per queue)
- Worker throughput (jobs/minute per worker)
- Error rates (by error type: bounce, timeout, auth failure)
- SMTP latency (p50, p95, p99)
- Rate limit hits (by provider)
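The latency percentiles above (p50, p95, p99) can be computed from raw SMTP send durations with the nearest-rank method. A minimal sketch (the helper name is illustrative; a production system would use its metrics library's histogram instead):

```typescript
// Nearest-rank percentile: sort the samples, take the value at
// ceil(p% * n), 1-indexed. Deterministic and simple, at the cost of
// re-sorting per call.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples')
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, rank - 1)]
}
```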
**Alerts configured**:
- Queue depth > 10,000 for 10+ minutes (capacity issue)
- Error rate > 5% for 5+ minutes (systematic failure)
- Worker throughput drops by 50% (worker crash)
- SMTP p99 latency > 30 seconds (network issue)
**Dashboard**: Real-time Grafana dashboard shows all metrics with 1-minute resolution.
## Lessons Learned: What We'd Do Differently
**1. Start with Phase 2 architecture**
We built Phase 1 first, thinking it would scale longer. It didn't. Migrating to Phase 2 took 6 months. If starting today, we'd use Redis from day one.
**2. Circuit breakers from the beginning**
We added circuit breakers only after an outage, six months in, caused by 3 unresponsive SMTP hosts. They should have been part of the initial architecture.
**3. More aggressive data retention**
Our first retention policy kept jobs for 90 days. This caused table bloat (20M+ rows). Reducing to 7-30 days improved query speed 10x.
## Conclusion: Infrastructure Before Features
Most cold email tools focus on features (AI personalization, A/B testing, analytics). Few invest in infrastructure that scales reliably.
WarmySender's infrastructure handles 1M+ monthly emails because we prioritized:
- Two-phase queue architecture (simplicity for small users, scale for large)
- Multi-layer rate limiting (mailbox, provider, SMTP host)
- Comprehensive timeout management (SMTP, mailbox, task-level)
- Automatic recovery systems (stuck jobs, bad mailboxes, circuit breakers)
- Aggressive data retention (keep tables lean)
The result? 99.8% uptime over the past 12 months, even while processing 40M+ emails.
**Scaling your own email infrastructure?** The patterns above work for any system sending 100K+ emails monthly. Start with rate limiting and timeouts; those two prevent 80% of scaling failures.
**Want infrastructure that scales without the engineering effort?** Try [WarmySender.com](https://warmysender.com) - all the scale, none of the operational overhead.