Architecture Decisions
This page documents the key design decisions in Mailman, what alternatives were considered, and why we chose what we did. If you're wondering "why does it work this way?", start here.
Hexagonal Architecture (Ports & Adapters)
Decision: Domain logic lives in core with zero I/O dependencies. All external interactions go through trait boundaries.
Why: Email infrastructure has a lot of external dependencies — Postgres, S3, SES, SQS, Redis, ClamAV. If domain logic is coupled to any of these, testing becomes slow and brittle, and swapping implementations requires touching business logic.
Rules:
- `core` never imports `sqlx`, `aws-sdk-*`, `reqwest`, or any I/O crate
- Repository traits are defined in `core/src/repository.rs`, implemented in `adapters-*` crates
- Route handlers in `http` call trait methods, never concrete adapter types
- Binary crates (`bin-api`, `bin-worker`) are the only place where concrete types are wired together
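These rules can be sketched in miniature (all names here are illustrative, not Mailman's actual types): `core` defines the trait and the domain logic, an adapter implements it, and only a binary crate would wire the concrete type in.

```rust
use std::collections::HashMap;

// --- core: no I/O, only a trait boundary ---
pub trait MessageRepository {
    fn save(&mut self, id: u64, body: String);
    fn find(&self, id: u64) -> Option<&String>;
}

// Domain logic depends on the trait, never on a concrete store.
pub fn ingest(repo: &mut dyn MessageRepository, id: u64, raw: &str) {
    // (real code would parse MIME, assign threads, etc.)
    repo.save(id, raw.to_string());
}

// --- adapter: a stand-in for the Postgres implementation ---
pub struct InMemoryRepository {
    rows: HashMap<u64, String>,
}

impl InMemoryRepository {
    pub fn new() -> Self {
        Self { rows: HashMap::new() }
    }
}

impl MessageRepository for InMemoryRepository {
    fn save(&mut self, id: u64, body: String) {
        self.rows.insert(id, body);
    }
    fn find(&self, id: u64) -> Option<&String> {
        self.rows.get(&id)
    }
}
```

Because `ingest` only sees `dyn MessageRepository`, swapping the in-memory store for Postgres is invisible to it.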
What this means in practice: When you write a new feature, the domain logic and the database query are in different crates. This feels like overhead for simple CRUD, but it pays off when:
- Integration tests run with real Postgres but mocked AWS (fast, no AWS costs)
- You need to change how S3 keys are structured without touching threading logic
- A new adapter (e.g., swapping ClamAV for a different scanner) requires zero changes to `core`
What we rejected: Putting everything in one crate with feature flags for test mocking. This is simpler initially but makes it impossible to enforce boundaries — a single `use aws_sdk_s3` in the wrong file breaks the architecture silently.
Customer-Signed JWTs (Not Server-Issued)
Decision: Customers generate their own JWTs with their private key and register the corresponding public key with Mailman.
Why: Mailman is infrastructure, not an identity provider. Customers already have auth systems. Issuing tokens would mean managing sessions, refresh flows, and password resets — all out of scope. The customer-signed model means:
- Mailman never sees private keys
- Token issuance is the customer's responsibility (their auth service, their CI/CD, etc.)
- Key rotation is a public key swap, not a coordinated secret rotation
- Multiple services can share a signing key or have separate keys per environment
Fallback: API keys are still supported for backward compatibility and simple integrations where JWT is overkill.
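The trust model can be sketched as follows (the types and the injected `verify` closure are hypothetical stand-ins; real verification would call an Ed25519/RSA library): Mailman holds only public keys, and key rotation is just a map update.

```rust
use std::collections::HashMap;

// Mailman's side of the model: public keys only, never private keys.
pub struct PublicKeyRegistry {
    // customer id -> registered public key bytes
    keys: HashMap<String, Vec<u8>>,
}

impl PublicKeyRegistry {
    pub fn new() -> Self {
        Self { keys: HashMap::new() }
    }

    // Key rotation is just replacing this entry with a new public key.
    pub fn register(&mut self, customer: &str, public_key: Vec<u8>) {
        self.keys.insert(customer.to_string(), public_key);
    }

    // `verify` stands in for a real signature check (e.g. Ed25519).
    pub fn check_token(
        &self,
        customer: &str,
        token: &[u8],
        verify: impl Fn(&[u8], &[u8]) -> bool,
    ) -> bool {
        match self.keys.get(customer) {
            Some(key) => verify(key.as_slice(), token),
            None => false, // unknown customer: reject
        }
    }
}
```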
Queued Outbound Delivery (Not Inline SES)
Decision: POST /send stores the message and enqueues a delivery job to SQS. The worker does the actual SES send.
Why: Inline SES calls from the API handler create several problems:
- Latency: SES calls take 100-500ms, which makes the API slow for the caller
- Reliability: If SES is temporarily down, sends fail and the caller has to retry
- Observability: With queued delivery, every message has a tracked lifecycle (pending → sent → delivered/bounced)
- Rate control: The worker can enforce volume limits and back off on SES throttling without blocking API callers
Tradeoff: The caller gets 202 Accepted with status: "pending", not immediate confirmation. They must use webhooks or poll for delivery status.
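The split can be sketched like this (all types are illustrative; the real queue is SQS and the real send is SES): the handler persists the message as pending and enqueues a job, and only the worker performs the slow, fallible send.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
pub enum Status {
    Pending,
    Sent,
}

pub struct DeliveryJob {
    pub message_id: u64,
}

// Stand-in for the SQS client behind a trait boundary.
pub trait JobQueue {
    fn enqueue(&mut self, job: DeliveryJob);
    fn dequeue(&mut self) -> Option<DeliveryJob>;
}

pub struct InMemoryQueue(pub VecDeque<DeliveryJob>);

impl JobQueue for InMemoryQueue {
    fn enqueue(&mut self, job: DeliveryJob) {
        self.0.push_back(job);
    }
    fn dequeue(&mut self) -> Option<DeliveryJob> {
        self.0.pop_front()
    }
}

// API handler: store as pending, enqueue, return immediately (202 Accepted).
pub fn handle_send(queue: &mut dyn JobQueue, message_id: u64) -> Status {
    queue.enqueue(DeliveryJob { message_id });
    Status::Pending // the caller polls or receives a webhook later
}

// Worker: dequeues and performs the actual slow, fallible SES call.
pub fn worker_tick(queue: &mut dyn JobQueue) -> Option<Status> {
    queue.dequeue().map(|_job| Status::Sent)
}
```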
Soft Deletes Everywhere
Decision: All entities use deleted_at TIMESTAMPTZ instead of DELETE FROM.
Why:
- Audit trail: Email systems need to answer "what happened to that message?" months later
- Recovery: Accidental deletes are recoverable without backups
- Referential integrity: Hard-deleting a thread while delivery events reference its messages creates orphans
- S3 retention: Raw MIME stays in S3 per retention policy; DB records should match
Cost: Every query needs WHERE deleted_at IS NULL. We use partial indexes on deleted_at IS NULL to keep this fast.
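In SQL terms the pattern looks roughly like this (table and index names are hypothetical):

```sql
-- Soft delete: mark instead of removing
UPDATE messages SET deleted_at = now() WHERE id = $1;

-- Every live-data query filters deleted rows out
SELECT * FROM messages WHERE thread_id = $1 AND deleted_at IS NULL;

-- Partial index: only live rows are indexed, so the filter stays cheap
CREATE INDEX messages_live_thread_idx
    ON messages (thread_id)
    WHERE deleted_at IS NULL;
```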
7-Day Subject Matching Window
Decision: Subject-based thread fallback only matches messages from the last 7 days.
Why: Without a time window, unrelated emails with common subjects ("Hello", "Quick question", "Follow up") would incorrectly merge into ancient threads. 7 days covers typical reply cadence while preventing false matches.
Why not configurable? We considered making it configurable per inbox but decided against it — the complexity isn't worth it. If a customer needs different behavior, they should use proper In-Reply-To / References headers, which don't have a time window.
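A sketch of the fallback check (function names are hypothetical, and the `Re:`/`Fwd:` prefix stripping is an assumption about how subjects are normalized):

```rust
use std::time::Duration;

const SUBJECT_MATCH_WINDOW: Duration = Duration::from_secs(7 * 24 * 60 * 60);

// Normalize a subject for comparison (prefix handling is illustrative).
pub fn normalize_subject(subject: &str) -> String {
    let mut s = subject.trim();
    loop {
        let lower = s.to_ascii_lowercase();
        let stripped = lower
            .strip_prefix("re:")
            .or_else(|| lower.strip_prefix("fwd:"));
        match stripped {
            Some(rest) => s = s[s.len() - rest.len()..].trim_start(),
            None => break,
        }
    }
    s.to_ascii_lowercase()
}

// Fallback matching: same normalized subject AND within the 7-day window.
pub fn subject_fallback_matches(
    subject_a: &str,
    subject_b: &str,
    age_of_candidate: Duration,
) -> bool {
    age_of_candidate <= SUBJECT_MATCH_WINDOW
        && normalize_subject(subject_a) == normalize_subject(subject_b)
}
```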
SQS Over Direct Processing
Decision: All async work goes through SQS queues, not direct function calls or database polling.
Why:
- Reliability: SQS guarantees at-least-once delivery. If the worker crashes mid-processing, the message becomes visible again after the visibility timeout
- Backpressure: Queue depth is a natural metric for scaling. Worker auto-scales on queue depth
- Decoupling: The API doesn't need to know if the worker is running. Messages queue up and drain when the worker recovers
- Dead letter queues: Failed messages automatically land in the DLQ after max retries, preventing poison pill loops
Exception: Webhook delivery uses database polling (not SQS) because webhooks need per-thread ordering (FIFO) and complex retry scheduling that's easier to express in SQL.
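The at-least-once and DLQ behavior can be simulated in a few lines (this models the semantics, not the SQS API):

```rust
use std::collections::VecDeque;

const MAX_RECEIVES: u32 = 3;

struct QueuedMessage {
    body: String,
    receive_count: u32,
}

pub struct SimulatedQueue {
    main: VecDeque<QueuedMessage>,
    pub dlq: Vec<String>,
}

impl SimulatedQueue {
    pub fn new() -> Self {
        Self { main: VecDeque::new(), dlq: Vec::new() }
    }

    pub fn send(&mut self, body: &str) {
        self.main.push_back(QueuedMessage { body: body.to_string(), receive_count: 0 });
    }

    // One worker iteration: receive, process, delete on success.
    // On failure the message becomes visible again (at-least-once);
    // after MAX_RECEIVES it moves to the DLQ instead of looping forever.
    pub fn process_once(&mut self, handler: impl Fn(&str) -> Result<(), ()>) {
        if let Some(mut msg) = self.main.pop_front() {
            msg.receive_count += 1;
            match handler(&msg.body) {
                Ok(()) => {} // success: message deleted
                Err(()) if msg.receive_count >= MAX_RECEIVES => self.dlq.push(msg.body),
                Err(()) => self.main.push_back(msg), // redelivered later
            }
        }
    }
}
```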
Distroless Container Images
Decision: Runtime images use gcr.io/distroless/cc-debian12:nonroot, not Alpine or full Debian.
Why:
- Security: No shell, no package manager, no utilities an attacker could use
- Size: ~20MB base vs ~120MB for Debian slim
- Compatibility: Our binaries link against glibc (via `native-tls`). Alpine uses musl, which would require recompilation and has known compatibility issues with some crates
Tradeoff: You can't docker exec into the container for debugging. Use CloudWatch logs and ECS Exec instead.
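A multi-stage build consistent with this decision might look like the following (stage layout, Rust version, and binary name are assumptions, not the actual Dockerfile):

```dockerfile
# Build stage: full toolchain, glibc-based
FROM rust:1.77 AS build
WORKDIR /app
COPY . .
RUN cargo build --release --bin bin-api

# Runtime stage: distroless, nonroot, glibc-compatible, no shell
FROM gcr.io/distroless/cc-debian12:nonroot
COPY --from=build /app/target/release/bin-api /bin-api
ENTRYPOINT ["/bin-api"]
```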
Single PostgresRepository (Not Per-Entity)
Decision: One PostgresRepository struct implements all repository traits, rather than separate MessageRepo, ThreadRepo, etc.
Why: Having 10+ repository structs that all wrap the same PgPool adds boilerplate with no benefit. A single struct with trait implementations gives:
- One place to manage the connection pool
- Clean trait boundaries (callers only see the trait they need)
- Less wiring in the binary entrypoints
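The shape is simply one struct behind several traits (a sketch with stand-in storage instead of a real `PgPool`; trait and method names are illustrative):

```rust
// Repository traits live in core; each caller depends on exactly one.
pub trait MessageRepo {
    fn save_message(&mut self, body: &str);
    fn message_count(&self) -> usize;
}

pub trait ThreadRepo {
    fn thread_count(&self) -> usize;
}

// One concrete struct implements every trait and owns the shared state
// (in real code, a single sqlx::PgPool).
pub struct PostgresRepository {
    messages: Vec<String>,
    threads: Vec<String>,
}

impl PostgresRepository {
    pub fn new() -> Self {
        Self { messages: Vec::new(), threads: Vec::new() }
    }
}

impl MessageRepo for PostgresRepository {
    fn save_message(&mut self, body: &str) {
        self.messages.push(body.to_string());
    }
    fn message_count(&self) -> usize {
        self.messages.len()
    }
}

impl ThreadRepo for PostgresRepository {
    fn thread_count(&self) -> usize {
        self.threads.len()
    }
}

// A handler sees only the trait boundary it needs.
pub fn ingest_handler(repo: &mut dyn MessageRepo, body: &str) -> usize {
    repo.save_message(body);
    repo.message_count()
}
```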
ClamAV for Malware (Not Spam Scoring)
Decision: Scan for malware with ClamAV. Do not score spam.
Why: Mailman is not a user-facing inbox — it's API infrastructure. The customer's application decides what to do with messages. Malware scanning protects our storage (don't persist infected files), but spam scoring is the customer's concern. Adding spam scoring would mean maintaining rules, tuning thresholds, and handling false positives — all outside our scope.
Graceful degradation: ClamAV is optional. If not configured, a noop scanner is used. This lets environments without ClamAV (local dev, CI) work without mocking.
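The optional scanner can be sketched as a trait with a no-op implementation (names and the configuration flag are illustrative):

```rust
pub enum ScanResult {
    Clean,
    Infected(String),
}

pub trait MalwareScanner {
    fn scan(&self, bytes: &[u8]) -> ScanResult;
}

// Used when ClamAV is not configured: everything passes,
// so local dev and CI need no mocking.
pub struct NoopScanner;

impl MalwareScanner for NoopScanner {
    fn scan(&self, _bytes: &[u8]) -> ScanResult {
        ScanResult::Clean
    }
}

// Wiring sketch: pick the scanner once, at startup.
pub fn build_scanner(clamav_configured: bool) -> Box<dyn MalwareScanner> {
    if clamav_configured {
        // real code would return a ClamAV-backed scanner here
        unimplemented!("ClamAV adapter")
    } else {
        Box::new(NoopScanner)
    }
}
```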
Redis for Volume Limiting (Not In-Memory)
Decision: Outbound volume rate limiting uses Redis, not in-memory counters.
Why: The API runs as multiple ECS tasks behind a load balancer. In-memory counters are per-process and don't aggregate across instances. Redis provides a shared counter so the volume limit is enforced globally.
Graceful degradation: If Redis is unavailable, volume limiting is disabled (fail-open). This is intentional — we'd rather send email than block sends because Redis is down. The per-request rate limit (in-memory governor) still applies as a safety net.
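The fail-open policy can be sketched as follows (a hypothetical trait; the real counter would be a Redis `INCR` with an expiry): only an explicit denial blocks the send, and a backend outage is treated as permission.

```rust
// Outcome of consulting the shared counter.
pub enum LimitCheck {
    Allowed,
    Denied,
    BackendUnavailable, // e.g. Redis connection error
}

pub trait VolumeLimiter {
    fn check_and_increment(&mut self, key: &str) -> LimitCheck;
}

// Policy: fail open. A Redis outage must not stop outbound mail.
pub fn may_send(limiter: &mut dyn VolumeLimiter, key: &str) -> bool {
    match limiter.check_and_increment(key) {
        LimitCheck::Denied => false,
        LimitCheck::Allowed | LimitCheck::BackendUnavailable => true,
    }
}

// A limiter whose backend is down, for illustration.
pub struct DownLimiter;

impl VolumeLimiter for DownLimiter {
    fn check_and_increment(&mut self, _key: &str) -> LimitCheck {
        LimitCheck::BackendUnavailable
    }
}
```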
Auth Middleware Pattern (from_fn_with_state)
Decision: Use Axum's from_fn_with_state middleware for blanket route protection, not FromRequestParts extractors.
Why: FromRequestParts is more idiomatic for per-route auth, but from_fn_with_state is simpler for blanket protection of all routes. Middleware allows easy exclusion of the health check without per-handler changes:
```rust
let protected_routes = Router::new()
    .route("/send", post(...))
    .route_layer(middleware::from_fn_with_state(state, require_api_key));

let public_routes = Router::new()
    .route("/health", get(health_check));

Router::new().merge(protected_routes).merge(public_routes)
```

Request Rate Limiting (governor over tower-governor)
Decision: Use the governor crate directly with DashMap for per-key rate limiting, not tower-governor.
Why:
- `tower-governor` requires complex key extractor setup
- Direct `governor` integration is simpler and matches the existing `from_fn_with_state` middleware pattern
- `DashMap` over `RwLock<HashMap>` — lower contention for concurrent per-key access, no global lock on read operations, built-in entry API for atomic get-or-insert
- Lazy limiter creation — only create limiters for active API keys, memory efficient
- Rate limit not consumed on auth failure — middleware runs after auth, so invalid API key requests never reach the rate limiter (prevents DoS amplification via rate limit exhaustion)
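The lazy per-key pattern can be sketched with std types alone (the real code uses `DashMap` and `governor`; this sketch substitutes `Mutex<HashMap>` and a naive counter to show the shape):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Per-key state, created lazily on first use (a stand-in for a
// governor rate limiter; here just a counter against a fixed quota).
struct KeyState {
    used: u32,
}

pub struct PerKeyLimiter {
    limit: u32,
    keys: Mutex<HashMap<String, KeyState>>, // DashMap in the real code
}

impl PerKeyLimiter {
    pub fn new(limit: u32) -> Self {
        Self { limit, keys: Mutex::new(HashMap::new()) }
    }

    // Atomic get-or-insert via the entry API, then consume one unit of quota.
    pub fn check(&self, api_key: &str) -> bool {
        let mut keys = self.keys.lock().unwrap();
        let state = keys
            .entry(api_key.to_string())
            .or_insert(KeyState { used: 0 });
        if state.used < self.limit {
            state.used += 1;
            true
        } else {
            false
        }
    }
}
```

Because limiters are created on first use, keys that never authenticate never allocate state.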
Webhook Emission via Processor Wrapping
Decision: Webhook events are emitted by wrapping existing processors (IngestProcessorWithWebhooks, TelemetryProcessorWithWebhooks) rather than modifying them.
Why: The original processors (IngestProcessor, TelemetryProcessor) remain unchanged and fully tested. Extended versions wrap the original, adding webhook emission after successful processing. This maintains backward compatibility with existing tests and keeps the webhook concern separate from the core processing logic. Emission doesn't block the pipeline — failures are logged but don't fail the main operation.
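The wrapping is a plain decorator (a sketch that simplifies the real processor signatures; the `emit` closure stands in for the webhook sender):

```rust
pub trait Processor {
    fn process(&self, input: &str) -> Result<String, String>;
}

// The original, unchanged processor.
pub struct IngestProcessor;

impl Processor for IngestProcessor {
    fn process(&self, input: &str) -> Result<String, String> {
        Ok(format!("ingested: {input}"))
    }
}

// Wrapper: delegates to the inner processor, then emits a webhook
// event on success. Emission failures are logged, never propagated.
pub struct IngestProcessorWithWebhooks<P: Processor, E: Fn(&str) -> Result<(), String>> {
    inner: P,
    emit: E,
}

impl<P: Processor, E: Fn(&str) -> Result<(), String>> IngestProcessorWithWebhooks<P, E> {
    pub fn new(inner: P, emit: E) -> Self {
        Self { inner, emit }
    }
}

impl<P: Processor, E: Fn(&str) -> Result<(), String>> Processor
    for IngestProcessorWithWebhooks<P, E>
{
    fn process(&self, input: &str) -> Result<String, String> {
        let out = self.inner.process(input)?;
        if let Err(e) = (self.emit)(&out) {
            eprintln!("webhook emission failed (ignored): {e}");
        }
        Ok(out)
    }
}
```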