Architecture Decisions
This page documents the key design decisions in Mailman, what alternatives were considered, and why we chose what we did. If you're wondering "why does it work this way?", start here.
Hexagonal Architecture (Ports & Adapters)
Decision: Domain logic lives in core with zero I/O dependencies. All external interactions go through trait boundaries.
Why: Email infrastructure has a lot of external dependencies — Postgres, S3, SES, SQS, Redis, ClamAV. If domain logic is coupled to any of these, testing becomes slow and brittle, and swapping implementations requires touching business logic.
Rules:
- `core` never imports `sqlx`, `aws-sdk-*`, `reqwest`, or any I/O crate
- Repository traits are defined in `core/src/repository.rs`, implemented in `adapters-*` crates
- Route handlers in `http` call trait methods, never concrete adapter types
- Binary crates (`bin-api`, `bin-worker`) are the only place where concrete types are wired together
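These rules can be sketched in miniature (all names here are illustrative, not Mailman's actual types): `core` defines the trait and the domain logic, an adapter implements it, and only a binary crate would wire the concrete type in.

```rust
use std::collections::HashMap;

// --- core: no I/O, only a trait boundary ---
pub trait MessageRepository {
    fn save(&mut self, id: u64, body: String);
    fn find(&self, id: u64) -> Option<&String>;
}

// Domain logic depends on the trait, never on a concrete store.
pub fn ingest(repo: &mut dyn MessageRepository, id: u64, raw: &str) {
    // (real code would parse MIME, assign threads, etc.)
    repo.save(id, raw.to_string());
}

// --- adapter: a stand-in for the Postgres implementation ---
pub struct InMemoryRepository {
    rows: HashMap<u64, String>,
}

impl InMemoryRepository {
    pub fn new() -> Self {
        Self { rows: HashMap::new() }
    }
}

impl MessageRepository for InMemoryRepository {
    fn save(&mut self, id: u64, body: String) {
        self.rows.insert(id, body);
    }
    fn find(&self, id: u64) -> Option<&String> {
        self.rows.get(&id)
    }
}
```

Because `ingest` only sees `dyn MessageRepository`, swapping the in-memory store for Postgres is invisible to it.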
What this means in practice: When you write a new feature, the domain logic and the database query are in different crates. This feels like overhead for simple CRUD, but it pays off when:
- Integration tests run with real Postgres but mocked AWS (fast, no AWS costs)
- You need to change how S3 keys are structured without touching threading logic
- A new adapter (e.g., swapping ClamAV for a different scanner) requires zero changes to `core`
What we rejected: Putting everything in one crate with feature flags for test mocking. This is simpler initially but makes it impossible to enforce boundaries — a single `use aws_sdk_s3` in the wrong file breaks the architecture silently.
Customer-Signed JWTs (Not Server-Issued)
Decision: Customers generate their own JWTs with their private key and register the corresponding public key with Mailman.
Why: Mailman is infrastructure, not an identity provider. Customers already have auth systems. Issuing tokens would mean managing sessions, refresh flows, and password resets — all out of scope. The customer-signed model means:
- Mailman never sees private keys
- Token issuance is the customer's responsibility (their auth service, their CI/CD, etc.)
- Key rotation is a public key swap, not a coordinated secret rotation
- Multiple services can share a signing key or have separate keys per environment
Fallback: API keys are still supported for backward compatibility and simple integrations where JWT is overkill.
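The trust model can be sketched as follows (the types and the injected `verify` closure are hypothetical stand-ins; real verification would call an Ed25519/RSA library): Mailman holds only public keys, and key rotation is just a map update.

```rust
use std::collections::HashMap;

// Mailman's side of the model: public keys only, never private keys.
pub struct PublicKeyRegistry {
    // customer id -> registered public key bytes
    keys: HashMap<String, Vec<u8>>,
}

impl PublicKeyRegistry {
    pub fn new() -> Self {
        Self { keys: HashMap::new() }
    }

    // Key rotation is just replacing this entry with a new public key.
    pub fn register(&mut self, customer: &str, public_key: Vec<u8>) {
        self.keys.insert(customer.to_string(), public_key);
    }

    // `verify` stands in for a real signature check (e.g. Ed25519).
    pub fn check_token(
        &self,
        customer: &str,
        token: &[u8],
        verify: impl Fn(&[u8], &[u8]) -> bool,
    ) -> bool {
        match self.keys.get(customer) {
            Some(key) => verify(key.as_slice(), token),
            None => false, // unknown customer: reject
        }
    }
}
```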
Queued Outbound Delivery (Not Inline SES)
Decision: POST /send stores the message and enqueues a delivery job to SQS. The worker does the actual SES send.
Why: Inline SES calls from the API handler create several problems:
- Latency: SES calls take 100-500ms, which makes the API slow for the caller
- Reliability: If SES is temporarily down, sends fail and the caller has to retry
- Observability: With queued delivery, every message has a tracked lifecycle (pending → sent → delivered/bounced)
- Rate control: The worker can enforce volume limits and back off on SES throttling without blocking API callers
Tradeoff: The caller gets 202 Accepted with status: "pending", not immediate confirmation. They must use webhooks or poll for delivery status.
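The split can be sketched like this (all types are illustrative; the real queue is SQS and the real send is SES): the handler persists the message as pending and enqueues a job, and only the worker performs the slow, fallible send.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
pub enum Status {
    Pending,
    Sent,
}

pub struct DeliveryJob {
    pub message_id: u64,
}

// Stand-in for the SQS client behind a trait boundary.
pub trait JobQueue {
    fn enqueue(&mut self, job: DeliveryJob);
    fn dequeue(&mut self) -> Option<DeliveryJob>;
}

pub struct InMemoryQueue(pub VecDeque<DeliveryJob>);

impl JobQueue for InMemoryQueue {
    fn enqueue(&mut self, job: DeliveryJob) {
        self.0.push_back(job);
    }
    fn dequeue(&mut self) -> Option<DeliveryJob> {
        self.0.pop_front()
    }
}

// API handler: store as pending, enqueue, return immediately (202 Accepted).
pub fn handle_send(queue: &mut dyn JobQueue, message_id: u64) -> Status {
    queue.enqueue(DeliveryJob { message_id });
    Status::Pending // the caller polls or receives a webhook later
}

// Worker: dequeues and performs the actual slow, fallible SES call.
pub fn worker_tick(queue: &mut dyn JobQueue) -> Option<Status> {
    queue.dequeue().map(|_job| Status::Sent)
}
```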
Soft Deletes Everywhere
Decision: All entities use deleted_at TIMESTAMPTZ instead of DELETE FROM.
Why:
- Audit trail: Email systems need to answer "what happened to that message?" months later
- Recovery: Accidental deletes are recoverable without backups
- Referential integrity: Hard-deleting a thread while delivery events reference its messages creates orphans
- S3 retention: Raw MIME stays in S3 per retention policy; DB records should match
Cost: Every query needs WHERE deleted_at IS NULL. We use partial indexes on deleted_at IS NULL to keep this fast.
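In SQL terms the pattern looks roughly like this (table and index names are hypothetical):

```sql
-- Soft delete: mark instead of removing
UPDATE messages SET deleted_at = now() WHERE id = $1;

-- Every live-data query filters deleted rows out
SELECT * FROM messages WHERE thread_id = $1 AND deleted_at IS NULL;

-- Partial index: only live rows are indexed, so the filter stays cheap
CREATE INDEX messages_live_thread_idx
    ON messages (thread_id)
    WHERE deleted_at IS NULL;
```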
7-Day Subject Matching Window
Decision: Subject-based thread fallback only matches messages from the last 7 days.
Why: Without a time window, unrelated emails with common subjects ("Hello", "Quick question", "Follow up") would incorrectly merge into ancient threads. 7 days covers typical reply cadence while preventing false matches.
Why not configurable? We considered making it configurable per inbox but decided against it — the complexity isn't worth it. If a customer needs different behavior, they should use proper In-Reply-To / References headers, which don't have a time window.
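A sketch of the fallback check (function names are hypothetical, and the `Re:`/`Fwd:` prefix stripping is an assumption about how subjects are normalized):

```rust
use std::time::Duration;

const SUBJECT_MATCH_WINDOW: Duration = Duration::from_secs(7 * 24 * 60 * 60);

// Normalize a subject for comparison (prefix handling is illustrative).
pub fn normalize_subject(subject: &str) -> String {
    let mut s = subject.trim();
    loop {
        let lower = s.to_ascii_lowercase();
        let stripped = lower
            .strip_prefix("re:")
            .or_else(|| lower.strip_prefix("fwd:"));
        match stripped {
            Some(rest) => s = s[s.len() - rest.len()..].trim_start(),
            None => break,
        }
    }
    s.to_ascii_lowercase()
}

// Fallback matching: same normalized subject AND within the 7-day window.
pub fn subject_fallback_matches(
    subject_a: &str,
    subject_b: &str,
    age_of_candidate: Duration,
) -> bool {
    age_of_candidate <= SUBJECT_MATCH_WINDOW
        && normalize_subject(subject_a) == normalize_subject(subject_b)
}
```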
SQS Over Direct Processing
Decision: All async work goes through SQS queues, not direct function calls or database polling.
Why:
- Reliability: SQS guarantees at-least-once delivery. If the worker crashes mid-processing, the message becomes visible again after the visibility timeout
- Backpressure: Queue depth is a natural metric for scaling. Worker auto-scales on queue depth
- Decoupling: The API doesn't need to know if the worker is running. Messages queue up and drain when the worker recovers
- Dead letter queues: Failed messages automatically land in the DLQ after max retries, preventing poison pill loops
Exception: Webhook delivery uses database polling (not SQS) because webhooks need per-thread ordering (FIFO) and complex retry scheduling that's easier to express in SQL.
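The at-least-once and DLQ behavior can be simulated in a few lines (this models the semantics, not the SQS API):

```rust
use std::collections::VecDeque;

const MAX_RECEIVES: u32 = 3;

struct QueuedMessage {
    body: String,
    receive_count: u32,
}

pub struct SimulatedQueue {
    main: VecDeque<QueuedMessage>,
    pub dlq: Vec<String>,
}

impl SimulatedQueue {
    pub fn new() -> Self {
        Self { main: VecDeque::new(), dlq: Vec::new() }
    }

    pub fn send(&mut self, body: &str) {
        self.main.push_back(QueuedMessage { body: body.to_string(), receive_count: 0 });
    }

    // One worker iteration: receive, process, delete on success.
    // On failure the message becomes visible again (at-least-once);
    // after MAX_RECEIVES it moves to the DLQ instead of looping forever.
    pub fn process_once(&mut self, handler: impl Fn(&str) -> Result<(), ()>) {
        if let Some(mut msg) = self.main.pop_front() {
            msg.receive_count += 1;
            match handler(&msg.body) {
                Ok(()) => {} // success: message deleted
                Err(()) if msg.receive_count >= MAX_RECEIVES => self.dlq.push(msg.body),
                Err(()) => self.main.push_back(msg), // redelivered later
            }
        }
    }
}
```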
Distroless Container Images
Decision: Runtime images use gcr.io/distroless/cc-debian12:nonroot, not Alpine or full Debian.
Why:
- Security: No shell, no package manager, no utilities an attacker could use
- Size: ~20MB base vs ~120MB for Debian slim
- Compatibility: Our binaries link against glibc (via `native-tls`). Alpine uses musl, which would require recompilation and has known compatibility issues with some crates
Tradeoff: You can't docker exec into the container for debugging. Use CloudWatch logs and ECS Exec instead.
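A multi-stage build consistent with this decision might look like the following (stage layout, Rust version, and binary name are assumptions, not the actual Dockerfile):

```dockerfile
# Build stage: full toolchain, glibc-based
FROM rust:1.77 AS build
WORKDIR /app
COPY . .
RUN cargo build --release --bin bin-api

# Runtime stage: distroless, nonroot, glibc-compatible, no shell
FROM gcr.io/distroless/cc-debian12:nonroot
COPY --from=build /app/target/release/bin-api /bin-api
ENTRYPOINT ["/bin-api"]
```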
Single PostgresRepository (Not Per-Entity)
Decision: One PostgresRepository struct implements all repository traits, rather than separate MessageRepo, ThreadRepo, etc.
Why: Having 10+ repository structs that all wrap the same PgPool adds boilerplate with no benefit. A single struct with trait implementations gives:
- One place to manage the connection pool
- Clean trait boundaries (callers only see the trait they need)
- Less wiring in the binary entrypoints
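The shape is simply one struct behind several traits (a sketch with stand-in storage instead of a real `PgPool`; trait and method names are illustrative):

```rust
// Repository traits live in core; each caller depends on exactly one.
pub trait MessageRepo {
    fn save_message(&mut self, body: &str);
    fn message_count(&self) -> usize;
}

pub trait ThreadRepo {
    fn thread_count(&self) -> usize;
}

// One concrete struct implements every trait and owns the shared state
// (in real code, a single sqlx::PgPool).
pub struct PostgresRepository {
    messages: Vec<String>,
    threads: Vec<String>,
}

impl PostgresRepository {
    pub fn new() -> Self {
        Self { messages: Vec::new(), threads: Vec::new() }
    }
}

impl MessageRepo for PostgresRepository {
    fn save_message(&mut self, body: &str) {
        self.messages.push(body.to_string());
    }
    fn message_count(&self) -> usize {
        self.messages.len()
    }
}

impl ThreadRepo for PostgresRepository {
    fn thread_count(&self) -> usize {
        self.threads.len()
    }
}

// A handler sees only the trait boundary it needs.
pub fn ingest_handler(repo: &mut dyn MessageRepo, body: &str) -> usize {
    repo.save_message(body);
    repo.message_count()
}
```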
ClamAV for Malware (Not Spam Scoring)
Decision: Scan for malware with ClamAV. Do not score spam.
Why: Mailman is not a user-facing inbox — it's API infrastructure. The customer's application decides what to do with messages. Malware scanning protects our storage (don't persist infected files), but spam scoring is the customer's concern. Adding spam scoring would mean maintaining rules, tuning thresholds, and handling false positives — all outside our scope.
Graceful degradation: ClamAV is optional. If not configured, a noop scanner is used. This lets environments without ClamAV (local dev, CI) work without mocking.
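The optional scanner can be sketched as a trait with a no-op implementation (names and the configuration flag are illustrative):

```rust
pub enum ScanResult {
    Clean,
    Infected(String),
}

pub trait MalwareScanner {
    fn scan(&self, bytes: &[u8]) -> ScanResult;
}

// Used when ClamAV is not configured: everything passes,
// so local dev and CI need no mocking.
pub struct NoopScanner;

impl MalwareScanner for NoopScanner {
    fn scan(&self, _bytes: &[u8]) -> ScanResult {
        ScanResult::Clean
    }
}

// Wiring sketch: pick the scanner once, at startup.
pub fn build_scanner(clamav_configured: bool) -> Box<dyn MalwareScanner> {
    if clamav_configured {
        // real code would return a ClamAV-backed scanner here
        unimplemented!("ClamAV adapter")
    } else {
        Box::new(NoopScanner)
    }
}
```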
Redis for Volume Limiting (Not In-Memory)
Decision: Outbound volume rate limiting uses Redis, not in-memory counters.
Why: The API runs as multiple ECS tasks behind a load balancer. In-memory counters are per-process and don't aggregate across instances. Redis provides a shared counter so the volume limit is enforced globally.
Graceful degradation: If Redis is unavailable, volume limiting is disabled (fail-open). This is intentional — we'd rather send email than block sends because Redis is down. The per-request rate limit (in-memory governor) still applies as a safety net.
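The fail-open policy can be sketched as follows (a hypothetical trait; the real counter would be a Redis `INCR` with an expiry): only an explicit denial blocks the send, and a backend outage is treated as permission.

```rust
// Outcome of consulting the shared counter.
pub enum LimitCheck {
    Allowed,
    Denied,
    BackendUnavailable, // e.g. Redis connection error
}

pub trait VolumeLimiter {
    fn check_and_increment(&mut self, key: &str) -> LimitCheck;
}

// Policy: fail open. A Redis outage must not stop outbound mail.
pub fn may_send(limiter: &mut dyn VolumeLimiter, key: &str) -> bool {
    match limiter.check_and_increment(key) {
        LimitCheck::Denied => false,
        LimitCheck::Allowed | LimitCheck::BackendUnavailable => true,
    }
}

// A limiter whose backend is down, for illustration.
pub struct DownLimiter;

impl VolumeLimiter for DownLimiter {
    fn check_and_increment(&mut self, _key: &str) -> LimitCheck {
        LimitCheck::BackendUnavailable
    }
}
```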
Auth Middleware Pattern (from_fn_with_state)
Decision: Use Axum's from_fn_with_state middleware for blanket route protection, not FromRequestParts extractors.
Why: FromRequestParts is more idiomatic for per-route auth, but from_fn_with_state is simpler for blanket protection of all routes. Middleware allows easy exclusion of the health check without per-handler changes:
```rust
let protected_routes = Router::new()
    .route("/send", post(...))
    .route_layer(middleware::from_fn_with_state(state, require_api_key));

let public_routes = Router::new()
    .route("/health", get(health_check));

Router::new().merge(protected_routes).merge(public_routes)
```

Request Rate Limiting (governor over tower-governor)
Decision: Use the governor crate directly with DashMap for per-key rate limiting, not tower-governor.
Why:
- `tower-governor` requires complex key extractor setup
- Direct `governor` integration is simpler and matches the existing `from_fn_with_state` middleware pattern
- `DashMap` over `RwLock<HashMap>` — lower contention for concurrent per-key access, no global lock on read operations, built-in entry API for atomic get-or-insert
- Lazy limiter creation — only create limiters for active API keys, memory efficient
- Rate limit not consumed on auth failure — middleware runs after auth, so invalid API key requests never reach the rate limiter (prevents DoS amplification via rate limit exhaustion)
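The lazy per-key pattern can be sketched with std types alone (the real code uses `DashMap` and `governor`; this sketch substitutes `Mutex<HashMap>` and a naive counter to show the shape):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Per-key state, created lazily on first use (a stand-in for a
// governor rate limiter; here just a counter against a fixed quota).
struct KeyState {
    used: u32,
}

pub struct PerKeyLimiter {
    limit: u32,
    keys: Mutex<HashMap<String, KeyState>>, // DashMap in the real code
}

impl PerKeyLimiter {
    pub fn new(limit: u32) -> Self {
        Self { limit, keys: Mutex::new(HashMap::new()) }
    }

    // Atomic get-or-insert via the entry API, then consume one unit of quota.
    pub fn check(&self, api_key: &str) -> bool {
        let mut keys = self.keys.lock().unwrap();
        let state = keys
            .entry(api_key.to_string())
            .or_insert(KeyState { used: 0 });
        if state.used < self.limit {
            state.used += 1;
            true
        } else {
            false
        }
    }
}
```

Because limiters are created on first use, keys that never authenticate never allocate state.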
Webhook Emission via Processor Wrapping
Decision: Webhook events are emitted by wrapping existing processors (IngestProcessorWithWebhooks, TelemetryProcessorWithWebhooks) rather than modifying them.
Why: The original processors (IngestProcessor, TelemetryProcessor) remain unchanged and fully tested. Extended versions wrap the original, adding webhook emission after successful processing. This maintains backward compatibility with existing tests and keeps the webhook concern separate from the core processing logic. Emission doesn't block the pipeline — failures are logged but don't fail the main operation.
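The wrapping is a plain decorator (a sketch that simplifies the real processor signatures; the `emit` closure stands in for the webhook sender):

```rust
pub trait Processor {
    fn process(&self, input: &str) -> Result<String, String>;
}

// The original, unchanged processor.
pub struct IngestProcessor;

impl Processor for IngestProcessor {
    fn process(&self, input: &str) -> Result<String, String> {
        Ok(format!("ingested: {input}"))
    }
}

// Wrapper: delegates to the inner processor, then emits a webhook
// event on success. Emission failures are logged, never propagated.
pub struct IngestProcessorWithWebhooks<P: Processor, E: Fn(&str) -> Result<(), String>> {
    inner: P,
    emit: E,
}

impl<P: Processor, E: Fn(&str) -> Result<(), String>> IngestProcessorWithWebhooks<P, E> {
    pub fn new(inner: P, emit: E) -> Self {
        Self { inner, emit }
    }
}

impl<P: Processor, E: Fn(&str) -> Result<(), String>> Processor
    for IngestProcessorWithWebhooks<P, E>
{
    fn process(&self, input: &str) -> Result<String, String> {
        let out = self.inner.process(input)?;
        if let Err(e) = (self.emit)(&out) {
            eprintln!("webhook emission failed (ignored): {e}");
        }
        Ok(out)
    }
}
```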