Runbook
Operational procedures for managing the Mailman service.
Health Checks
API Health
bash
curl https://api.mail.example.com/health
# → {"status":"ok"}
The health endpoint verifies database connectivity.
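For scripted checks such as cron jobs or deploy gates, the probe can be wrapped so that it exits non-zero on failure. This is a minimal sketch, assuming the response body has the `{"status":"ok"}` shape shown above; the function names `is_healthy` and `check_api_health` are illustrative and not part of Mailman.

```shell
# Succeeds only when the health response body reports status "ok".
is_healthy() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Fetches the health endpoint. Prints "healthy" and returns 0 on success,
# returns 1 for an unexpected body and 2 when the request itself fails.
check_api_health() {
  local url="${1:-https://api.mail.example.com/health}"
  local body
  # --fail turns HTTP errors into a non-zero exit; --max-time bounds the wait.
  if ! body="$(curl --silent --fail --max-time 5 "$url")"; then
    echo "unreachable: $url" >&2
    return 2
  fi
  if is_healthy "$body"; then
    echo "healthy"
  else
    echo "unhealthy: $body" >&2
    return 1
  fi
}
```

Calling `check_api_health` with no argument probes the URL used above; a different URL can be passed as the first argument.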
Worker Health
The worker has no HTTP endpoint. Monitor via:
- CloudWatch logs for processing activity
- SQS queue depth (should trend toward 0)
- DLQ message counts (should be 0)
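The two queue signals above could be polled from a script. The sketch below assumes the AWS CLI is configured and that `INBOUND_QUEUE_URL` and `DLQ_URL` are set as in the queue commands later in this runbook; `queue_depth` and `check_worker_queues` are hypothetical helper names.

```shell
# Prints the approximate number of visible messages in the given queue.
queue_depth() {
  aws sqs get-queue-attributes \
    --queue-url "$1" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' \
    --output text
}

# Prints both depths; returns 0 only when the DLQ is empty,
# and 2 when a depth could not be read.
check_worker_queues() {
  local inbound dlq
  inbound="$(queue_depth "$INBOUND_QUEUE_URL")" || return 2
  dlq="$(queue_depth "$DLQ_URL")" || return 2
  echo "inbound=$inbound dlq=$dlq"
  # Any message in the DLQ indicates a processing failure.
  [ "$dlq" -eq 0 ]
}
```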
Database Operations
Run Migrations
Migrations are embedded in the adapters-postgres crate and run automatically on startup via SQLx.
Check Migration Status
bash
psql "$DATABASE_URL" -c "SELECT * FROM _sqlx_migrations ORDER BY version;"
Manual Query Examples
sql
-- Count messages by direction
SELECT direction, COUNT(*) FROM messages GROUP BY direction;
-- Check suppression list
SELECT * FROM suppressed_addresses ORDER BY created_at DESC LIMIT 20;
-- Find threads with most messages
SELECT t.id, t.subject, COUNT(m.message_id) as msg_count
FROM threads t JOIN messages m ON m.thread_id = t.id
GROUP BY t.id ORDER BY msg_count DESC LIMIT 10;
-- Check webhook endpoint health
SELECT url, status, failure_count, last_success_at, last_failure_at
FROM webhook_endpoints ORDER BY failure_count DESC;
Queue Management
Check Queue Depth
bash
aws sqs get-queue-attributes \
--queue-url $INBOUND_QUEUE_URL \
--attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
Inspect DLQ Messages
bash
aws sqs receive-message \
--queue-url $DLQ_URL \
--max-number-of-messages 1 \
--message-attribute-names All
Reprocess DLQ Messages
Move messages from DLQ back to the main queue for reprocessing:
bash
aws sqs start-message-move-task \
--source-arn $DLQ_ARN \
--destination-arn $QUEUE_ARN
Troubleshooting
Messages Not Processing
- Check worker logs for errors
- Check inbound SQS queue depth — if growing, worker may be down or slow
- Check DLQ — messages there indicate processing failures
- Check SES receipt rules are active
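To tell whether the inbound queue is actually growing, two depth samples taken some time apart can be compared. This is a rough sketch, assuming `INBOUND_QUEUE_URL` is set and the AWS CLI is configured; the helper names and the 60-second default interval are illustrative choices.

```shell
# Classifies two successive depth samples (first, second).
depth_trend() {
  if [ "$2" -gt "$1" ]; then
    echo "growing"
  elif [ "$2" -lt "$1" ]; then
    echo "draining"
  else
    echo "steady"
  fi
}

# Prints the approximate number of visible messages in the inbound queue.
inbound_depth() {
  aws sqs get-queue-attributes \
    --queue-url "$INBOUND_QUEUE_URL" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' \
    --output text
}

# Samples the inbound queue twice, "interval" seconds apart (default 60).
inbound_trend() {
  local interval="${1:-60}" first second
  first="$(inbound_depth)" || return 2
  sleep "$interval"
  second="$(inbound_depth)" || return 2
  depth_trend "$first" "$second"
}
```

SQS depth figures are approximate, so a single "growing" result is a hint rather than proof; repeating the sample a few times gives a more reliable picture.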
Delivery Failures
- Check the delivery_events table for the message's SES message ID
- Check the suppression list: the recipient may be suppressed
- Check domain verification status — DKIM must be verified for sending
- Check SES sending limits in AWS console
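The SES sending limits can also be read from the command line instead of the console. The sketch below uses `aws ses get-send-quota` and assumes the AWS CLI is configured for the account and region that sends mail; `ses_quota_remaining` is a hypothetical helper name.

```shell
# Prints how many more messages may be sent in the current 24-hour window,
# or "unlimited" when SES reports a quota of -1.
ses_quota_remaining() {
  aws ses get-send-quota \
    --query '[Max24HourSend,SentLast24Hours]' \
    --output text |
    awk '{ if ($1 < 0) print "unlimited"; else printf "%d\n", $1 - $2 }'
}
```

A result at or near 0 suggests that delivery failures are caused by the daily quota rather than by a specific recipient.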
Webhooks Not Firing
- Check the webhook_endpoints table: the endpoint may be disabled (failure_count >= 10)
- Check the webhook_deliveries table for pending deliveries
- Verify the endpoint URL is reachable and returning 2xx
- Check worker logs for webhook dispatch errors
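Reachability of an endpoint URL can be probed by hand with curl. This sketch sends a POST with an empty JSON body, which is an assumption: the payload Mailman actually sends is not described here, and a receiver that validates signatures or payloads may reject the probe even though it is healthy. `is_2xx` and `probe_endpoint` are illustrative names.

```shell
# Succeeds when the given HTTP status code is in the 2xx range.
is_2xx() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Probes a URL and prints "ok <code>" or "fail <code>".
probe_endpoint() {
  local url="$1" code
  # curl reports 000 when no HTTP response was received at all.
  code="$(curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' \
    --request POST --header 'Content-Type: application/json' --data '{}' \
    "$url")" || true
  if is_2xx "$code"; then
    echo "ok $code"
  else
    echo "fail ${code:-000}"
    return 1
  fi
}
```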
Domain Verification Stuck
- Verify DNS records are correctly configured
- Check the domain verification loop is running (worker logs)
- Manually check SES identity status:
bash
aws ses get-identity-verification-attributes --identities mail.acme.com
aws ses get-identity-dkim-attributes --identities mail.acme.com
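The two status fields can be pulled out of those responses with `--query`, which makes them easier to watch in a loop. This is a sketch, assuming a single identity is queried at a time; `domain_status` is a hypothetical helper name.

```shell
# Prints the verification and DKIM status for one domain and
# returns 0 only when both report "Success".
domain_status() {
  local domain="$1" verification dkim
  verification="$(aws ses get-identity-verification-attributes \
    --identities "$domain" \
    --query 'VerificationAttributes.*.VerificationStatus' \
    --output text)" || return 2
  dkim="$(aws ses get-identity-dkim-attributes \
    --identities "$domain" \
    --query 'DkimAttributes.*.DkimVerificationStatus' \
    --output text)" || return 2
  echo "verification=$verification dkim=$dkim"
  [ "$verification" = "Success" ] && [ "$dkim" = "Success" ]
}
```

For example, `domain_status mail.acme.com` prints one line with both statuses; a value that stays at `Pending` usually points back at the DNS records.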
Emergency Procedures
Pause Inbound Processing
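Set the inbound SQS queue's visibility timeout to a very high value (e.g., 43200 seconds / 12 hours, the SQS maximum). Messages the worker receives then stay hidden for that long instead of being retried. The visibility timeout only takes effect once a message has been received, so messages not yet received are still delivered to the worker once; to stop consumption entirely, also stop the worker.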
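The timeout change can be scripted so that the previous value is recorded and restored afterwards. This is a sketch, assuming the AWS CLI is configured and `INBOUND_QUEUE_URL` is set; `pause_queue` and `resume_queue` are hypothetical helper names.

```shell
# Raises the queue's visibility timeout to 12 hours and prints the
# previous value so that it can be restored later.
pause_queue() {
  local url="$1" previous
  previous="$(aws sqs get-queue-attributes \
    --queue-url "$url" \
    --attribute-names VisibilityTimeout \
    --query 'Attributes.VisibilityTimeout' \
    --output text)" || return 2
  aws sqs set-queue-attributes \
    --queue-url "$url" \
    --attributes VisibilityTimeout=43200 >/dev/null || return 2
  echo "$previous"
}

# Restores the visibility timeout to the given number of seconds.
resume_queue() {
  aws sqs set-queue-attributes \
    --queue-url "$1" \
    --attributes "VisibilityTimeout=$2"
}
```

A typical sequence is `previous="$(pause_queue "$INBOUND_QUEUE_URL")"` during the incident, followed by `resume_queue "$INBOUND_QUEUE_URL" "$previous"` once it is resolved.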
Pause Outbound Delivery
Same approach with the outbound SQS queue.
Disable a Webhook Endpoint
sql
UPDATE webhook_endpoints SET failure_count = 999 WHERE id = 'endpoint-uuid';
This effectively disables the endpoint (threshold is 10).