Runbook

Operational procedures for managing the Mailman service.

Health Checks

API Health

bash
curl https://api.mail.example.com/health
# → {"status":"ok"}

The health endpoint verifies database connectivity.
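For scripted checks, the call above could be wrapped roughly as follows. This is a minimal sketch: the function names `is_healthy` and `check_api_health` are hypothetical, and the match assumes the response body has the shape shown above.

```bash
# Hypothetical helpers; the names are illustrative, not part of Mailman.

# Succeeds if the response body reports "status":"ok" (whitespace tolerated).
is_healthy() {
  local re='"status"[[:space:]]*:[[:space:]]*"ok"'
  [[ $1 =~ $re ]]
}

# Fetches the health endpoint and succeeds only on a healthy body.
# --fail makes curl exit non-zero on HTTP errors; --max-time bounds the wait.
check_api_health() {
  local url=${1:-https://api.mail.example.com/health}
  local body
  body=$(curl --silent --fail --max-time 5 "$url") || return 1
  is_healthy "$body"
}
```

Calling `check_api_health || echo "API unhealthy"` from a cron job or deploy script would surface both HTTP errors and an unexpected body.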

Worker Health

The worker exposes no HTTP endpoint, so monitor it indirectly via:

  • CloudWatch logs for processing activity
  • SQS queue depth (should trend toward 0)
  • DLQ message counts (should be 0)
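The signals in the list above can be combined into a single verdict. The sketch below assumes two queue-depth samples taken some time apart plus the current DLQ count; `classify_worker_health` and its output labels are hypothetical names chosen for illustration.

```bash
# Hypothetical helper: classify worker health from two queue-depth samples
# and the current DLQ message count.
#   $1 = earlier queue depth, $2 = current queue depth, $3 = DLQ count
classify_worker_health() {
  local prev_depth=$1 cur_depth=$2 dlq_count=$3
  if (( dlq_count > 0 )); then
    echo "dlq-not-empty"      # DLQ should be 0
  elif (( cur_depth > prev_depth )); then
    echo "backlog-growing"    # depth should trend toward 0
  else
    echo "ok"
  fi
}
```

For example, `classify_worker_health 10 4 0` prints `ok`, while any non-zero DLQ count takes priority over the depth trend.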

Database Operations

Run Migrations

Migrations are embedded in the adapters-postgres crate and run automatically on startup via SQLx.

Check Migration Status

bash
psql "$DATABASE_URL" -c "SELECT * FROM _sqlx_migrations ORDER BY version;"
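To spot a migration that did not apply cleanly, the same table can be filtered on its `success` column. A minimal sketch, assuming the standard SQLx layout of `_sqlx_migrations` (`version`, `description`, `success`); the function names are hypothetical.

```bash
# Hypothetical helper: list migrations that SQLx recorded as unsuccessful.
# Assumes the standard _sqlx_migrations columns (version, description, success).
failed_migrations() {
  psql "$DATABASE_URL" --no-align --tuples-only \
    -c "SELECT version, description FROM _sqlx_migrations WHERE NOT success ORDER BY version;"
}

# Succeeds only when the query ran and returned no failed migrations.
migrations_clean() {
  local failed
  failed=$(failed_migrations) || return 1
  [[ -z $failed ]]
}
```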

Manual Query Examples

sql
-- Count messages by direction
SELECT direction, COUNT(*) FROM messages GROUP BY direction;

-- Check suppression list
SELECT * FROM suppressed_addresses ORDER BY created_at DESC LIMIT 20;

-- Find threads with most messages
SELECT t.id, t.subject, COUNT(m.message_id) AS msg_count
FROM threads t JOIN messages m ON m.thread_id = t.id
GROUP BY t.id, t.subject ORDER BY msg_count DESC LIMIT 10;

-- Check webhook endpoint health
SELECT url, status, failure_count, last_success_at, last_failure_at
FROM webhook_endpoints ORDER BY failure_count DESC;

Queue Management

Check Queue Depth

bash
aws sqs get-queue-attributes \
  --queue-url "$INBOUND_QUEUE_URL" \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
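For dashboards or scripts, a single number is often easier to work with than the JSON response. The sketch below sums visible and in-flight messages using the AWS CLI's standard `--query` and `--output text` options; `sum_counts` and `total_queue_depth` are hypothetical helper names.

```bash
# Hypothetical helper: sum whitespace-separated counts read from stdin.
sum_counts() {
  awk '{ for (i = 1; i <= NF; i++) total += $i } END { print total + 0 }'
}

# Prints visible + in-flight message counts for the queue URL passed as $1.
# Returns non-zero if the AWS CLI call fails.
total_queue_depth() {
  local counts
  counts=$(aws sqs get-queue-attributes \
    --queue-url "$1" \
    --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
    --query 'Attributes.[ApproximateNumberOfMessages, ApproximateNumberOfMessagesNotVisible]' \
    --output text) || return 1
  sum_counts <<<"$counts"
}
```

Both attributes are approximate, so treat the sum as a trend indicator rather than an exact count.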

Inspect DLQ Messages

bash
aws sqs receive-message \
  --queue-url "$DLQ_URL" \
  --max-number-of-messages 1 \
  --message-attribute-names All

Reprocess DLQ Messages

Move messages from the DLQ back to the main queue for reprocessing:

bash
aws sqs start-message-move-task \
  --source-arn "$DLQ_ARN" \
  --destination-arn "$QUEUE_ARN"
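The move runs asynchronously, so it can be useful to check on it afterwards. A minimal sketch using `aws sqs list-message-move-tasks`; the helper names are hypothetical, and it assumes the first result returned is the most recent task for that DLQ.

```bash
# Prints "<status> <moved> <to-move>" for the first listed move task
# on the source queue ARN passed as $1.
latest_move_task() {
  aws sqs list-message-move-tasks \
    --source-arn "$1" \
    --max-results 1 \
    --query 'Results[0].[Status, ApproximateNumberOfMessagesMoved, ApproximateNumberOfMessagesToMove]' \
    --output text
}

# Succeeds once that task reports COMPLETED.
move_task_completed() {
  local status _rest
  read -r status _rest < <(latest_move_task "$1")
  [[ $status == COMPLETED ]]
}
```

Other statuses the task can report include RUNNING, CANCELLING, CANCELLED, and FAILED.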

Troubleshooting

Messages Not Processing

  1. Check worker logs for errors
  2. Check inbound SQS queue depth: if it is growing, the worker may be down or slow
  3. Check the DLQ: messages there indicate processing failures
  4. Check that SES receipt rules are active

Delivery Failures

  1. Check the delivery_events table for the message's SES message ID
  2. Check the suppression list: the recipient may be suppressed
  3. Check domain verification status: DKIM must be verified for sending
  4. Check SES sending limits in the AWS console
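The sending-limit check in the last step can also be done from the command line with `aws ses get-send-quota`. A minimal sketch; `remaining_send_quota` is a hypothetical helper name, and the result applies to the region and account the CLI is configured for.

```bash
# Prints how many more messages SES allows in the current 24-hour window.
# Note: SES reports Max24HourSend as -1 for unlimited accounts, in which
# case the result is negative and should be read as "no limit".
remaining_send_quota() {
  local quota
  quota=$(aws ses get-send-quota \
    --query '[Max24HourSend, SentLast24Hours]' \
    --output text) || return 1
  awk '{ printf "%d\n", $1 - $2 }' <<<"$quota"
}
```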

Webhooks Not Firing

  1. Check the webhook_endpoints table: the endpoint may be disabled (failure_count >= 10)
  2. Check the webhook_deliveries table for pending deliveries
  3. Verify the endpoint URL is reachable and returning 2xx
  4. Check worker logs for webhook dispatch errors
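The reachability check in step 3 could be scripted along these lines. This is a sketch under assumptions: it sends an empty JSON POST, which may not match what Mailman actually delivers (a receiver that validates signatures or payloads may reject it even when healthy), and the function names are hypothetical.

```bash
# Succeeds for HTTP status codes in the 2xx range.
is_2xx() {
  [[ $1 =~ ^2[0-9][0-9]$ ]]
}

# Sends an empty JSON POST to the URL in $1 and prints the HTTP status code.
# curl prints 000 when no HTTP response was received at all.
probe_endpoint() {
  curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' \
    --request POST --header 'Content-Type: application/json' --data '{}' "$1"
}
```

A call such as `is_2xx "$(probe_endpoint "$url")" || echo "endpoint not returning 2xx"` separates connectivity problems (000) from application-level rejections (4xx/5xx).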

Domain Verification Stuck

  1. Verify DNS records are correctly configured
  2. Check that the domain verification loop is running (worker logs)
  3. Manually check SES identity status:
    bash
    aws ses get-identity-verification-attributes --identities mail.acme.com
    aws ses get-identity-dkim-attributes --identities mail.acme.com
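To reduce the two responses above to a yes/no answer, the status fields can be extracted with `--query`. A minimal sketch with hypothetical helper names; it treats the domain as verified only when both the identity and DKIM statuses are `Success`.

```bash
# Prints the SES identity verification status for the domain in $1
# (for example Pending, Success, or Failed).
identity_status() {
  aws ses get-identity-verification-attributes --identities "$1" \
    --query 'VerificationAttributes.*.VerificationStatus' --output text
}

# Prints the DKIM verification status for the domain in $1.
dkim_status() {
  aws ses get-identity-dkim-attributes --identities "$1" \
    --query 'DkimAttributes.*.DkimVerificationStatus' --output text
}

# Succeeds only when both checks report Success.
domain_verified() {
  [[ $(identity_status "$1") == Success && $(dkim_status "$1") == Success ]]
}
```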

Emergency Procedures

Pause Inbound Processing

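Stop the worker from consuming the inbound SQS queue (for example, by scaling it to zero). Messages remain in the queue until the worker resumes, subject to the queue's message retention period. Raising the queue's visibility timeout alone (maximum 43200 seconds / 12 hours) does not pause processing: the timeout only hides a message after a consumer has received it, so the worker would still receive messages that are not yet in flight.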

Pause Outbound Delivery

Same approach with the outbound SQS queue.

Disable a Webhook Endpoint

sql
UPDATE webhook_endpoints SET failure_count = 999 WHERE id = 'endpoint-uuid';

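This pushes failure_count past the disable threshold of 10, so the endpoint is treated as disabled. Note that it overwrites the endpoint's real failure count.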
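In an emergency it is easy to paste the wrong value into the statement above. The sketch below wraps it with a UUID check before touching the database; the function names are hypothetical, and it assumes endpoint IDs are canonical UUIDs as the placeholder suggests.

```bash
# Succeeds if $1 looks like a canonical UUID (8-4-4-4-12 hex digits).
is_uuid() {
  local re='^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
  [[ $1 =~ $re ]]
}

# Disables one endpoint; refuses to run on anything that is not a UUID.
# The id is passed as a psql variable and quoted with :'id'. psql does not
# interpolate variables in -c strings, so the statement is sent on stdin.
disable_webhook_endpoint() {
  is_uuid "$1" || { echo "not a UUID: $1" >&2; return 2; }
  psql "$DATABASE_URL" --set=id="$1" --set=ON_ERROR_STOP=1 <<'SQL'
UPDATE webhook_endpoints SET failure_count = 999 WHERE id = :'id';
SQL
}
```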