Tailscale-Down Contingency Runbook

Overview

Tailscale is Tractorbeam's corporate network overlay providing secure access to internal services, DNS filtering (via NextDNS), and exit node egress for Okta network zone trust. A Tailscale outage affects multiple systems simultaneously.

Use this runbook when:

  • Tailscale coordination server is unreachable (status.tailscale.com shows incident)
  • Exit nodes are unhealthy or unreachable
  • Devices cannot establish or maintain Tailscale connections
  • DNS resolution failures due to NextDNS/Tailscale DNS integration

Impact Assessment

Direct Dependencies

| Service | Dependency | Impact if Tailscale Down |
|---|---|---|
| Okta network zones | Exit node EIPs for trusted network | Users outside office may lose SSO access or get step-up MFA prompts |
| NextDNS | Tailscale DNS override routes all queries through NextDNS | DNS resolution may fall back to local resolvers (no filtering) |
| Fleet MDM server | Runs in shared-services VPC, accessible via Tailscale | Cannot access Fleet admin UI or push MDM commands |
| Tailscale SSH | Exit nodes use Tailscale SSH | No SSH access to exit node instances |
| Internal services | Any service exposed only via Tailscale MagicDNS | Inaccessible until Tailscale recovers |

Indirect Dependencies

  • CI/CD: GitHub Actions workflows that interact with the shared-services account are not affected (they use OIDC, not Tailscale)
  • AWS console/CLI: Unaffected — uses Okta SSO via IAM Identity Center, independent of Tailscale
  • Google Workspace: Unaffected — direct Okta SAML, no Tailscale dependency
  • Cloudflare DNS: Unaffected — public DNS, no Tailscale dependency

Detection

Symptoms

  • Multiple team members report "Unable to connect" or "Tailscale is stopped" on their devices
  • Okta sign-ins from corporate devices start requiring step-up MFA (traffic no longer egresses via the trusted exit node EIPs)
  • DNS queries fail or resolve differently than expected (NextDNS filtering lost)
  • Fleet dashboard shows devices as offline

Monitoring

```bash
# Check exit node ASG health
AWS_PROFILE=shared-services aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names tractorbeam-exit-node \
  --query 'AutoScalingGroups[0].Instances[*].[InstanceId,HealthStatus,LifecycleState]' \
  --output table
```
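A quick client-side check complements the ASG view. This is a sketch assuming the tailscale CLI and jq are installed on the device running it:

```bash
# Local client state: "Running" is healthy; "Stopped" or "NeedsLogin" indicate problems
tailscale status --json | jq -r '.BackendState'

# Reachability of the Tailscale coordination server over HTTPS
curl -sS -o /dev/null -w '%{http_code}\n' https://controlplane.tailscale.com
```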

Immediate Response

1. Determine Scope

Is this a Tailscale-side outage or a local infrastructure issue?

  • Tailscale coordination server down: Check https://status.tailscale.com. If Tailscale reports an incident, existing connections may persist (WireGuard is peer-to-peer) but new connections and key renewals will fail.
  • Exit node instances down: Check ASG health (above). If instances are unhealthy, the ASG will auto-replace them. If both exit nodes are down simultaneously, Okta network zone trust breaks.
  • DNS only: If Tailscale works but DNS doesn't, the issue may be NextDNS-specific. Check https://my.nextdns.io for the Tractorbeam profile status.
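The scope checks above can be scripted as a rough triage. A sketch, assuming the tailscale CLI and dig are available (tractorbeam.ai is used here only as a probe domain):

```bash
# 1. Is the local Tailscale client connected?
tailscale status >/dev/null 2>&1 && echo "tailscale: OK" || echo "tailscale: DOWN"

# 2. Does public DNS work, bypassing the Tailscale/NextDNS override?
dig +short +time=2 tractorbeam.ai @1.1.1.1 >/dev/null && echo "public DNS: OK" || echo "public DNS: DOWN"

# 3. Does the system resolver work? (exercises NextDNS when Tailscale is up)
dig +short +time=2 tractorbeam.ai >/dev/null && echo "system DNS: OK" || echo "system DNS: DOWN"
```

If 1 fails but 2 succeeds, suspect Tailscale; if only 3 fails, suspect the NextDNS/DNS path.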

2. Communicate

Notify the team in Slack #engineering:

Tailscale is currently experiencing issues. [Brief description]. We're investigating. If you're having trouble accessing internal services or getting unexpected MFA prompts, this is the cause. Stand by for updates.

3. Preserve Existing Connections

If Tailscale coordination is down but WireGuard tunnels are still up:

  • Do NOT restart Tailscale on devices that are currently connected — existing tunnels survive coordination outages
  • Do NOT cycle exit node instances unless they are actually unhealthy
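To confirm an existing tunnel still carries traffic without restarting the client, a direct peer ping can be used. tailscale ping tests WireGuard connectivity peer-to-peer, so it can succeed even while the coordination server is unreachable; the peer name below is a placeholder:

```bash
# Verify an established tunnel to a peer is still passing traffic.
# "exit-node-a" is an illustrative peer name, not a real hostname.
tailscale ping exit-node-a
```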

Workarounds During Outage

Okta Access (Step-Up MFA)

If exit node EIPs are unreachable and Okta requires step-up MFA:

  • Users authenticate normally but will get additional MFA prompts (not blocked, just more friction)
  • If Okta network zone policies deny access entirely from untrusted networks, temporarily relax the policy:
    1. Log into Okta admin console (https://tractorbeam-admin.okta.com)
    2. Security > Authentication Policies
    3. Temporarily modify the policy to allow access from any network with MFA
    4. Revert immediately when Tailscale recovers

DNS Resolution

If NextDNS is unreachable via Tailscale DNS override:

  • Devices will not automatically fall back to local DNS when override_local_dns is enabled
  • To restore DNS on a device temporarily:
```bash
# macOS: temporarily override DNS
sudo networksetup -setdnsservers Wi-Fi 1.1.1.1 8.8.8.8
```
  • This bypasses NextDNS filtering — acceptable for an outage but revert when resolved
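To confirm the override took effect and resolution works again (tractorbeam.ai is used only as a probe domain):

```bash
# Verify the interface now uses the manual servers
networksetup -getdnsservers Wi-Fi

# Confirm resolution works through the new servers
dig +short tractorbeam.ai @1.1.1.1
```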

Fleet MDM

Fleet runs in the shared-services VPC behind an ALB. During a Tailscale outage:

  • Enrolled devices continue to check in via the public Fleet endpoint (fleet.tractorbeam.ai)
  • Admin UI access requires Tailscale or direct VPC access
  • For urgent MDM actions, use AWS SSM Session Manager to access the Fleet ECS task directly:
```bash
AWS_PROFILE=shared-services aws ecs execute-command \
  --cluster tractorbeam-fleet-cluster \
  --task <task-id> \
  --container fleet \
  --interactive \
  --command "/bin/sh"
```
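The <task-id> can be looked up first. A sketch, assuming the cluster name above:

```bash
# List running tasks in the Fleet cluster to find the task ID
AWS_PROFILE=shared-services aws ecs list-tasks \
  --cluster tractorbeam-fleet-cluster \
  --desired-status RUNNING \
  --output text
```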

AWS Infrastructure Access

AWS console and CLI access via Okta SSO is unaffected by Tailscale outages. If Okta itself is also down, see the Break-Glass Emergency Access Runbook.

Recovery

When Tailscale Recovers

  1. Verify coordination server connectivity:

    ```bash
    tailscale status
    ```
  2. Check exit node health:

    ```bash
    AWS_PROFILE=shared-services aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names tractorbeam-exit-node \
      --query 'AutoScalingGroups[0].Instances[*].[InstanceId,HealthStatus]' \
      --output table
    ```
  3. Verify DNS filtering is active: Visit https://my.nextdns.io and check the Analytics tab for recent queries from team devices.

  4. Revert any workarounds:

    • Restore Okta authentication policies to original state
    • Remove manual DNS overrides on devices (sudo networksetup -setdnsservers Wi-Fi empty)
    • Confirm Tailscale DNS override is routing through NextDNS
  5. Notify team:

    Tailscale has recovered. All services are back to normal. If you made any local DNS changes during the outage, please restart Tailscale to restore NextDNS filtering.

If Exit Nodes Need Replacement

The ASG handles this automatically, but if manual intervention is needed:

```bash
# Force instance refresh
AWS_PROFILE=shared-services aws autoscaling start-instance-refresh \
  --auto-scaling-group-name tractorbeam-exit-node \
  --preferences MinHealthyPercentage=0
```

New instances will bootstrap with Tailscale via cloud-init, self-associate their EIPs, and register as exit nodes automatically.
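Refresh progress and exit node registration can be verified afterward. A sketch; the second command assumes a recent Tailscale client on any connected device:

```bash
# Monitor the refresh until Status is "Successful"
AWS_PROFILE=shared-services aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name tractorbeam-exit-node \
  --query 'InstanceRefreshes[0].[Status,PercentageComplete]' \
  --output table

# Confirm the replacement instances are advertising as exit nodes on the tailnet
tailscale exit-node list
```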

Post-Incident

  1. Document the incident: timeline, impact, actions taken
  2. If Tailscale-side outage, no action needed — monitor their post-mortem
  3. If infrastructure-side, investigate root cause and consider:
    • Additional exit node capacity (currently 2 across AZs)
    • Health check improvements
    • Alerting gaps (did we detect this quickly enough?)
  4. Update this runbook with any lessons learned