Tailscale-Down Contingency Runbook

Overview

Tailscale is Tractorbeam's corporate network overlay providing secure access to internal services, DNS filtering (via NextDNS), and exit node egress for Okta network zone trust. A Tailscale outage affects multiple systems simultaneously.

Use this runbook when:

  • Tailscale coordination server is unreachable (status.tailscale.com shows incident)
  • Exit nodes are unhealthy or unreachable
  • Devices cannot establish or maintain Tailscale connections
  • DNS resolution failures due to NextDNS/Tailscale DNS integration

Impact Assessment

Direct Dependencies

| Service | Dependency | Impact if Tailscale Down |
|---|---|---|
| Okta network zones | Exit node EIPs for trusted network | Users outside office may lose SSO access or get step-up MFA prompts |
| NextDNS | Tailscale DNS override routes all queries through NextDNS | DNS resolution may fall back to local resolvers (no filtering) |
| Fleet MDM server | Runs in shared-services VPC, accessible via Tailscale | Cannot access Fleet admin UI or push MDM commands |
| Tailscale SSH | Exit nodes use Tailscale SSH | No SSH access to exit node instances |
| Internal services | Any service exposed only via Tailscale MagicDNS | Inaccessible until Tailscale recovers |

Indirect Dependencies

  • CI/CD: GitHub Actions workflows that interact with the shared-services account are not affected (they use OIDC, not Tailscale)
  • AWS console/CLI: Unaffected — uses Okta SSO via IAM Identity Center, independent of Tailscale
  • Google Workspace: Unaffected — direct Okta SAML, no Tailscale dependency
  • Cloudflare DNS: Unaffected — public DNS, no Tailscale dependency

Detection

Symptoms

  • Multiple team members report "Unable to connect" or "Tailscale is stopped" on their devices
  • Okta sign-ins from corporate devices start requiring step-up MFA (traffic no longer egresses via the trusted exit node EIPs)
  • DNS queries fail or resolve differently than expected (NextDNS filtering lost)
  • Fleet dashboard shows devices as offline

Monitoring

```bash
# Check exit node ASG health
AWS_PROFILE=shared-services aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names tractorbeam-exit-node \
  --query 'AutoScalingGroups[0].Instances[*].[InstanceId,HealthStatus,LifecycleState]' \
  --output table
```
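A quick client-side check complements the ASG view. This is a sketch assuming the tailscale CLI and jq are installed on the device running it:

```bash
# Local client state: "Running" is healthy; "Stopped" or "NeedsLogin" indicate problems
tailscale status --json | jq -r '.BackendState'

# Reachability of the Tailscale coordination server over HTTPS
curl -sS -o /dev/null -w '%{http_code}\n' https://controlplane.tailscale.com
```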

Immediate Response

1. Determine Scope

Is this a Tailscale-side outage or a local infrastructure issue?

  • Tailscale coordination server down: Check https://status.tailscale.com. If Tailscale reports an incident, existing connections may persist (WireGuard is peer-to-peer) but new connections and key renewals will fail.
  • Exit node instances down: Check ASG health (above). If instances are unhealthy, the ASG will auto-replace them. If both exit nodes are down simultaneously, Okta network zone trust breaks.
  • DNS only: If Tailscale works but DNS doesn't, the issue may be NextDNS-specific. Check https://my.nextdns.io for the Tractorbeam profile status.
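The scope checks above can be scripted as a rough triage. A sketch, assuming the tailscale CLI and dig are available (tractorbeam.ai is used here only as a probe domain):

```bash
# 1. Is the local Tailscale client connected?
tailscale status >/dev/null 2>&1 && echo "tailscale: OK" || echo "tailscale: DOWN"

# 2. Does public DNS work, bypassing the Tailscale/NextDNS override?
dig +short +time=2 tractorbeam.ai @1.1.1.1 >/dev/null && echo "public DNS: OK" || echo "public DNS: DOWN"

# 3. Does the system resolver work? (exercises NextDNS when Tailscale is up)
dig +short +time=2 tractorbeam.ai >/dev/null && echo "system DNS: OK" || echo "system DNS: DOWN"
```

If 1 fails but 2 succeeds, suspect Tailscale; if only 3 fails, suspect the NextDNS/DNS path.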

2. Communicate

Notify the team in Slack #engineering:

Tailscale is currently experiencing issues. [Brief description]. We're investigating. If you're having trouble accessing internal services or getting unexpected MFA prompts, this is the cause. Stand by for updates.

3. Preserve Existing Connections

If Tailscale coordination is down but WireGuard tunnels are still up:

  • Do NOT restart Tailscale on devices that are currently connected — existing tunnels survive coordination outages
  • Do NOT cycle exit node instances unless they are actually unhealthy
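To confirm an existing tunnel still carries traffic without restarting the client, a direct peer ping can be used. tailscale ping tests WireGuard connectivity peer-to-peer, so it can succeed even while the coordination server is unreachable; the peer name below is a placeholder:

```bash
# Verify an established tunnel to a peer is still passing traffic.
# "exit-node-a" is an illustrative peer name, not a real hostname.
tailscale ping exit-node-a
```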

Workarounds During Outage

Okta Access (Step-Up MFA)

If exit node EIPs are unreachable and Okta requires step-up MFA:

  • Users authenticate normally but will get additional MFA prompts (not blocked, just more friction)
  • If Okta network zone policies deny access entirely from untrusted networks, temporarily relax the policy:
    1. Log into Okta admin console (https://tractorbeam-admin.okta.com)
    2. Security > Authentication Policies
    3. Temporarily modify the policy to allow access from any network with MFA
    4. Revert immediately when Tailscale recovers

DNS Resolution

If NextDNS is unreachable via Tailscale DNS override:

  • Devices will not automatically fall back to local DNS when override_local_dns is enabled
  • To restore DNS on a device temporarily:
```bash
# macOS: temporarily override DNS
sudo networksetup -setdnsservers Wi-Fi 1.1.1.1 8.8.8.8
```
  • This bypasses NextDNS filtering — acceptable for an outage but revert when resolved
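To confirm the override took effect and resolution works again (tractorbeam.ai is used only as a probe domain):

```bash
# Verify the interface now uses the manual servers
networksetup -getdnsservers Wi-Fi

# Confirm resolution works through the new servers
dig +short tractorbeam.ai @1.1.1.1
```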

Fleet MDM

Fleet runs in the shared-services VPC behind an ALB. During a Tailscale outage:

  • Enrolled devices continue to check in via the public Fleet endpoint (fleet.tractorbeam.ai)
  • Admin UI access requires Tailscale or direct VPC access
  • For urgent MDM actions, use AWS SSM Session Manager to access the Fleet ECS task directly:
```bash
AWS_PROFILE=shared-services aws ecs execute-command \
  --cluster tractorbeam-fleet-cluster \
  --task <task-id> \
  --container fleet \
  --interactive \
  --command "/bin/sh"
```
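The <task-id> can be looked up first. A sketch, assuming the cluster name above:

```bash
# List running tasks in the Fleet cluster to find the task ID
AWS_PROFILE=shared-services aws ecs list-tasks \
  --cluster tractorbeam-fleet-cluster \
  --desired-status RUNNING \
  --output text
```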

AWS Infrastructure Access

AWS console and CLI access via Okta SSO is unaffected by Tailscale outages. If Okta itself is also down, see the Break-Glass Emergency Access Runbook.

Recovery

When Tailscale Recovers

  1. Verify coordination server connectivity:

    ```bash
    tailscale status
    ```
  2. Check exit node health:

    ```bash
    AWS_PROFILE=shared-services aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names tractorbeam-exit-node \
      --query 'AutoScalingGroups[0].Instances[*].[InstanceId,HealthStatus]' \
      --output table
    ```
  3. Verify DNS filtering is active: Visit https://my.nextdns.io and check the Analytics tab for recent queries from team devices.

  4. Revert any workarounds:

    • Restore Okta authentication policies to original state
    • Remove manual DNS overrides on devices (sudo networksetup -setdnsservers Wi-Fi empty)
    • Confirm Tailscale DNS override is routing through NextDNS
  5. Notify team:

    Tailscale has recovered. All services are back to normal. If you made any local DNS changes during the outage, please restart Tailscale to restore NextDNS filtering.

If Exit Nodes Need Replacement

The ASG handles this automatically, but if manual intervention is needed:

```bash
# Force instance refresh
AWS_PROFILE=shared-services aws autoscaling start-instance-refresh \
  --auto-scaling-group-name tractorbeam-exit-node \
  --preferences MinHealthyPercentage=0
```

New instances will bootstrap with Tailscale via cloud-init, self-associate their EIPs, and register as exit nodes automatically.
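Refresh progress and exit node registration can be verified afterward. A sketch; the second command assumes a recent Tailscale client on any connected device:

```bash
# Monitor the refresh until Status is "Successful"
AWS_PROFILE=shared-services aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name tractorbeam-exit-node \
  --query 'InstanceRefreshes[0].[Status,PercentageComplete]' \
  --output table

# Confirm the replacement instances are advertising as exit nodes on the tailnet
tailscale exit-node list
```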

Post-Incident

  1. Document the incident: timeline, impact, actions taken
  2. If Tailscale-side outage, no action needed — monitor their post-mortem
  3. If infrastructure-side, investigate root cause and consider:
    • Additional exit node capacity (currently 2 across AZs)
    • Health check improvements
    • Alerting gaps (did we detect this quickly enough?)
  4. Update this runbook with any lessons learned