Overview

At 12:02pm ET, all microcosm services on the cozy network (phil's place) went offline. Constellation was brought back online via a cellular network connection (phone tethering) by 12:29pm, for 27 minutes of downtime.

The UFOs API, the Spacedust interactions firehose, and the notifications demo and its auth proxy were intentionally left offline until the primary network could be fully restored, which happened at 2:24pm ET on Aug 19, 26h22m after the disruption began.

Microcosm services outside the cozy network were not affected: Spacedust record and identity cache, Montreal relay, Montreal jetstream, France relay, and Germany jetstream all remained online for the duration.

Internally: the victoriametrics server lives in the cozy network, so it was unable to scrape metrics from Spacedust or the firehoses (all outside the cozy network) for the full duration. It did keep scraping some (pretty useless) metrics from other cozy network services. The gateway ingress proxy hosted on DigitalOcean remained healthy itself, but could only return HTTP 502s since it couldn't reach the cozy network.

Initial response

  • debugged home internet

    • confirmed it was down for multiple devices

    • found that the modem was showing an error for lack of fibre signal

    • could not find any indication of a wider ISP outage

    • restarted modem (no improvement)

  • switched constellation to backup cellular network

    • constellation reconnection was delayed by 12 additional minutes because the pi continued trying to route connections via ethernet.

      • went down a wrong path for a bit: the raspberry pi's wifi had caused sporadic issues after the last ISP downtime, and i was worried the manual wifi-disabling steps i took back then might be interfering with the connection. that didn't actually make sense, since Constellation has been freshly reprovisioned on new hardware with a fresh OS install since that incident.

    • simply unplugging the physical network cable from the pi resolved the issue, and got it to use the tethered connection 🤦‍♀️.

  • did a physical inspection of all accessible fibre cabling from the modem out to the street.

  • scheduled a technician visit from my ISP at the earliest available time (the next day).

  • decided to only connect constellation to the backup network

    • other services get significantly less public API traffic

    • to conserve cellular bandwidth: only one atproto firehose connection needs to be tethered

    • expected the primary network to be restored within the upstream replay window

Root cause

  • Somebody cut the fibre optic cable that runs to my house at the neighbourhood junction box.

Recovery

  • While the cellular tether kept the API accessible, it turned out not to keep the jetstream connection alive.

    • Constellation runs older jetstream consumer code than the other microcosm services; it can fail to detect some kinds of connection loss, which prevents it from auto-reconnecting.

    • Constellation's consumer is a little slow when doing jetstream replay, extending the time to full recovery after its connection was reactivated.

  • Spacedust currently only live-tails the firehose, so there was nothing to replay on recovery (downstream notifications can be lost; this is expected, it's just a demo).

  • Once reconnected, UFOs' recent database improvements allowed it to max out its jetstream replay to recover as quickly as literally possible, with zero stalls!

Extended Impact

  • Full connectivity was restored for all firehose consumers within upstream replay windows, so no data was lost.

  • The who-am-i auth proxy for the notifications demo failed to boot properly after its USB power was pulled while debugging ISP connectivity. Rather than fixing it, the notifications demo will be migrated to handle its own atproto oauth directly.

Actions

  • Write a runbook for moving Constellation to a tethered cellular connection to avoid additional debugging delays.

  • Update constellation's jetstream consumer to consider a connection dead if no new messages are received within a timeout (see the sketch after this list).

  • Execute on the failover planning already in progress: add DNS host monitoring, a second gateway, and a second Constellation instance in a different physical location, to eliminate as many single points of failure as possible.

  • Get a UPS battery backup system.
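
For illustration, here's a minimal sketch of that timeout-based liveness check. It assumes a tokio + tokio-tungstenite + serde_json consumer, a public jetstream /subscribe endpoint URL, and a 30-second idle timeout; these are placeholder choices, not Constellation's actual consumer code, and real event processing is omitted.

```rust
use std::time::Duration;

use futures_util::StreamExt;
use tokio::time::timeout;
use tokio_tungstenite::connect_async;
use tokio_tungstenite::tungstenite::Message;

/// How long the firehose can stay silent before the link is assumed dead.
const IDLE_TIMEOUT: Duration = Duration::from_secs(30);

#[tokio::main]
async fn main() {
    // cursor (unix microseconds) of the last event we processed
    let mut cursor: Option<u64> = None;

    loop {
        // resume from the last seen cursor when reconnecting
        let url = match cursor {
            Some(c) => format!("wss://jetstream1.us-east.bsky.network/subscribe?cursor={c}"),
            None => "wss://jetstream1.us-east.bsky.network/subscribe".to_string(),
        };

        let (mut ws, _resp) = match connect_async(url.as_str()).await {
            Ok(conn) => conn,
            Err(err) => {
                eprintln!("connect failed: {err}; retrying in 5s");
                tokio::time::sleep(Duration::from_secs(5)).await;
                continue;
            }
        };

        loop {
            match timeout(IDLE_TIMEOUT, ws.next()).await {
                // a text frame: hand it to the indexer and advance the cursor
                Ok(Some(Ok(Message::Text(text)))) => {
                    if let Ok(event) = serde_json::from_str::<serde_json::Value>(&text) {
                        if let Some(t) = event.get("time_us").and_then(|t| t.as_u64()) {
                            cursor = Some(t);
                        }
                        // real event processing would happen here
                    }
                }
                // pings and other frame types: nothing to do
                Ok(Some(Ok(_))) => {}
                // explicit errors and clean closes already trigger a reconnect
                Ok(Some(Err(err))) => {
                    eprintln!("websocket error: {err}; reconnecting");
                    break;
                }
                Ok(None) => {
                    eprintln!("stream closed; reconnecting");
                    break;
                }
                // the new case: total silence, which a dead route can produce
                Err(_) => {
                    eprintln!("no frames for {IDLE_TIMEOUT:?}; treating connection as dead");
                    break;
                }
            }
        }
    }
}
```

The important arm is the timeout one: a silently dropped route can produce no websocket error or close frame at all, so "no frames for longer than the firehose could plausibly be quiet" is the only reliable signal that the connection is gone.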

Future planning

  • Build a local jetstream fan-out service. Only constellation was allowed on the cellular backup because tethering all services would have tripled the bandwidth. With local fan-out, enabling each additional service would only add its API traffic to the total bandwidth (see the sketch after this list).

  • Improve constellation's write path for faster replay recovery in the event of downtime (and more headroom for firehose event volume growth).
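
To make the fan-out idea concrete, here's a rough sketch (not an existing microcosm service): a single upstream jetstream connection feeding a tokio broadcast channel, with a local websocket listener that cozy-network services could consume instead of each opening its own tethered connection. The upstream URL, local port 6008, and buffer size are placeholders, and per-consumer cursors / collection filtering are left out.

```rust
use std::time::Duration;

use futures_util::{SinkExt, StreamExt};
use tokio::net::TcpListener;
use tokio::sync::broadcast;
use tokio_tungstenite::tungstenite::Message;
use tokio_tungstenite::{accept_async, connect_async};

#[tokio::main]
async fn main() {
    // every frame from the single upstream connection goes into this channel
    let (tx, _) = broadcast::channel::<Message>(8192);

    // upstream task: one websocket to the remote jetstream, reconnect on failure
    let upstream_tx = tx.clone();
    tokio::spawn(async move {
        loop {
            match connect_async("wss://jetstream1.us-east.bsky.network/subscribe").await {
                Ok((mut ws, _)) => {
                    while let Some(Ok(frame)) = ws.next().await {
                        // send errors just mean no local consumers are connected yet
                        let _ = upstream_tx.send(frame);
                    }
                }
                Err(err) => eprintln!("upstream connect failed: {err}"),
            }
            tokio::time::sleep(Duration::from_secs(5)).await;
        }
    });

    // local listener: services on the cozy network connect here instead of tethering
    let listener = TcpListener::bind("0.0.0.0:6008").await.expect("bind failed");
    loop {
        let Ok((stream, addr)) = listener.accept().await else { continue };
        let mut rx = tx.subscribe();
        tokio::spawn(async move {
            let Ok(mut ws) = accept_async(stream).await else { return };
            eprintln!("local consumer connected: {addr}");
            loop {
                match rx.recv().await {
                    Ok(frame) => {
                        if ws.send(frame).await.is_err() {
                            break; // consumer disconnected
                        }
                    }
                    // a slow consumer misses frames rather than stalling everyone else
                    Err(broadcast::error::RecvError::Lagged(n)) => {
                        eprintln!("consumer {addr} lagged, dropped {n} frames");
                    }
                    Err(broadcast::error::RecvError::Closed) => break,
                }
            }
        });
    }
}
```

With something like this, every local consumer shares the one tethered firehose connection, so bringing another service onto the backup network only adds that service's own API traffic to the cellular bill.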

Wins

  • UFOs' database improvements really proved themselves! (see Recovery above).

  • The ISP technician left me a long fibre optic patch cable, which made it possible for me to relocate the modem next to the rest of the physical cozy network!


Appendix

Constellation failover: prior planning

The risk of single-point-of-failure outages for Constellation has been clear for a while! Earlier this month I started planning for some redundancy:

I'll document more about this plan as I'm able to move on it, but the short version is: a full second stack of everything. The second public gateway instance will be from a separate VPS provider, the second physical server will be in a different location, and so on. The main remaining single points of failure will be the DNS provider and the ISP (unfortunately both homelabs use the same ISP). ISP outages spanning multiple locations are rare, but one did already happen this year.