This is the first part in (hopefully) a series about how I'm building Hubble. In case you missed that announcement:
Hubble will be rolling-its-own atproto sync (spoiler!), and I wanted to write down why. The available sync tools are really excellent. Why do a lot of extra work??
This post is about creating atproto backend app servers which need:
- Backfill (you care about past data)
- Completeness (you can't miss any data)
It's actually great if your service doesn't need these: subscribe to jetstream and be on your way!
You also might not need them yet. Microcosm constellation is used by over 100 different atproto apps daily, and we're just finally starting to integrate sync. Jetstream can get you a long way!
Sync is low-level! Some atproto server-in-a-box projects like quickslice and HappyView do a lot more for you: they can be your entire backend data service. (Someone should write about where they do and don't shine so I can link it here!)
Sync
Feel free to skip ahead if you're already up on atproto.
Old-school centralized apps get to throw all their user-created data into a big central database. This works great for them! Since everything is in one place, you can answer broad questions across all users.
The database is inside-out in atproto: instead of one big shared database, every single user gets their own, individual, mini-database. Your mini-db, or "repo", can live anywhere on the internet, and your own data updates go to your own repo.
Backend app services aggregate allllll the repo data updates back into an old-school-style central database, so they can query the data they care about from one place again.
Sync is the mechanism for a backend app service to reliably fetch from and keep up-to-date with everyone's repos. It doesn't tell you what to do with the data, but it helps you get it, accurately.
Resync
The Sync spec says how to verify that a series of updates from one repo is perfect and accurate. But networks are unreliable, PDSes can be buggy, changes don't always fit in the firehose, migrations happen… what do you do when the sync proof fails? How do you resynchronize?
I think handling resync is the main benefit of sync utilities, and the core of what you need to solve if rolling your own. Even backfill is just some resyncs if you squint at it.
Resync in Tap
Let's illustrate the problem of resync, from Tap's perspective:

1. Tap told the downstream app about a repo:

   1. Create record `A` with content `lorem`
   2. Create record `B` with `ipsum`
   3. Delete record `A`
   4. Create record `C` with `dolor`

   So the current state of the mini-database is:

   ```
   B => "ipsum"
   C => "dolor"
   ```

2. Tap became desynchronized! Oh no! It re-fetches the whole repo, finding that it now contains:

   ```
   A => "i'm back"
   C => "dolor"
   D => "amet"
   ```

   This is different than what Tap already told the downstream app about! Tap needs some way to

3. Reconcile the difference for the downstream app, like:

   3. Create record `A` with content `i'm back`
   4. Delete record `B`
   5. Create record `D` with `amet`

   to get it back in sync.
Tap does this by maintaining a complete table of record contents, covering every record in every collection it's tracking. If you're asking it to sync all 20-odd billion records in the whole network, this will be a 20-billion-and-growing-row table.
From this table, Tap can compare any full repo of records against what it has already told the downstream app, and generate those reconciling updates.
Other sync utilities work similarly, and if you roll your own, you have to figure out that reconciliation process too.
Should you roll your own sync?
Do you wanna? Yes? Then yeah! Stop reading and go do it!
For everyone else still here, let's break it down to three parts:
1. Scale of data
2. Data format (resync-friendliness)
3. Other considerations: overhead, backup strategy, multi-relay
1. Scale of data
There are only two scales of atproto:
- Big: you need Bluesky data, or
- Small: you don't.
If you're small → probably use a sync utility! Rolling your own is unlikely to offer much advantage.
If big → you might find some advantage to rolling your own, but it's not necessarily required.
2. Data format (resync!)
Is your app's own data format resync-friendly? [see resync above]
If it's not easy to directly reconcile your app's data → probably use a sync utility. The utilities do the reconciliation, so you only need to handle a nice clean stream of create/update/delete events.
3. Other considerations
Storage overhead
Rolling your own sync might save disk space. Tap's SQL storage is uncompressed, and can use significant space and i/o.
Hydrant and Ramjet both use a storage engine with built-in compression support, but both also typically store full record contents. On the other hand, both offer direct querying of those stored records, so whether you end up using more space depends on how your app uses them: keeping redundant copies, or querying the utility directly.
For small-scale data this doesn't matter much.
Backfill speed
Currently, Tap is one of the slower options for full-network backfill, taking multiple days to complete. It will likely get better! You can probably get under 24h rolling your own (the lower limit is 14 hours), if your service can ingest content quickly. Hydrant can likely complete a full-network backfill in under 24 hours.
Again, for small-scale data this doesn't matter so much.
Data backup / disaster recovery
Using a sync utility puts state in two places. Two data-loss scenarios worth considering:
1. Your own app's database is lost
2. The sync utility's database is lost

In either case, can you safely restore from a backup, and how?
Tap first:
1. To safely restore your app's database from a backup, you also have to revert Tap's database to a snapshot from the same time or earlier. This might be tricky!
2. Restoring Tap's database from backup is probably just fine! Tap offers at-least-once event delivery semantics, so rolling Tap back should just cause some events to be re-emitted, which is something your app needs to handle anyway.
Hydrant:
1. Restoring your app's database from backup is just fine with Hydrant! Your app will just replay from the earlier cursor and catch up.
2. To safely restore Hydrant's database from backup, you also have to rewind your app's Hydrant cursor to at or before the last cursor in the Hydrant snapshot. Rewinding a Hydrant cursor is fine provided your app handles at-least-once delivery semantics, like with Tap.
Ramjet:
I'm less familiar with it, but if you're able to enumerate the records your app knows about, Ramjet has set-reconciliation APIs for both scenarios. If not, I suspect it's similar to Hydrant.
Roll-your-own:
Unless you store state in multiple databases, restoring your app data from backup is probably just fine!
Note:
If you can atomically snapshot your app's own data along with your sync utility data, you simplify these disaster recovery scenarios. For example you might:

- Run your app and Tap together on the same postgres or sqlite database.
- Use Hydrant as a library, with all state kept together in its fjall database.
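The shared-database idea can be illustrated with plain sqlite: keep your app tables and your sync cursor in one file, update both in one transaction, and any snapshot of that file is a consistent (data, cursor) pair. The schema here is made up for illustration; the backup call is Python's standard `sqlite3` online-backup API.

```python
import sqlite3

# Hypothetical sketch: app data and the sync cursor in ONE sqlite
# database, so a single snapshot captures both consistently.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE posts (uri TEXT PRIMARY KEY, text TEXT);
    CREATE TABLE sync_state (id INTEGER PRIMARY KEY CHECK (id = 1), cursor INTEGER);
    INSERT INTO sync_state VALUES (1, 0);
""")

def apply_event(cursor: int, uri: str, text: str):
    # One transaction updates app data and the cursor together, so any
    # snapshot of this database sees a consistent (data, cursor) pair.
    with db:
        db.execute("INSERT OR REPLACE INTO posts VALUES (?, ?)", (uri, text))
        db.execute("UPDATE sync_state SET cursor = ? WHERE id = 1", (cursor,))

apply_event(41, "at://did:plc:abc/app.bsky.feed.post/1", "hello")

# Take an atomic snapshot of everything via sqlite's online backup API.
snapshot = sqlite3.connect(":memory:")
db.backup(snapshot)
```

Restoring that snapshot restores app data and cursor together, so there's nothing to line up by hand.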
How bad is rolling-your-own?
It's a significant chunk of work! Full-network backfill is a big resource-management problem, firehose validation pulls in cryptography, you've got atproto identity resolution, binary file format handling, and on and on. Rolling your own might mean a lot less time for your actual app, if it's not core to what you're trying to do.
But it is doable! If you enjoy reading specifications, it can even be fun.
Some examples
Whew! Applying everything to some microcosm projects:
For Hubble: roll-your-own
1. Scale of data: big (full-network, including Bluesky)
2. Data format: resync-friendly!
3. Other considerations: space efficiency is critical! A strong disaster recovery story is important.
Lightrail: roll-your-own
1. Scale of data: big
2. Data format: resync-friendly! (and in fact we store way less data than any of the utilities would, about 20GB for the whole network)
3. Other considerations: space efficiency is a primary driver.
Hmm, microcosm projects are all full-network and biased toward roll-your-own. One nice non-microcosm project is
Superconnectors: Utility, probably!
1. Scale of data: small! (just the youandme.at connections)
Rolling your own would be overkill. In fact, Business Goose actually went a step up from sync utilities and used quickslice for this, with no custom backend service needed at all. Cool!
Wrapping up
There's a wide range of apps in between specialized small explorations and full-network indexes, and the landscape of sync tooling is changing quickly. Library-style sync utilities might soon make rolling-just-the-parts-you-need possible.
Sync itself isn't going to change quickly though, so hopefully the framework here will continue to make sense at least for a while.
Thanks for reading! Next time will be either about the STAR-lite repo archive format, or the original Hubble architecture idea, which is pretty cool if I may say so (but won't be what ships in Hubble v1).
Appendix: sync utilities
My high-level take on the current options, probably too brief to do any of them justice:
Tap
Bluesky's reference implementation of atproto sync is Tap. It just does sync: it handles all the tricky distributed-system problems of millions of mini-databases, and gives you a nice clean stream of data you care about.
Tap runs as a separate service, sending data to your app by websocket or webhooks. It stores its state in either sqlite (local file) or a postgres server.
Hydrant
A community-built alternative to Tap, which does more: it has some data storage and querying built-in, but doesn't go as far as the server-in-a-box options like quickslice and HappyView.
Hydrant can run as a separate service or be built into any Rust app as a library. It uses the fjall embedded storage engine (local files).
Hydrant can also run in ephemeral mode, retaining a limited subset of most-recent data. Resync in ephemeral mode re-emits all records from the repository: downstream apps will never miss record creates, but may, in edge cases, miss deletes.
Ramjet
Another independently-created sync tool. Ramjet has limited data querying built-in, and a novel mechanism for checking if your app's saved contents align with its own.
Ramjet runs as a separate service, and keeps state in fjall (local files).