How Streamplace Works: The Indexer

Kappa Architecture means this: stuff happens in the world. What are you gonna do about it?

September 02, 2025

The fundamental principle of decentralization is that user actions are sovereign.

Modern decentralized platforms tend to implement this with cryptographic signing. The base primitives are signed and self-certifying; they're associated with a cryptographic identity that exists outside of any particular company or provider. The magic of signing means that you can't falsify the content of, say, this blog. Change a single letter in this pub.leaflet.document record and all the ATProto systems involved will immediately reject it as invalid. Only I can change the content, because I have my signing keys and can make a new signature.

There's more than one way to make that work. SSB, Nostr, and blockchains include a little user signature right there next to every action. ATProto has you sign the root of a Merkle tree representing the entire universe of things that you've done in the ATProto universe. Regardless, these all pass the test.

(This, by the way, is where ActivityPub fails the decentralization test, provider diversity notwithstanding. Sovereignty for user actions lies in the ActivityPub server, not in the user themselves; there's no cryptographic identity external to the node. If the humans hosting your Mastodon node decide they hate you, they can delete your social graph and you're starting over from scratch with no content. I've yet to see anyone from that community make the argument that their architecture makes it impossible to host cryptocurrencies, but they totally should! It's a great argument for never using signed primitives and keypair-based identities for anything.)

The Old Way

The basic way that they'll teach you to write a web service involves some kind of tiny stateless JSON API server in front of a database. Perhaps you're making a blogging website with a little Node.js server in front of a Postgres database. The web server handles authentication, parsing, and "business logic", and the database takes care of holding all of your data and indexing it for efficient retrieval. When one of your users makes a change to their blog entry, you update the corresponding entry in the database.

Empires have been built on this basic architecture, but it fails the decentralization test. If the database administrator goes in and modifies your blog post to contain a bunch of profanity, there's no external way to prove you didn't do it yourself. Sovereignty lies with the database administrator, not with the user.

On ATProto, as with all decentralized networks, things necessarily work a bit differently. Users publish documents to their PDS, and those PDSes are spread all over the world. Whenever users make a change to their data, such as creating or updating a blog post, that change gets published and pushed out (via com.atproto.sync.subscribeRepos). Relays amalgamate these streams from all PDSes and publish a firehose of data. To implement your blogging platform, you now need to listen for user events on that firehose, ignoring most of the data but building up an index of the blog posts that you care about so you can serve them efficiently on websites like blog.stream.place.
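A rough sketch of that loop in Go, under heavy simplification: the Event type below is hypothetical and much flatter than a real firehose message, and the firehose itself is just a slice. The point is only the shape of the work, which is to drop everything outside your collection and fold the rest into an index.

```go
package main

import "fmt"

// Event is a hypothetical, simplified firehose message.
type Event struct {
	DID        string // repository owner
	Collection string // record type
	RKey       string // record key within the collection
	Action     string // "create", "update", or "delete"
	Record     []byte // encoded record body; empty on delete
}

// BlogIndex maps an at:// URI to the latest body of a record
// in the single collection this index cares about.
type BlogIndex struct {
	collection string
	posts      map[string][]byte
}

func NewBlogIndex(collection string) *BlogIndex {
	return &BlogIndex{collection: collection, posts: map[string][]byte{}}
}

// Apply folds one event into the index and reports whether it was relevant.
func (b *BlogIndex) Apply(ev Event) bool {
	if ev.Collection != b.collection {
		return false // most of the firehose is somebody else's data
	}
	uri := fmt.Sprintf("at://%s/%s/%s", ev.DID, ev.Collection, ev.RKey)
	switch ev.Action {
	case "create", "update":
		b.posts[uri] = ev.Record
	case "delete":
		delete(b.posts, uri)
	}
	return true
}

func main() {
	idx := NewBlogIndex("pub.leaflet.document")
	firehose := []Event{
		{DID: "did:example:alice", Collection: "app.bsky.feed.post", RKey: "1", Action: "create", Record: []byte("hi")},
		{DID: "did:example:alice", Collection: "pub.leaflet.document", RKey: "1", Action: "create", Record: []byte("draft")},
		{DID: "did:example:alice", Collection: "pub.leaflet.document", RKey: "1", Action: "update", Record: []byte("final")},
	}
	for _, ev := range firehose {
		idx.Apply(ev)
	}
	fmt.Println(len(idx.posts), string(idx.posts["at://did:example:alice/pub.leaflet.document/1"]))
}
```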

lol kappa

"Building an index over a constant stream of immutable user actions" is broadly known as Kappa Architecture, and it exists outside of decentralized networks too; this kind of pattern is also extremely common in big corporations using Apache Kafka. It's also how blockchains operate: in the web3 cinematic universe there's a great project called The Graph that lets you build arbitrary indices on Ethereum-compatible blockchains.

It's also my favorite way to build software in general. The magic of it lies in the separation of the user actions from the index. You make extra sure that the user actions are persisted, and then build a rich, sophisticated data model on top of them. When it comes time to run a database migration, adding new fields to your index, you make the changes to your software and then delete your production database. Your software then proceeds to replay all of the user actions from the dawn of time, building up a new index.
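Here is a toy version of that migration in Go. The Action and IndexV2 types are made up for the example, and the "database" is a pair of in-memory maps. What it shows is that a new field (a per-author post count) never needs a backfill script, because the new index is built from scratch by replaying the log.

```go
package main

import "fmt"

// Action is a hypothetical persisted user action: the durable source of truth.
type Action struct {
	Author string
	Title  string
}

// IndexV2 is the "migrated" shape. Count is the newly added field
// that an older index never stored.
type IndexV2 struct {
	ByAuthor map[string][]string
	Count    map[string]int
}

// rebuildV2 starts from an empty index and replays every action from the
// start of the log. No existing rows are migrated in place.
func rebuildV2(log []Action) IndexV2 {
	idx := IndexV2{ByAuthor: map[string][]string{}, Count: map[string]int{}}
	for _, a := range log {
		idx.ByAuthor[a.Author] = append(idx.ByAuthor[a.Author], a.Title)
		idx.Count[a.Author]++
	}
	return idx
}

func main() {
	log := []Action{
		{Author: "alice", Title: "Hello"},
		{Author: "bob", Title: "First stream"},
		{Author: "alice", Title: "The Indexer"},
	}
	// "Delete the production database", then replay from the dawn of time.
	idx := rebuildV2(log)
	fmt.Println(idx.Count["alice"], idx.ByAuthor["alice"])
}
```

The trade-off is rebuild time: replaying is only pleasant while the log is small enough, or the replay fast enough, to finish before anyone notices.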

So that's how Streamplace does it. Mostly. There are, unfortunately, some exceptions. I'll get into it.

Oh, ATProto folks tend to call these indices "AppViews". Which is fine. That's what one of these systems is for, providing a data and query layer for user-facing applications. But what the system is is an index.

State and Statelessness

The primary Streamplace index database uses SQLite (with GORM as a Go ORM around it) and operates essentially as I described above. In fact, we have lots of these databases: every single public Streamplace node is listening to the ATProto firehose and building up its own picture of the universe. That's not the only way to do it (Bluesky apparently has a whole proprietary ScyllaDB data layer for backing their AppView) but it'll work for us for the foreseeable future.

This database primarily deals with storing ATProto records in their original CBOR format, which makes writing code with these things easy and efficient: code everywhere in the app can deal with the same streamplace.Livestream or comatproto.FeedPost structs whether they're coming from the firehose or from our internal database. The only other fields that we add to the SQLite table for each collection are for indices, for example when we want to index an app.bsky.graph.follow record by both follower and followee.

type Livestream struct {
	URI        string    `json:"uri" gorm:"primaryKey;column:uri"`
	CID        string    `json:"cid" gorm:"column:cid"`
	CreatedAt  time.Time `json:"createdAt" gorm:"column:created_at;index:idx_repo_created,priority:2"`
	Record     *[]byte   `json:"livestream"`
	RepoDID    string    `json:"repoDID" gorm:"column:repo_did;index:idx_repo_created,priority:1"`
	Repo       *Repo     `json:"repo,omitempty" gorm:"foreignKey:DID;references:RepoDID"`
	Post       *FeedPost `json:"post,omitempty" gorm:"foreignKey:CID;references:PostCID"`
	PostCID    string    `json:"postCID" gorm:"column:post_cid"`
	PostURI    string    `json:"postURI" gorm:"column:post_uri;index:idx_post_uri"`
}

See, the title of your livestream isn't actually in our database! It's in that Record blob, and we pull it out of the CBOR when we need it.
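As an illustration of "pull it out when we need it", here is a small Go sketch. Two loud assumptions: the blob is encoded as JSON rather than CBOR, purely because the standard library has no CBOR decoder, and the Row type and the "title" field name are placeholders rather than Streamplace's real code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Row mirrors the idea of the table above: a key column plus an opaque blob.
type Row struct {
	URI    string
	Record []byte // CBOR in Streamplace; JSON here as a stdlib stand-in
}

// livestreamRecord lists only the fields this sketch needs from the blob.
// The "title" field name is an assumption for illustration.
type livestreamRecord struct {
	Title string `json:"title"`
}

// Title decodes the blob on demand instead of reading a dedicated column.
func (r Row) Title() (string, error) {
	var rec livestreamRecord
	if err := json.Unmarshal(r.Record, &rec); err != nil {
		return "", fmt.Errorf("decode record %s: %w", r.URI, err)
	}
	return rec.Title, nil
}

func main() {
	row := Row{
		URI:    "at://did:example:alice/place.stream.livestream/1",
		Record: []byte(`{"title":"knocking out some bugs","createdAt":"2025-09-02T00:00:00Z"}`),
	}
	title, err := row.Title()
	if err != nil {
		panic(err)
	}
	fmt.Println(title)
}
```

Fields the index never queries stay inside the blob, so adding a field to the record doesn't require a schema change.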

One of the tricks that lets us do this efficiently is that we don't start caring about users until they interact with Streamplace for the first time. Only once a user logs in to a Streamplace node do we start indexing actions for that user, first by importing the user's entire repository with com.atproto.sync.getRepo and then by following their actions from there with the firehose. Hopefully this will make Streamplace nodes easier to run: unlike operating your own Bluesky AppView, which necessarily builds an index of the entire universe, Streamplace nodes only care about the users that interact with them. This is facilitated by the architecture of ATProto. Instead of there being one single blockchain to index, there are millions of user-specific repositories that can be indexed in parallel.
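The login-gated flow might look something like the following Go sketch. Everything here is hypothetical scaffolding: Backfill is a plain function standing in for a com.atproto.sync.getRepo import, Event is a flattened firehose message, and the index is just a set of keys.

```go
package main

import "fmt"

// Event is a hypothetical, simplified firehose message.
type Event struct {
	DID  string
	Path string // collection/rkey
}

// Backfill returns everything a user has already published. It stands in
// for importing a full repository; it is not a real client call.
type Backfill func(did string) []Event

// Indexer only stores records for users it has been told to track.
type Indexer struct {
	tracked  map[string]bool
	records  map[string]bool // set of "did/path" keys
	backfill Backfill
}

func NewIndexer(b Backfill) *Indexer {
	return &Indexer{tracked: map[string]bool{}, records: map[string]bool{}, backfill: b}
}

func (ix *Indexer) store(ev Event) { ix.records[ev.DID+"/"+ev.Path] = true }

// OnLogin starts tracking a user: import the existing repo once,
// then rely on the firehose for everything after that.
func (ix *Indexer) OnLogin(did string) {
	if ix.tracked[did] {
		return
	}
	ix.tracked[did] = true
	for _, ev := range ix.backfill(did) {
		ix.store(ev)
	}
}

// OnFirehose indexes an event only if its author is tracked.
func (ix *Indexer) OnFirehose(ev Event) bool {
	if !ix.tracked[ev.DID] {
		return false
	}
	ix.store(ev)
	return true
}

func main() {
	ix := NewIndexer(func(did string) []Event {
		return []Event{{DID: did, Path: "place.stream.livestream/1"}}
	})
	next := Event{DID: "did:example:alice", Path: "place.stream.livestream/2"}
	fmt.Println("before login:", ix.OnFirehose(next))
	ix.OnLogin("did:example:alice")
	fmt.Println("after login:", ix.OnFirehose(next), len(ix.records))
}
```

A real implementation would also need to reconcile events that arrive while the backfill is still in flight; the sketch ignores that.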

This trick won't scale forever; eventually (hopefully!) Streamplace's continued success will get to the point where it's infeasible to index all Streamplace users on a SQLite database on a single node. At that point we'll either need to follow Bluesky's example and implement a proper dataplane for backing the index, or else do some kind of sharding where each Streamplace node maintains an index over some subset of the users. We'll see!
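For the sharding option, one possible approach (an assumption on my part, not something Streamplace has committed to) is to hash each user's DID and assign it to a node by modulo. The shardFor helper below is hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a DID to one of n nodes by hashing it.
// It returns 0 when n is not positive, so callers should pass n >= 1.
func shardFor(did string, n int) int {
	if n <= 0 {
		return 0
	}
	h := fnv.New32a()
	h.Write([]byte(did)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(n))
}

func main() {
	for _, did := range []string{"did:example:alice", "did:example:bob", "did:example:carol"} {
		fmt.Println(did, "-> node", shardFor(did, 4))
	}
}
```

Plain modulo hashing reassigns most users whenever the node count changes; consistent hashing is the usual fix if nodes come and go often.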

Statefulness (derogatory)

As I mentioned, there are some exceptions, for which we recently introduced a Postgres database.

Eli Mallon (@iame.li): "if you see a postgres database, somebody fucked up"

I'm sad about this, because it feels counter to the ideal of PDS sovereignty. I don't want Streamplace to be holding on to your data, I want YOU to be holding on to your data!

I acknowledge that it's unavoidable for a few things. OAuth credentials are a good example; we need to authenticate with the PDS somehow. And a cluster of Streamplace nodes needs to do a few things internally to manage processing its internal queue and such; that's acceptable. There's also the video data itself, which is big and complicated and warrants its own post. But there's a huge category that's currently in our database and doesn't need to be, which is user secrets.

User Secrets

Streamplace includes a Discord webhook integration. If you're in the Streamplace Discord (👀), you've seen these posts:

[Screenshot of the Streamplace Discord #livestreams channel. @iame.li says "LIVE knocking out some BUGS I think @coding stream enjoyers"]

Discord webhooks work by providing a secret URL that allows software like Streamplace to send in arbitrary APP messages. So for now, we store that in our stateful Postgres database on behalf of our users. We also do this for mobile push notification tokens via Firebase.

This seems kinda easy to solve: we need a way for the PDS to store secret data on behalf of its users. This is a lot easier than solving the full "private data" problem in ATProto where users want private accounts in a way that breaks the entire PDS --> Relay --> AppView architecture. This is a user secret which will only ever be shared with applications that authorize it, and never with other users. This eliminates almost all of the "private data" complexity.

It's so easy to solve, in fact, that the Bluesky team already solved it for themselves with the app.bsky.actor.getPreferences endpoint which does exactly what I describe. But its use is discouraged, and it suffers from being a single array value where everything gets clobbered every time you modify it.

{
  "preferences": [
    {
      "$type": "app.bsky.place.stream.preferences.example",
      "iLoveIceCream": "yeah"
    }
  ]
}

But you can use it for whatever you want as long as you're willing to lie and say you're app.bsky, lol.
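Because the whole array gets replaced on every write, a client that wants to change one entry has to read everything, swap its own entry, and write everything back. A minimal Go sketch of that merge step follows; the Pref type, the upsertPref helper, and the second $type string are invented for the example, and no network calls are made.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Pref is one entry in the preferences array; "$type" says who owns it.
type Pref map[string]any

// typeOf returns the entry's "$type", or "" if it is missing or not a string.
func typeOf(p Pref) string {
	s, _ := p["$type"].(string)
	return s
}

// upsertPref returns a new slice in which the entry with the same "$type"
// as next is replaced (or next is appended), leaving other entries alone.
func upsertPref(prefs []Pref, next Pref) []Pref {
	out := make([]Pref, 0, len(prefs)+1)
	replaced := false
	for _, p := range prefs {
		if typeOf(p) == typeOf(next) {
			out = append(out, next)
			replaced = true
			continue
		}
		out = append(out, p)
	}
	if !replaced {
		out = append(out, next)
	}
	return out
}

func main() {
	raw := []byte(`{"preferences":[{"$type":"app.bsky.place.stream.preferences.example","iLoveIceCream":"yeah"}]}`)
	var doc struct {
		Preferences []Pref `json:"preferences"`
	}
	if err := json.Unmarshal(raw, &doc); err != nil {
		panic(err)
	}
	// Hypothetical second entry; the existing one must survive the write.
	doc.Preferences = upsertPref(doc.Preferences, Pref{
		"$type":   "app.bsky.place.stream.preferences.webhook",
		"enabled": true,
	})
	out, err := json.Marshal(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Two clients doing this read-modify-write at the same time can still overwrite each other, which is the clobbering problem in a nutshell.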

But I think this design is almost right! Just turn this into a key-value store and you've created exactly what we need. Or, if you want to maintain the ATProto semantics, have it be a separate repository; that way everything stored in it would conform to the lexicon shapes and whatnot. You can even go one step further and add an OAuth scope for accessing parts of this repository so only Streamplace has access to your place.stream.webhook.discord records. Neat!

I just think it'd be a shame to wait until "private data" is fully solved before tackling this one. The hard part about Twitter-style private accounts is figuring out all the data syncing semantics without using the firehose. User secrets, only ever shared with applications, dodge that complexity. Let's just do one more iteration on top of app.bsky.actor.getPreferences and be done with it?

Toward fully stateless apps

I really struggle to convey this ecosystem to regular folks — getting them to understand that "Log in with ATProto" is something fundamentally different, cooler, and better than "Log in with Google". "Log in with Google" is showing your passport to check in at the front desk of a hotel; "Log in with ATProto" is an RV with all your stuff in it that fits in your pocket. Except right now it still uses the front desk of the hotel for a couple of things. Let's fix those and we can move closer to the dream of a world where your data belongs to you, and applications just access it.

Thanks for reading! Streamplace is building decentralized livestreaming for decentralized social networks. If this kind of thing is exciting to you, check out our jobs page:

Streamplace Career Opportunities

To Apply: Send your resume and a brief introduction to jobs@stream.place

https://jobs.stream.place/

And join the Discord:

https://discord.stream.place/
