Hot Take: Typing in distributed systems
Hot Takes are quick brain dumps on practices that catch my eye and my ire.
A topic that keeps nagging at me as I work on distributed systems is the proper use of typing across separate components. On one hand, enforcing strict typing at every layer has the potential to keep changes to the structure of data in one component from propagating to other components as errors. On the other hand, the tooling is not especially mature, and the tradeoffs introduce a fragility that could otherwise be avoided.
As disparate components are cobbled together into integrated systems providing reliable features, the transport of data between components can largely be abstracted away, especially when using untyped languages. However, a common maintenance issue arises when one team updates a data structure that’s shared with the broader system: another component then fetches that data under now-outdated assumptions and errors out. A few technologies have shown up to police this scenario by strictly typing the pass-through layers, GraphQL and TypeScript among them.
While these are meant to let multiple teams work in isolation without breaking API contracts between components, it’s my opinion that heavily typed backend components lead to more maintenance issues than they prevent, and more engineering hours get spent to achieve the same results. In contrast to the numbers brought up around how many bugs TypeScript may prevent in the frontend, I’ve observed that adopting TypeScript components in the backend introduces a brittleness into distributed systems that can be very troublesome. For one thing, it’s very difficult to draw lines around types. Many commonly used types depend on other types, creating a dependency chain that often requires the entire org’s up-to-date schema to be available to every component in development, testing, QA, and production environments. There also tends to be a core set of types, e.g. UserIdentity, that every component needs to be aware of. The way I’ve seen this solved is by defining that type in one system and making every other system depend on that definition. Oftentimes I see developers committing point-in-time snapshots of remote schemas just so their local schemas will validate.
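To make that concrete, here’s a minimal sketch of the shared-ownership pattern I’m describing; the package name and fields are hypothetical.

```typescript
// Owned by the identity team and published as a shared package
// (the package name @acme/identity-types is made up for illustration):
export interface UserIdentity {
  userId: string;
  tenantId: string;
  roles: string[];
}

// Every other component then depends on that package (or on a committed
// point-in-time snapshot of its schema) just to compile:
// import { UserIdentity } from "@acme/identity-types";
```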
I picture the platform of independent components converging into a larger structure, à la Voltron. To me, the heavily typed languages act less like the steel girders you’d want supporting that structure and more like glass rods: it’s difficult to bend or mold the code without constant shattering. If nothing else, it takes more lines of code to build the same features, every addition or change incurs a need to properly update types, and more often than not the process is slowed by solving problems created by the typing technology itself, like peculiarities around GraphQL and TypeScript that can stymie forward momentum. For example, in GraphQL systems that use TypeScript, you can’t reuse the same definition between TS and GQL, so you end up with two redundant sets of types written in each syntax, and both need to be edited in tandem while developing.
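A minimal sketch of that duplication, with a hypothetical User type: the same shape gets written once in GraphQL SDL and once in TypeScript, and nothing ties the two together.

```typescript
// GraphQL SDL, typically kept in a .graphql file or a gql`` template literal:
export const typeDefs = /* GraphQL */ `
  type User {
    id: ID!
    email: String!
    displayName: String
  }
`;

// The same shape, restated by hand for the TypeScript side. A field added to
// one definition and not the other only surfaces as a runtime or review-time
// problem, so both files get edited in tandem.
export interface User {
  id: string;
  email: string;
  displayName?: string | null;
}
```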
I also find the way typing has been incorporated to be very confusing. Going back to CompSci 101, typing exists because, under the hood, different kinds of data need different machine instructions and memory layouts. Untyped languages purposefully trade away some of that efficiency to let developers declare variables in a lax manner; the runtime then figures out variable types without the need to heavily annotate your code. This works very well for systems that can spare memory or time, like a typical RESTful API. However, the addition of strict typing on top of untyped languages has nothing to do with performance. If anything, layering typing onto JavaScript takes more resources without making the handling of variables any faster. Instead, its main purpose seems to be preventing the dreaded “undefined” errors endemic to JavaScript, which has traditionally been a frontend concern where solid logging and alerting weren’t always the norm. To that end, you can achieve the same result with static analyzers, proper testing, and proactive logging. Enforcing field types and required fields seems to mainly prevent errors that come from poor-quality code.
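For reference, this is the class of error in question, sketched with a hypothetical shape; the compile-time check and the static-analysis/testing route are both aimed at the same failure.

```typescript
// The dreaded "undefined" error: plain JavaScript throws at runtime with
// "Cannot read properties of undefined (reading 'email')", while TypeScript
// flags the access at compile time ("Object is possibly 'undefined'").
const user: { profile?: { email: string } } = {};

// console.log(user.profile.email);          // compile error in TS, runtime crash in JS
console.log(user.profile?.email ?? "none");  // the guarded access both approaches push you toward
```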
Additionally, typing is not consistent across languages. Each language handles primitive types a little differently, which further hurts the interoperability of the type-safety layers. On a polyglot platform, the handling of types like Number/Null/String varies depending on which language you’re currently in. Unfortunately, most pass-through layers wrongly assume that the origin, destination, and middle layers can always agree 100% on typing. In some cases the origin knows about numeric types the middleware doesn’t, and sometimes, in a baffling show of strictness, the middleware keeps independent enum definitions that only serve to cause maintenance headaches.
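One concrete way this shows up, sketched below with a made-up field: a 64-bit integer that the origin considers perfectly ordinary doesn’t survive a trip through a JavaScript layer as a plain Number.

```typescript
// The origin (say, a Java service emitting a long) writes a 64-bit integer
// into a JSON payload; the value and field name here are made up.
const payload = '{"orderId": 9007199254740993}';

// JSON.parse maps every numeric literal to a double-precision Number,
// so anything past 2^53 quietly loses precision.
const parsed = JSON.parse(payload) as { orderId: number };
console.log(parsed.orderId);                       // 9007199254740992 — silently off by one
console.log(Number.isSafeInteger(parsed.orderId)); // false
```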
I also prefer to view data in the system as defined by the system that creates it. Having to add a redundant layer of annotations so that data can pass through another system seems like a good way to incur more busy work without adding to the capabilities of the application.
Recently, I’ve been experimenting with Confluent’s Schema Registry. On paper, the idea is simple: a central registry of types lets distributed components look up type definitions on demand, so you can avoid hardcoding schemas into each component. However, that’s not really what the Schema Registry does. First of all, each type definition is bound to one of a few schema-definition syntaxes: Avro, JSON Schema, or Protobuf. Already, that means components written in different languages need to add a dependency on the chosen syntax’s parser. For something like Avro in Python, including the Avro packages has a non-trivial impact on build size and memory usage. Also, instead of hardcoding schemas, you need to hardcode schema IDs. A producer must know the ID of the schema it wishes to produce. Except you can’t actually hardcode the IDs, because the Schema Registry (SR) doesn’t currently support keeping IDs consistent across dev/qa/prod environments. So now, rolling out a change to a producer across dev/qa/prod involves first updating the schema registry in each environment, which creates a new schema version with a NEW ID. Then you need to update the code to accommodate the actual change in the data structure, and update the env vars of each deployed environment with the new ID it should use when publishing. It’s absolutely a recipe for botched deployments.
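Here’s a rough sketch of the registration half of that dance, going straight at the registry’s REST API; the URL, subject name, schema, and env var are all made up. Each environment hands back whatever ID it assigns, and that ID then has to be threaded into that environment’s producer config.

```typescript
// Register a new schema version with a Confluent Schema Registry over its REST API.
const SR_URL = process.env.SCHEMA_REGISTRY_URL ?? "http://localhost:8081";

async function registerSchema(subject: string, avroSchema: object): Promise<number> {
  const res = await fetch(`${SR_URL}/subjects/${encodeURIComponent(subject)}/versions`, {
    method: "POST",
    headers: { "Content-Type": "application/vnd.schemaregistry.v1+json" },
    body: JSON.stringify({ schema: JSON.stringify(avroSchema) }),
  });
  if (!res.ok) throw new Error(`schema registration failed: ${res.status}`);

  // The registry assigns the ID — and dev, qa, and prod can each assign a
  // different one, so the ID ends up in per-environment config, not in code.
  const { id } = (await res.json()) as { id: number };
  return id;
}

// The ID returned by each environment then gets copied into that environment's
// producer settings (a SCHEMA_ID env var or similar):
// const id = await registerSchema("orders-value", {
//   type: "record",
//   name: "Order",
//   fields: [{ name: "orderId", type: "string" }],
// });
```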
Additionally, if you’re familiar with Avro, you’ll know that it’s a binary format that embeds its schema alongside the data. Using the schema registry means stripping that schema definition out of the message and replacing it with the schema ID (which, again, varies by environment). So why take the schema out of the Avro message, making it unparseable without the remote dependency of a schema registry and tying it to a specific environment? The stated purpose is so Java processors can get a performance boost when processing huge numbers of messages that share the same schema ID. Instead of reading the schema out of each record, the processor finds the schema ID integer where the schema would have been. If the code is set up properly, it can then check whether a schema with that ID is already in memory. If so, it skips the performance hit of parsing and loading the schema from the message. If not, it has to make a network request to the schema registry to load the schema with that ID. So it only really helps when processing batches of messages that share a schema ID.
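A rough sketch of the framing being described, and of the cache that makes it pay off; the fetchSchemaById callback stands in for whatever registry client the component actually uses.

```typescript
// Confluent's wire format replaces the embedded Avro schema with a framing header:
// byte 0 is a magic byte (0), bytes 1-4 are the schema ID (big-endian), and the
// remainder is the schema-less Avro binary payload.
const schemaCache = new Map<number, string>();

async function resolveSchema(
  message: Buffer,
  fetchSchemaById: (id: number) => Promise<string>, // hypothetical registry lookup
): Promise<{ schema: string; payload: Buffer }> {
  if (message[0] !== 0) throw new Error("unexpected magic byte");
  const schemaId = message.readUInt32BE(1);
  const payload = message.subarray(5);

  let schema = schemaCache.get(schemaId);
  if (schema === undefined) {
    // Cache miss: one network round-trip to the registry for this ID.
    schema = await fetchSchemaById(schemaId);
    schemaCache.set(schemaId, schema);
  }
  // Every subsequent message in the batch carrying the same ID is a cache hit —
  // which is the only case where the scheme actually saves anything.
  return { schema, payload };
}
```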
It’s also no small thing that using the SR adds a remote dependency for development and testing. Components that could previously run in complete isolation now need access to specific environments in order to run. The only way around this is to *cough* hardcode the schema in the test files and mock out the network request to the SR.
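A sketch of that workaround, reusing the resolveSchema sketch from above; the schema is hypothetical and pinned right in the test file so no environment is needed.

```typescript
// Hard-code the schema the test cares about and stub the registry lookup, so the
// component can still run with no network access to any schema registry.
const hardcodedOrderSchema = JSON.stringify({
  type: "record",
  name: "Order",
  fields: [{ name: "orderId", type: "string" }],
});

const fakeRegistryLookup = async (_id: number): Promise<string> => hardcodedOrderSchema;

// In a test:
// const { schema, payload } = await resolveSchema(recordedMessage, fakeRegistryLookup);
```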
Schema IDs can also collide, with multiple schemas getting registered under the same ID. So that’s not great… And schemas can be deleted entirely. I’ve seen developers commit tests that rely on a dev-environment SR, only for those tests to start failing as soon as that dev environment gets wiped.
As I see it, all of these good intentions around preventing hardcoded schemas and cross-component inconsistencies end up forcing you to adopt immature tooling and practices that cause more issues than they prevent. I think the theory has potential, but the tech has a way to go before it’s mature enough to start producing the purported benefits.