Identifiers in Microservices Systems

In a microservices system, data tends to get spread out. Without a consistent mechanism for handling identifiers, it becomes increasingly difficult to follow how data flows through the system, and it can be hard even to know what data is being dealt with.

In many monolithic applications, the database acts as the source of identifiers for resources, and those identifiers tend to be simple auto-incrementing integers. In larger systems with sharded databases the incrementing strategy gets more complicated, but the mechanism remains essentially the same.
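As a rough illustration of how the incrementing strategy changes under sharding, one common arrangement gives each shard its own starting offset and a step equal to the number of shards, so that no two shards ever hand out the same integer. The sketch below assumes that arrangement; `shard_sequence` and its parameters are hypothetical names, not part of any particular database.

```python
import itertools


def shard_sequence(shard_index, shard_count):
    """Yield the identifiers one shard would hand out.

    Hypothetical helper: shard 0 of 3 yields 1, 4, 7, ...; shard 1 of 3
    yields 2, 5, 8, ...; the sequences never overlap.
    """
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index must be in [0, shard_count)")
    # Start at shard_index + 1 so identifiers begin at 1, then step by the
    # total number of shards.
    return itertools.count(start=shard_index + 1, step=shard_count)


first_shard = list(itertools.islice(shard_sequence(0, 3), 3))
second_shard = list(itertools.islice(shard_sequence(1, 3), 3))
```

The mechanism is still a counter owned by the database; the only added complexity is agreeing on the offset and step for each shard.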

This same strategy of relying on the database also tends to be the favored approach in early iterations of microservices systems, which makes perfect sense if the system is being created by breaking down an established monolith. In some circumstances, though, it makes sense to think through a system-wide identifier strategy rather than allowing one to emerge piecemeal or passing around simple integers that are easily confused with one another.

Twitter’s Snowflake Service

Several years ago I heard about Twitter’s Snowflake service. This service was responsible for generating a unique, roughly sequential identifier for every tweet that Twitter received. Snowflake’s identifiers encoded a timestamp, the service instance that generated them, and a simple counter to avoid collisions when generating 10,000+ identifiers per second.
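To make the layout concrete, here is a minimal sketch of a Snowflake-style generator. It assumes the commonly described 64-bit layout of 41 bits of millisecond timestamp, 10 bits of worker identifier, and 12 bits of counter. The custom epoch, the class name, and the injectable clock are my own assumptions for illustration, not Twitter’s implementation.

```python
import threading
import time

WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1
# Hypothetical custom epoch: 2020-01-01T00:00:00Z in milliseconds.
EPOCH_MS = 1_577_836_800_000


class SnowflakeGenerator:
    """Illustrative generator: timestamp | worker id | per-millisecond counter."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock  # injectable so the example can be tested
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: bump the counter to avoid a collision.
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Counter exhausted; spin until the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )


def decode(identifier):
    """Split an identifier back into (timestamp_ms, worker_id, sequence)."""
    sequence = identifier & MAX_SEQUENCE
    worker_id = (identifier >> SEQUENCE_BITS) & ((1 << WORKER_BITS) - 1)
    timestamp_ms = (identifier >> (WORKER_BITS + SEQUENCE_BITS)) + EPOCH_MS
    return timestamp_ms, worker_id, sequence
```

Because the timestamp occupies the high bits, identifiers from one worker sort in generation order, and identifiers from different workers sort by time to within clock skew.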

Twitter created this service for several reasons. Most basically, in moving from MySQL to Cassandra, Twitter could no longer rely on the database to generate unique identifiers. Another factor was wanting to generate identifiers before the data was guaranteed to be ready to go into the database at all. The Snowflake service took care of both issues and could generate unique, roughly sequential identifiers at a rate that would heavily tax even a highly performant database system.

Twitter’s solution to their particular problem was very effective. The downside is that it fits best when there is one major entity that your system cares about. Most complex systems have several, if not many, significant entities that require unique identifiers. Drawing them all from the same space may be acceptable, but it is a limitation that is not difficult to overcome.

Correlation IDs

One well-established pattern for microservices is the use of Correlation IDs. The purpose of these identifiers is to track a request throughout a system, from when it enters until whatever is needed is complete. In many cases these identifiers are just as naively generated as the auto-incrementing identifiers used by databases.

They are often generated by a tracing service and persisted only in logs, so this narrow solution doesn’t address the general matter of resource identification. It does, however, provide an interesting use case that any sufficiently complex microservice system should address, and one that could overlap with a more general approach.
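A minimal sketch of the pattern, assuming HTTP-style headers passed around as plain dictionaries: the first service to see a request mints a correlation identifier if none arrived with it, and every downstream call and log line carries the same value. The header name `X-Correlation-ID` and both helper functions are assumptions for illustration, not a standard.

```python
import logging
import uuid

# Hypothetical header name; teams pick their own convention.
CORRELATION_HEADER = "X-Correlation-ID"


def ensure_correlation_id(headers):
    """Reuse the caller's correlation ID, or mint one at the system edge."""
    existing = headers.get(CORRELATION_HEADER)
    return existing if existing else uuid.uuid4().hex


def outbound_headers(correlation_id, extra=None):
    """Build headers for a downstream call that carry the same ID forward."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = correlation_id
    return headers


def correlated_logger(name, correlation_id):
    """Return a logger adapter that attaches the ID to every record."""
    return logging.LoggerAdapter(
        logging.getLogger(name), {"correlation_id": correlation_id}
    )


incoming = {"Accept": "application/json"}
cid = ensure_correlation_id(incoming)
log = correlated_logger("orders", cid)
downstream = outbound_headers(cid, {"Accept": "application/json"})
```

Real HTTP headers are case-insensitive, which a plain dictionary lookup does not handle; a web framework’s header object would normally take care of that.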

Microservice Identifier Strategies

When developing a strategy for identifiers in a microservice system there are several factors to take into account. First, you should be using correlation identifiers, so your strategy needs to address them. Second, the more resources your system considers important, the more value you are likely to derive from encoding the resource type in the identifier. Third, an identifier should be assigned as soon as a resource has meaning, even if it is not persisted yet.

Another detail of any strategy is how much information your identifiers can leak about your system. Many will opt to use UUIDs, but leakage is a concern here as well. Version 1 UUIDs, for example, embed a timestamp and a node value that is typically the generating host’s MAC address, so they can reveal when and where an identifier originated. Randomly generated Version 4 UUIDs avoid this. For me this does not seem like a major concern, but different environments will have different tolerances for information leakage like this. The downside of randomly generated UUIDs is the loss of sequential meaning and the inability to extract any other meaningful information from the identifier.
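The leakage is easy to see with Python’s standard `uuid` module. The sketch below recovers the creation time and node value from a time-based (Version 1) UUID. `inspect_v1` is a hypothetical helper name, and the node value passed in is a made-up example rather than a real MAC address.

```python
import datetime
import uuid

# Number of 100-nanosecond intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def inspect_v1(u):
    """Return (creation time in UTC, node value) embedded in a v1 UUID."""
    if u.version != 1:
        raise ValueError("only Version 1 UUIDs carry a timestamp and node")
    unix_seconds = (u.time - UUID_EPOCH_OFFSET) / 1e7
    created = datetime.datetime.fromtimestamp(
        unix_seconds, tz=datetime.timezone.utc
    )
    return created, u.node


# Made-up node value standing in for a host's MAC address.
time_based = uuid.uuid1(node=0x001122334455)
created, node = inspect_v1(time_based)

# A Version 4 UUID is random apart from its version and variant bits.
random_based = uuid.uuid4()
```

Anyone holding `time_based` can run the same extraction, which is the whole concern. Nothing comparable can be recovered from `random_based`.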

In most cases I think it is valuable to take the ideas of Twitter’s Snowflake service and expand on them to generate identifiers that encode certain details, including their time of generation and the resource type they represent. Those two details, with a little bit of randomization, are a level of data leakage I tend to be comfortable with. Identifiers with these characteristics could work for both resources and correlation, can support an extensive number of resource types, and can be generated via very simple algorithms that scale easily. So, they would satisfy the factors I feel need to be addressed.
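One way such an identifier could be laid out is sketched below. This is only an illustration under my own assumptions: a 41-bit millisecond timestamp, a 10-bit resource type code, and 12 random bits. The bit widths, the epoch, the type registry, and the function names are all hypothetical.

```python
import secrets
import time

TYPE_BITS = 10
RANDOM_BITS = 12
# Hypothetical custom epoch: 2020-01-01T00:00:00Z in milliseconds.
EPOCH_MS = 1_577_836_800_000

# Hypothetical registry; correlation IDs are treated as one more type.
RESOURCE_TYPES = {"correlation": 0, "account": 1, "invoice": 2}
TYPE_NAMES = {code: name for name, code in RESOURCE_TYPES.items()}


def new_id(resource_type, now_ms=None):
    """Pack timestamp | resource type | random bits into one integer."""
    type_code = RESOURCE_TYPES[resource_type]
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    random_part = secrets.randbits(RANDOM_BITS)
    return (
        ((now_ms - EPOCH_MS) << (TYPE_BITS + RANDOM_BITS))
        | (type_code << RANDOM_BITS)
        | random_part
    )


def describe(identifier):
    """Recover (resource type name, creation time in ms) from an identifier."""
    type_code = (identifier >> RANDOM_BITS) & ((1 << TYPE_BITS) - 1)
    created_ms = (identifier >> (TYPE_BITS + RANDOM_BITS)) + EPOCH_MS
    return TYPE_NAMES[type_code], created_ms


invoice_id = new_id("invoice")
kind, created_ms = describe(invoice_id)
```

Random bits alone only make collisions unlikely within a millisecond, not impossible. A real service would probably pair them with an instance identifier or a counter, as Snowflake does.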

In the coming weeks I will be working on a prototype of the kind of identifier service I’ve described above. I will be building on the very fast approach used by Twitter’s Snowflake service, extended to encode additional details such as resource type and/or purpose. Hopefully, whether or not we end up implementing this work at Nav, I will be able to open source the results for others to borrow or build upon.
