Snowflakes and other UUIDs

in #dev9 days ago

snowflake

In software projects there's often a need to assign a unique identifier to a data record or entity so that it can be distinctly recognised. Aside from sequential numbering, there are two common methods of generating such IDs: UUIDs and snowflakes. This blog post describes both of them pointing out their strength and weaknesses.

UUID

UUIDs are 128-bit values that are meant to be unique for practical purposes.

An example UUID is a5de120b-fe73-4fbf-afbf-061551b299bb.

All UUIDs have the following structure: xxxxxxxx-xxxx-Vxxx-Rxxx-xxxxxxxxxxxx
The values in the V and R locations uniquely identify the UUID version and variant respectively. The x values are set depending on the version of UUID.

So we can see that the above example ID is a version 4 UUID.
89b910b8-59c1-11f0-94b8-325096b39f47 is a version 1 UUID.
0197db8e-0f63-74a6-9ee5-d402424d1cd4 is a version 7 UUID.

There's a few different UUID versions and I will describe them out-of-order, because it will be easier to explain:

Version 1 (timestamp + MAC address)

Concatenates a 60-bit timestamp with the 48-bit MAC address of the node that is generating the UUID. This sounds pretty neat and useful, until you find out that the timestamp is the number of 100-nanosecond intervals since midnight 15 October 1582 UTC - the date of Gregorian reform to the Christian calendar. This is quite a weird encoding. What's worse, timestamp bits are stored in the UUID from the lowest to the highest bits, which means these IDs cannot be trivially sorted by the date. The advantage of this version is that it is guaranteed to be a unique ID, unless generated on the same computer and at the exact same time. The issue is that it exposes the generating machine's MAC address.

Version 6 (timestamp + MAC address)

This is the same as version 1 except all timestamp bits are ordered from most significant to least significant. This fixes one issue with version 1 - IDs can now be trivially sorted by the date.

Version 4 (random)

All the bits are generated randomly. This means there's potentially a chance of duplication, but this is almost impossible.

Version 5 (namespace name-based)

This generates a unique ID from a SHA-1 hash of a concatenation of a namespace and a name. There is no temporal or random component, so the same input produces the same output every time. If you need to generate reproducible UUIDs from given names, you likely want this version.

Version 3 (namespace name-based)

The same as version 5, but uses MD5 instead of SHA-1. Unless you know you need it, you should use version 5 instead of this one.

Version 7 (timestamp and random)

Begins with a 48 bit big-endian Unix Epoch timestamp with approximately millisecond granularity. The following bits are random seeded counter

Version 2 (DCE Security version, with embedded POSIX UIDs)

Not very widely used.
Similar to version 1 - uses the current time, along with the MAC address, but replaces the low part of the time field with a local identifier e.g. the user ID or group ID of the local account that created the UUID. Compared to version 1, it has the advantage that you can know not only when and where the ID was created, but also who created the identifier. Replacing part of the timestamp with local ID has some important drawbacks though. It reduces the number of unique ID that are possible to generate and makes the timestamp lossy - they only have granularity of about 7 minutes.

Version 8 (custom)

There are one two requirements:

  • The variant bits have to be 10.
  • The version nibble has to be the value of 8.

The remaining bits are up to the vendor to customise.

The advantages of UUID are that they are standardised and there's many ready to use libraries. One disadvantage is that they are 128-bit values, which might be pretty big for some use cases. There might be other issues depending on the exact context.
For example if a deterministic ID is needed then only versions 3 and 5 are appropriate. If a timestamp-sortable ID is needed then only versions 6 or 7 are adequate. But what if both properties are needed? Versions 3 and 5 are not sortable by time, and versions 6 and 7 are not fully deterministic.

Snowflake

Snowflake were originally created at Twitter as a performant way to create a large amount of unique IDs in distributed context.

They are in spirit similar to UUID version 6. They contain timestamp, generating node's ID and a sequence number. Similarly to UUID version 6, snowflakes are trivially sortable by time. The big difference is that snowflakes are 64-bit compared to 128-bit UUID. For this reason they don't have any particular representation - they are presented simply as an integer.

Twitter's snowflake has the following fields widths:

  • sign bit - unused
  • 41 bits for the timestamp
  • 10 bits for node ID
  • 12 bits for sequence number

Because it's just an integer without any structure, it's also very easy to customise. Simple bit manipulation operations are enough to create your own snowflake type in handful of lines. It's easy to change bit widths of the fields and also add or remove fields.

The main issue with snowflakes is that they are less well-known and there's less ready to be used libraries. It's also not standardised, which combined with the easy customisability cause that the existing libraries might not be compatible with each other.

With all this freedom you can create snowflake type that is both sortable by time and deterministic. It's just a matter of proper field interpretation.

pg-snowflake

I created a Postgresql extension that can create customisable snowflake IDs. You can create many distinct ID types, each having different bit allocations for its fields.

Here's the Github page for pg_snowflake extension.

Other snowflakes

There are a few other libraries creating snowflakes and UUIDs, you can find some here:

Read more

Sort:  

For this particular audience, it's probably useful to add a link to the snowflake type for user notifications we will be using in the next release of hivemind: https://gitlab.syncad.com/hive/hivemind/-/merge_requests/885