brandur.org

Identity Crisis: Sequence v. UUID as Primary Key

Looking down the coastline towards Pacifica

Last month, Cybertec put together an excellent piece on the use of UUIDs vs. bigserial vs IDENTITY as primary keys in a Postgres database. Although freshly written, it’s only the latest talking point in a long-standing discussion amongst DBAs and database nerds that’s been ongoing for years – should you use integers (AKA serials, AKA sequences) or UUIDs as primary keys?

Its publication was quite timely for me, as I’d been considering it very recently in the context of our own application, which I joined to find was using UUIDs, contrary to what I probably would have chosen had I been there at the beginning. It’d come up elsewhere as well. Years ago when we published the Heroku HTTP API design guide, we suggested using UUIDs to identify all resources, but without any real rationale for why, and quite fairly, someone had recently opened a GitHub issue on the repository to challenge the point.

Over the years since the guide was written I’d starting leaning further back in the direction of integers being the right answer, but now that I’ve read into the subject more deeply, I’m less convinced that there’s a right answer at all. There are small advantages over the other on each side, and some pitfalls that should clearly be avoided, but when used mindfully, both integers and UUIDs make quite workable solutions, with (in my opinion) appropriately levered UUIDs edging out a little ahead.

CREATE TABLE has_uuid_pkey (
   id uuid DEFAULT gen_random_uuid() PRIMARY KEY,
   ...
);

UUIDs are 128-bit sequences that are canonically represented in hex-encoded string form like 63bb4648-0b42-497f-9ad8-571cd60ee4f3. It’s worth remembering that although we’re used to seeing them as strings, when they’re being stored (properly) in a database, to the machine they’re not all that different from any other sequence of bits – they’re twice the length as a standard 64-bit integer, but stored and compared similarly.

When we’re talking about UUIDs, we’re often talking about V4 UUIDs, which is a fancy way of saying that the entire UUID byte sequence is generated randomly, as opposed to using an algorithm based on clock sequence or anything else.

UUIDs as primary key aren’t a slam drunk, but do have some advantages:

But these upsides are heavily counterbalanced by serious performance considerations:

At smaller scale, neither of these is likely to cause much trouble. Postgres is very efficient, and working with UUIDs will mostly be fine. However, in a hot system doing a lot of inserts, the effect is significant. Here are some charts from 2nd Quadrant showing the lopsidedness of using random UUIDs versus more deterministic identity techniques. With a large data set, notice how poorly randomness performs on transaction throughput and how much much additional WAL it produces:

Transactions per second comparison

WAL size comparison

(These courtesy of 2nd Quadrant. The “small” data set transacts on an empty table, “medium” fits in memory, and “large” exceeds memory. In the first graph, see how the blue bars (random UUIDs) for medium/large data sets are so much smaller than the other strategies, meaning fewer inserts are possible. In the second, see how random produces orders of magnitude more WAL volume. Actually, the comparisons are a little more complex than UUID vs. bigserial but give you the right general idea – more below.)


This post was originally broadcast in email form. Can I interest you in seeing more like it? Nanoglyph is never sent more than once a week.

CREATE TABLE uses_serial (
   id bigserial PRIMARY KEY,
   ...
);

Now onto the advantages of “just” using a bigserial, which is Postgres shorthand for an integer field whose values are populated by an autoincrementing integer sequence.

(Side note: I’m going out of my way here to talk about bigserial, which is a 64-bit integer, rather than just serial, which is 32. Simplifying only an itsy bitsy tiny bit, in today’s world there is never a good reason to use a 32-bit sequence rather than a 64-bit one now that 64-bit architecture is so ubiquitous. There’s a very real possibility of overflowing a 32-bit sequence, while with a 64-bit sequence there’s isn’t. Always use bigserial.)

bigserial advantages:

But by far the largest benefit of a bigserial is that it doesn’t have the downsides of a UUID. New insertions will occur on relatively few pages in an index, allowing more transaction throughput, and creating minimal WAL volume. Performance improves and operational risk lowered.

One subject touched upon in Cybertec’s article that was new to me was the IDENTITY keyword:

CREATE TABLE uses_identity (
   id bigint GENERATED ALWAYS AS IDENTITY
             PRIMARY KEY,
   ...
);

IDENTITY has identical performance characteristics to a bigserial, but it’s more standard across SQL dialects, and more importantly, it’ll error if a row is inserted with an arbitrary value – all insertions must get a value from the default sequence. Although it’ll make almost no difference for most day-to-day uses, it’s yet another bug-proofing feature that RDMSes are so good at providing.


Where things get muddied is that as it turns out, it’s possible to use UUIDs without the downsides by introducing some determinism into the generator algorithm – or put otherwise – don’t blindly use V4. 2nd Quadrant writes about sequential UUIDs, using the sequential-uuids extension to introduce two new UUID generator function:

Both operate under the same general idea: reserve some amount of fixed space at the beginning of a UUID for non-random data, and leave the rest of it for randomness. uuid_sequence_nextval uses a sequence to produce the non-random portion while uuid_time_nextval uses the current timestamp. Both are highly configurable, but a decent default for uuid_time_nextval would be to use a block with 64k values, meaning that we’d reserve 16 bits at the beginning of a UUID for the sequence, with the other 112 bits still random:

    82cc              495e-9d2a-4b18-af79-73de1cb3209e
|----------|    |------------------------------------------|
  sequence                         random
  16 bits                         112 bits

64k values isn’t enough to continue incrementing perpetually, so the generator cycles back to the beginning of the sequence regularly:

prefix := (nextval('s') / block_size) % block_count

If you visualize a B-tree as a one-dimensional continuum, you can think of the UUID generator as a cursor slowly moving from one end to the other, busily producing values along the way. When it brushes up against the far end by exhausting its prefix space, it jumps back to the opposite side and starts over again. There’s enough entropy in the suffix that inserts will never conflict no matter how many times it wraps back around.

See 2nd Quadrant’s article on the subject for more detail, but the short version is that in practice this solves the performance problems with UUIDs – new inserts map to the same set of new pages, meaning fewer pages in memory and fewer cache misses, leading to good throughput and less WAL amplification.

A generalization of this idea beyond Postgres are ULIDs (“Universally Unique Lexicographically Sortable Identifier”). They’re a 128-bit type compatible with UUID (meaning you can store them into the uuid type in Postgres or in various programming languages) that reserves the first 48 bits for a timestamp and the rest for randomness:

    01AN4Z07BY             79KA1307SR9X4MV3
|----------------|    |------------------------|
       time                   randomness
      48bits                    80bits

(By convention they’re encoded in base-32 instead of hex, which is why the string representation looks different than a UUID’s.)

The timestamp is an integer of the time in milliseconds since the Unix epoch, and 48 bits is enough runway to not overflow until 10,889 AD.

ULIDs differ from the Postgres UUID generators above in that the frontloaded bits are always incrementing and never recycle. This has the nice advantage of keeping all ULIDs created anywhere nicely sortable compared to each other. It has the downsides of being more prone to collision and predictability because there are fewer random bits, and could produce a lopsided index if many values are inserted then subsequently deleted, leaving a large space of sparse pages that will never get reused.

A nice thing about ULIDs is that the algorithm is implemented in every programming language under the sun including native SQL for Postgres, only requiring access to the pgcrypto extension. The UUIDs generators above are their own Postgres extension, which is a problem when it comes to hosted Postgres.

Fun fact, Stripe resource IDs are more or less a ULID:

ch_1Iyqf42eZvKYlo2C4MIyA80e

They don’t share anything with the ULID spec, but are an example of multiple discovery of a fairly obvious idea. Some number of those bits at the beginning are a timestamp while the rest are random entropy.

Another Stripe concept that’s not a bad idea is attaching an ID prefix like ch_. The prefix makes it very easy for a human to match an ID to a resource type, and also enables the development of some useful tools. At Stripe we had an internal admin lookup panel where you could plug in any ID and it would instantly take you to a representation of that object (or even just paste one into your URL bar like http://go/object/<id>) – very convenient for debugging production. Without the prefix, this would require a brute force lookup on every internal table, or maintaining a single master lookup table, but with it, we just translate prefix to resource type and do a lookup in the corresponding table. Fast and simple.

A small potential problem with ID prefixes is that the space efficiency of storing a plain bigserial (exactly 64 bits) or uuid/ulid (exactly 128 bits) is pretty nice. Not only are they very compact, but database optimizations like SortSupport make them faster to sort and build indexes on as well. Pushing a string prefix onto the front would seem to compromise that by forcing the ID to be stored as varchar/text. Luckily there’s a simple workaround to get the upside of both – keep IDs stored as compact binary and only add prefixes at the time they’re being rendered publicly:

def render_id
  'ch_' + uuid.string()
end

Similarly, when interpreting a lookup for such an ID, parse off the prefix before querying the DB.

(And by the way, with this one small tweak and you’ll be significantly one-upping Stripe from a technical sense because everything there is stored as a naive string.)


So where does that leave us? Well, like I said at the top, there’s no overwhelming victor, but we can try to bucket some takeaways anyway.

Good patterns:

Not quite as definitive, but also probably good:

Bad patterns:

The UUID versus integer debate has been ongoing for a long time, but one of its distinctive features is that the strengths and weaknesses of one approach or the other tend to be poorly understood. As noted above, we were guilty of this when suggesting UUIDs as part of the Heroku HTTP API design guide.

In my first job out school we were somewhat of technological forerunners in that we’d adopted UUIDs (or as Microsoft calls them “GUIDs”, because the fine distinction between “universal” and “global” is very important in the Enterprise™) as primary keys early and completely in our SQL Server database. No one at the company was a database guru, so there hadn’t been a whole lot of rationale behind the original decision to use UUIDs, but it’d been made very early and we’d stuck to it consistently, so the convention was widespread.

After running into database performance issues, we hired a very expensive DBA contractor, who immediately and definitively identified our bottleneck as UUIDs (obviously!), and went to work (or more correctly, sent us to work) converting UUIDs to sequences.

At the time I didn’t know anything about databases or UUIDs, and was blasé towards this sort of dictate-from-on-high change request from a DBA, which was somewhat routine. This was a time when good open-source databases were not nearly as widespread, and corporate RDMSes like SQL Server and Oracle still held special mystique as magical black boxes, along the rare wizards with sufficient talent with the occult to peer inside of them. We made the changes, and spent hundreds of hours converting dozens of UUID keys to sequences.

In retrospect, I realize that although I knew little about databases at the time, this was a trait that I had in common with our expert DBA. None of our performance problems were insert-related, and for a smart database engine, joining on a 64-bit key isn’t functionally much different than joining on a high-entropy 128-bit one (again, see SortSupport).

This is a long-winded way of saying that UUIDs are not as bad as many greybeard DBAs make them out to be, especially when applied properly. Personally, when I started research for this piece, I expected to come out having convinced myself that bigserial was a clear winner. That hasn’t happened, and if anything, I’ve landed more in the camp of “intelligent” (i.e. performance aware) UUIDs.


The photo at the top is from a Bay Area walk that I’ve been wanting to try for a long time, and finally did last weekend. Start at the geographic center of San Francisco (Twin Peaks), make your way down through Golden Gate Park to the ocean, south along the Great Highway, onto the beach passed Fort Funston and way down to Pacifica, then up along Sweeney Ridge, until finally descending into San Bruno. Clocking in at a little over 42 km, it’s almost exactly the length of one marathon, but at walking pace, you need the season’s long daylight hours to make it work.

Walk to San Bruno

It’s a longer one, but if you’re in the San Francisco area, I’d encourage you to go down at least as far as the Great Highway if you haven’t already done it. Both it and part of JFK are still open to people and it’s amazing there right now, especially when the crowds come out on the weekends. But given the city’s general affinity for regressivism, I wouldn’t count on these Covid-era boons lasting much longer, so get out and enjoy them while you can.

Pushing all the way to San Bruno is a great way to experience many of the Bay’s little ecologies all in one day – from the eucalyptus and redwood groves of Golden Gate Park, to the crashing waves of the Pacific, to the dry California terrain along its impressive ridgelines, often smothered by low hanging cloud, despite a pristine blue sky above.

Until next week.

Looking down the Great Highway

Pacifica

Sweeney Ridge

South San Francisco