Canonical Log Lines 2.0

Redwoods in Heroes Grove, Golden Gate Park

I’m about a month into my new job – still at a point where everything feels fresh and exciting, and not long enough to have accumulated even an ounce of jadedness. So far it’s exactly the change I was looking for. Life is less frantic at a smaller company, and yet the feeling of relative calm is misleading. Intuitively it would seem that a more relaxed environment would result in commensurately less output, but it’s the opposite. By slicing ten layers of overhead out of the stack, individual people are able to do much more. My time’s flipped from a previous ratio of 9010 busy work/development to an almost perfectly inverted 1090, with the 90 spent building things. It’s been fantastic.

There’s a lot to talk about – self-correcting transaction-based systems in Postgres, organizing moderately large Go projects, best practices for HTTP APIs in 2021, and more. I’d originally started baking it all into one very long post before realizing that it would need its own table of contents and glossary. Instead, I’m breaking it up, with this issue containing some highlights from the first month and some reflections on canonical log lines.

As usual, I’m Brandur, welcome to Nanoglyph, a newsletter about service toolkits and vintage bike messenger photography.

Back at Olde Heroku, we were some of the original purveyors of microservices. Knowing what we know today, that’s something of a dubious credential, with the industry as a whole having gone through an entire expansion and contraction phase since, initially buying into the idea wholesale before overcompensating with a hard turn into skepticism as the downsides of a fragmented data model and major coordination overhead became more evident, before finally adopting the more neutral outlook that many of us hold today.

We practiced what we preached, with separate systems for routing, dyno lifecycle management, Postgres, user-facing dashboard, billing, and orchestration, all chattering to each other at hyper frequency to keep the cloud healthy and alive. We invented names of endearment for services and interfaces boundaries alike – Hermes, Shushu, Lockstep, KPI (kernel platform API), Yobuko – and they all spoke the lingua franca of the internet, HTTP.

I worked on a team called “API” which managed a component called api which acted as, you guessed it, an API, specifically the HTTP interface used by Heroku’s dashboard and CLI. A command line invocation of heroku create myapp translated into a curl -H "Accept: application/vnd.heroku+json; version=3" -X POST as it left your terminal.

Some of our closest collaborators were the DoD (not the real one 1), who ran a component called “Shogun”, powering Heroku Postgres, and later Heroku Redis and Kafka. Shogun could be thought of as a state machine with an HTTP API, designed to create databases and manage their lifecycles through provisioning, maintenance, failover, recredentialing, and eventual retirement. Just like a 3rd party addon provider, when a database needed provisioning, the API would reach out to Shogun, ask for a new resource, and inject the connection string it responded with into the environment of an app. It probably seems like Shogun would have been managing big state even if not enormous state, and that probably would’ve been the case if not for Heroku’s generous free tier, which along with free dynos, gave a free database to anyone asked for one. A few years into the Cedar stack’s lifespan, free tier databases were as plentiful as distant stars in the night sky, winking in and out of existence every second of every day, and guaranteeing Shogun permanent job security.

Heroku sign at 650 7th

Joining Crunchy, the setup I’ve inherited is eerily similar to what I used to work in back then, and even involves some of the same people.

With Crunchy Bridge, a frontend written in TypeScript and React is itself stateless, and delegates operations to a backend API written in Go on Postgres (referred to as simply “Platform”), and responsible for keeping track of everything from users to provisioned clusters to subscriptions. And similarly to Heroku, the platform in turn delegates cluster management to yet another backend service, called Owl. Like Shogun, it’s primarily a state machine, initiating operations, then tracking them through to completion. The service responsible for provisioning Postgres databases is of course itself built on a Postgres database, and it leverages transactional guarantees to make sure that no state is ever lost and no action ever skipped.

My job’s to help run and develop the middle platform tier. It’s a mature codebase that runs a production service, but everything exists on a continuum, and after leaving a company where over the years we wrote code to handle every conceivable edge (and that poor code showed it, with hundreds of thousands of if/else conditionals and complications to handle countless special cases), it feels young.

Certainly there was plenty to do in my first couple weeks, with most of my work put toward getting us up and running with a basic service toolkit, taking the platform from a running program to an observable running program that’s conducive to debugging production and tells us about problems when they happen. A few features installed so far:

Distributed Sentry trace

In terms of mostly meaningless stats, I’ve touched about 40k lines so far:

Number of lines changed

It’s been cathartic. Platform and Owl are two of the cleanest codebases I’ve ever had the pleasure to work in professionally, and contributing to them has reminded me that it is possible to have a job in tech where the technology is actually fun to work with, something that I’d started to forget over the last few years. If you’re stuck in a similar position, and have spent your day implementing a tiny feature that should have taken five minutes but took five hours, squashing a bug that only existed in the first place by virtue of overwhelming systemic complexity, or debating people in 37 different Slack threads to get to a consensus that’s 3% better compared to if someone had just made a decision themselves, then I don’t know, it might be worth reminding yourself of that.

A while back, I wrote about an idea called canonical log lines, which I still stand by as the single, simplest, best method of getting easy insight into production that there is. (I should note that these were not originally my idea, but also that I don’t know whose they were.) Starting at Crunchy, I wanted something similar immediately, and started coding them on day one.

They’re a dead simple concept. In addition to normal logging made during a request, emit one, big, unified log line at the end that glues together all the key vitals together into one place:

canonical_log_line http_method=GET http_path=/v1/clusters duration=0.050 db_duration=0.045 db_num_queries=3 user=usr_123 status=200

When operating production, you’re occasionally interested in specific requests, like if one produced a new type of error that needs to be investigated. But much more often you’re interested in request patterns. Like, are failing more requests now than an hour ago? Does increasing DB latency suggest that Postgres might be degraded? Is one single user hammering us inordinately hard?

This is where canonical log lines come in, and where they come in to perform magically. Along with being a one stop shop for seeing every important statistic of a request in one place, they’re also designed to be as friendly to aggregates as logs can possibly be. Combined with a system like Splunk, I can answer all of the questions above with ad-hoc queries like these:

canonical_log_line earliest=-2h | timechart count by status

canonical_log_line | timechart p50(db_duration) p95(db_duration) p99(db_duration)

canonical_log_line earliest=-10m | top by user

Canonical log lines are the one tool that I used reliably every single day of the 5+ years I worked the job. It might seem hard to believe if you’ve never used something similar, but they really are that good.

This time, since we weren’t using Splunk, and since I was responsible for setting up our logging, I decided to try something a little different. Instead of old trusty logfmt (the key/value pairs above that look like http_path=/v1/clusters duration=0.050), I configured our logs to be emitted as fully structured JSON.

I’ve always argued that logfmt is a great operational format because while it’s machine parseable, it’s also relatively friendly to humans, whereas JSON is clearly not. But this matters somewhat less if your logging system is going to be parsing it all for you anyway. For most types of logs, LogDNA (our current Splunk alternative) will just echo them back to you. It’ll parse both logfmt and JSON logs when it detects them so that the metadata is available for aggregates, and usually just echo those back too. However, if a JSON log contains the special key message, it’ll show just that on the display line, with the other information available by expanding it. This leads to a very clean log trace:

Clean trace of log lines

We emit message as a very terse human-readable explanation of a request’s result:

canonical_log_line GET /api/v1/teams -> 200 (0.004936s)

(By the way, I don’t think the digest necessarily has to be this spartan. We’ll probably add more to it as we figure out what we use the most from our logs.)

But all the information is still there when we need it, and in a much more human-friendly way:

Log line expanded

Because of their hidden-by-default nature and clean visualization, we can do things that might not otherwise be a good idea. You can see in the image above that I’m dumping a sentry_trace_url for every line. Normally that’d be a lot of visual noise in the logs, so I’d avoid it, but since that’s not a problem here, it’s nice to provide an easily-copyable URL to visualize the request’s trace right in Sentry. If the request ended in error, a sentry_error_url also gets dropped in so that I can easily go see the full back trace.

Maybe most importantly, all that data is there behind the scenes in a machine-readable format, so aggregates are still possible:

Aggregating logs

An even more powerful feature unlocked by JSON is the capacity for multi-level nested structures. You probably don’t want to go too overboard with this, but as an example, we’ve found them nice to track every backend service call that occurred an API request along with its interesting vitals (see owl_requests):

Nested structures in logs

(And while the rich structures are nice, we still include “flattened” information as well like owl_duration and owl_num_requests for easier/faster use in aggregates.)

In previous work environments, getting this information probably would’ve been possible, but it was like pulling teeth, and often stressful too because you’re usually only trying to figure out how to get it during an operational emergency. Here, we don’t just make it possible to find, we make it easy.

Alright, enough about tech. Today, I’ll leave you with this link to some incredible photography of bike messengers working in downtown Toronto of the 90s by Trevor Hughes, presented at high resolution and with minimal cruft. Great subjects, a fascinating moment of history, and beautiful photography at at time when beautiful photography was hard.

This isn’t a scene that I ever actually experienced, and yet it still hits me straight to the gut with powerful nostalgia. The grunge style of the 90s reaches seemingly unattainable levels of je-ne-sais-quoi cool, there’s a certain romance of tight-knit communities of young people living in inner cities, and nary a smartphone in sight. Not only did people still work together in buildings back then, but physical objects needed to be rush couriered between places, something that seems so quaint today amongst our interminable flood of JavaScript and email.

Until next time.

1 DoD = Department of Data.