Pouria Hadjibagheri and the Cascade of Doom
November 15, 2021 1:14 AM

Cascade of Doom: JIT, and how a Postgres update led to 70% failure on a critical national service

Pouria Hadjibagheri, technical and development lead of the official UK COVID-19 Dashboard, explains in detail how a serious issue with the UK's primary data reporting service for the COVID-19 pandemic was resolved.

For a bonus, here's a talk by Pouria in March 2021 at the Centre for Human-Computer Interaction Design on the development of the dashboard, including the technology that powers the service, design and development techniques, challenges with running a major national service, and the difficulties and advantages of open source software and open data.
posted by knapah (29 comments total) 19 users marked this as a favorite
 
I love these sorts of reports and the sort of WTF? conditions they come out of. I know it's going to be a good day, eventually, when a client comes to me with an impossible issue that turns out to be true (EG: I can't email more than 500 miles away). If all support was like that I'd still be sitting behind a desk.

Hadjibagheri responds in the comments and makes the comments worth reading. One thing I took a long time to learn is that there are always unseen constraints when viewing a system from the outside, and they usually account for any "why didn't they do this basic thing that every professional does?" Assuming it's because of an unseen constraint (rather than incompetence) and trying to reverse engineer the constraint can be a lot of fun and enlightening, even though in most cases it comes down to money.
posted by Mitheral at 6:21 AM on November 15, 2021 [22 favorites]


The constraint I wonder about is: why upgrade the database at all? It sounds like upgrading may have fixed some minor bugs, but I wonder if the risk was really worth it (in hindsight, maybe not). Probably it does come down to money -- the new version theoretically performs better, which equates to some amount of savings on the hosting bill, so of course upgrade.

There's probably an interesting thought experiment here about viewing late-stage capitalism through the lens of cloud computing (and the economic incentives that drive it), and vice-versa.
posted by swift at 6:58 AM on November 15, 2021 [1 favorite]


Very interesting window into some form of modern service administration. I wonder how representative this is? I have to admit that, as someone who (as a hobby only) maintains some much older, creakier, and smaller databases/services on much-less-funded hardware, I am utterly flabbergasted at the apparent casualness with which they decided to upgrade the database by a major version to a brand new release, something the author is still defending in the comments. I guess (based on the replies) they needed the performance around some specific db issue, and have saved money in consequence, but still, I can't help but think that there are some not-explicitly-valued costs that the calculation has not really factored in. But I guess at this point they've probably paid them...
posted by advil at 7:00 AM on November 15, 2021 [2 favorites]


Hadjibagheri responds in the comments and makes the comments worth reading

I was amused by the comment where he says a pro of using the system they do is that he has the core dev for Postgres on speed dial.
posted by knapah at 7:26 AM on November 15, 2021 [4 favorites]


According to the PostgreSQL documentation, it sounds like JIT should not have been turned on in the first place for the types of queries they were making. Maybe bad defaults?

Sounds like they also need some sort of plan stability system to make sure query plans don't suddenly change in production.
posted by credulous at 10:04 AM on November 15, 2021 [1 favorite]
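
For context on the comment above: JIT can be switched off without touching anything else, either per session, per role, or cluster-wide in postgresql.conf. Here is a minimal sketch in Python, assuming psycopg2 and a placeholder connection string; nothing in it reflects the dashboard's actual configuration.

# Minimal sketch: disable PostgreSQL's JIT compilation for one session.
# The connection string is a placeholder, not the dashboard's real database.
import psycopg2

conn = psycopg2.connect("dbname=example")  # hypothetical DSN

with conn.cursor() as cur:
    # jit has defaulted to 'on' since PostgreSQL 12; for short OLTP-style
    # queries the compilation overhead can outweigh any benefit, so turn it
    # off for this session (or raise jit_above_cost instead).
    cur.execute("SET jit = off;")
    cur.execute("SHOW jit;")
    print(cur.fetchone())  # ('off',)

conn.close()

The same setting can go in postgresql.conf (jit = off) to apply it cluster-wide rather than per session.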


They apparently only tested it for 3 days. That's just crazy for a database upgrade.

"we deployed the new version in our development environment in the evening of Wednesday, 13 October 2021. It went very smoothly and we monitored everything for 48 hours and did some load testing before setting off the deployment to the production environment at midnight of Saturday, 16 October 2021"
posted by The_Vegetables at 10:25 AM on November 15, 2021 [2 favorites]


J. Walter Weatherman voice: And THAT's why you never update a critical dependency to a new version the week it comes out.

More generally, serving an API straight from a DB is a good idea until it isn't. And when you hit the point of "nobody has ever tried to scale a DB like this"... maybe there's other infrastructure you can use to remedy this?

Obviously, like, adding a whole indexing step to package the data for serving is a much bigger change than just making your DB scale more, but if your load is like 99.9% reads and you are spending significant $$ and dev time on addressing DB performance, it's time for a change. And sure, sure, they have caching, they use Redis, I get it, but what happens if the cache falls over? You have to really cut the cord and say "how can we make it so (the vast majority of) our open data API queries are guaranteed never to hit the database?" You end up with a persistent non-relational storage layer that gets updated on DB updates, it can be like a NoSQL database or something more like ElasticSearch depending on your use case, but the key is that it is reliable and up-to-date and fully distributed. Now your Postgres database is a convenient source-of-truth and query engine rather than a critical query-time dependency.
posted by goingonit at 2:00 PM on November 15, 2021 [4 favorites]
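
A rough sketch of the pattern goingonit describes, and emphatically not the dashboard's actual architecture: at publish time, render the open-data payloads out of Postgres once and push them into a distributed key-value store, so ordinary read traffic never touches the database. The table, keys, and connection strings below are made up for illustration (Python, using psycopg2 and redis-py).

# Sketch only: pre-render open-data API responses into a key-value store at
# publish time so read traffic never hits PostgreSQL. All names are made up.
import json

import psycopg2
import redis


def publish_snapshot(pg_dsn: str, redis_url: str) -> None:
    pg = psycopg2.connect(pg_dsn)          # source of truth, queried once per publish
    kv = redis.Redis.from_url(redis_url)   # replicated KV store that serves the reads
    with pg.cursor() as cur:
        # Hypothetical table; in practice this would be whatever the API exposes.
        cur.execute("SELECT area_code, date, new_cases FROM cases_by_area")
        pipe = kv.pipeline()
        for area_code, date, new_cases in cur:
            key = f"api:cases:{area_code}:{date}"
            pipe.set(key, json.dumps({"newCases": new_cases}))
        pipe.execute()
    pg.close()

The API handlers then read only from the key-value store, so Postgres becomes a publish-time dependency rather than a query-time one.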


As a lay person, how much of this is 'good experience tells you not to leap this far, this fast because..." and how much is REALLY, 'don't change anything until you have to because the internet/IT is a straw hut in a fireworks factory and if it's working don't touch a motherfnthing'?

I'm just thinking back to the banking/COBOL thread earlier, and it seems like the combination of the ephemeral nature of products (and especially of product maintenance and forward/backward compatibility) with the sheer complexity of any given system means that you just hold on to working functionality for dear life.

Is that right?
posted by Reasonably Everything Happens at 6:03 PM on November 15, 2021 [1 favorite]


> As a lay person, how much of this is 'good experience tells you not to leap this far, this fast because..." and how much is REALLY, 'don't change anything until you have to because the internet/IT is a straw hut in a fireworks factory and if it's working don't touch a motherfnthing'?

"Don't change anything until you have to" is of course the safest way to avoid this specific kind of problem, but it has a cost as well. It can be tricky to assess when to take advantage of the shiny new toys vs. wait a release or two, but in this instance, I feel like 2 or 3 days of testing of a major release of a product that is a single point of failure in your architecture is barely tolerable for a small startup, much less a critical national service like this. PostgreSQL has a well-deserved reputation as a very stable product, but a major release is a major release, and you have to expect problems.

It may be true that even weeks more of testing may have never tickled this particular bug, and simulating the exact conditions of real users hitting your service isn't easy, so I'm not saying that longer QA cycles are a panacea. But you have to at least give yourself a fighting chance to expose these very subtle bugs, and I find it hard to believe that can be done in such a short amount of time, even in a mature development shop.

What end user of the UK COVID-19 Dashboard was clamoring for them to upgrade to Postgres 14? None, of course. Was the improved partitioning performance going to maybe save you a few percentage points on your Azure bill? Maybe. Is that savings worth risking an outage like this? The existence of this blog post suggests not! Maybe there were other factors not illuminated by the post, but the way the author talks about it, it seems more like "well, it's good enough for Microsoft, so let's give it a go", and that's just not how any of this is supposed to work.
posted by tonycpsu at 6:52 PM on November 15, 2021 [4 favorites]


Worth mentioning that you can only delay patches and upgrades up to a point, because some bugs are security vulnerabilities and if you don't fix those you get pwned.

Thanks for this; it was great discussion fodder in my data-management class tonight.
posted by humbug at 7:22 PM on November 15, 2021 [4 favorites]


Also, we don't know what issues they were having that the update resolved, or, probably just as importantly, what the office political situation is like. Hadjibagheri said in the comments that they still felt the risk was worth it, and they are in a much better position to judge than anyone reading the write-up.
posted by Mitheral at 7:34 PM on November 15, 2021 [2 favorites]


> Worth mentioning that you can only delay patches and upgrades up to a point, because some bugs are security vulnerabilities and if you don't fix those you get pwned.

To the extent that this is true, this particular case is a pretty poor example of that principle given that this is a new major release of a product with very generous policies about supporting old releases and backporting security patches.

Even in the general case, it's simply not true that security vulnerabilities must be patched immediately or even promptly. The case to upgrade or not has to consider whether the attack surface of the vulnerability really represents an imminent threat. Often, there are ways to mitigate vulnerabilities without patching, or until a patch is available. And of course patches can introduce new bugs. The infosec community is getting better at classifying vulnerabilities, but how vulnerable any individual user of a given library / service is can't be boiled down to a simple categorical field. You have to do your own analysis, which is really hard to do without a dedicated security team.
posted by tonycpsu at 7:45 PM on November 15, 2021


The constraint I wonder about is: why upgrade the database at all?

Because the upgrade treadmill is genuinely a thing, and stepping off it is almost always a frying-pan-into-fire kind of decision. Security threats accumulate over time; they don't go away on their own, and the less time the release you're using has been out there in the wild, the less time it's had to get pwned.

this particular case is a pretty poor example of that principle given that this is a new major release of a product with very generous policies about supporting old releases and backporting security patches

Sure. But the basic choice here is between implementing ways to stay in step with upstream releases of critical components in a reasonably timely fashion that don't make every upgrade a potential disaster, which is hard, or making it de facto organizational policy to kick the upgrade can down the road at every opportunity. And the latter choice means that you will eventually run out of upstream generosity and will be in for a world of pain when the part you now urgently need to update is no longer compatible with any of your existing infrastructure, because you've been treating all of that the same way and the world has moved on.

Technical debt has upsides just like all the other kinds of debt, but it's best kept under control. These are complex decisions that should get serious consideration from experienced people and not be left to opinionated ideologues.
posted by flabdablet at 8:04 PM on November 15, 2021 [5 favorites]


So for those saying that it was rushed or they should've done more testing or whatever, it's worth realizing that the issue occurred 2 weeks after the DB update was deployed, and also that they fixed the issue in, what, just over 12 hours after it began.

13 Oct: Release to dev
16 Oct: Release to prod
< more stuff gets deployed, nothing goes wrong >
31 Oct: Runaway failure began/detected around noon, mitigated (ish) around 4pm
1 Nov: Issue resolved just after midnight
posted by Xany at 8:20 PM on November 15, 2021 [5 favorites]


From a reply in the comments:
you're looking at it from the outside - which is always easier. We don't always have the resources to do all the nice things that are written in textbooks. I created this service from the scratch in two weeks, and have since been working on average 16 hours a day to maintain it.
And that right there is where this opinionated ideologue spots what I'm convinced is the real problem here. No way should any critical national service ever be set up in a way that relies on having anybody take on an average 16-hour daily workload.

Even if, as in this instance, they found a lead dev competent enough to track and fix this kind of cascading failure as fast as they did, it's unsustainable. Your best people are the last ones you want to be burning out.
posted by flabdablet at 8:26 PM on November 15, 2021 [19 favorites]


"Don't change anything until you have to" is of course the safest way to avoid this specific kind of problem

cf. the previous thread about COBOL
posted by hippybear at 8:38 PM on November 15, 2021 [3 favorites]


Yep. "Don't change anything until you have to" is best understood as a flavour of "don't pay for anything until you have to".
posted by flabdablet at 8:40 PM on November 15, 2021 [1 favorite]


There's a comment on this YouTube video that wins today's Internet:
Had to do this to a server just the other day. Younger workers who only grew up with Next Generation Star Trek had no idea how to handle it.
posted by flabdablet at 8:48 PM on November 15, 2021


As much as I like PostgreSQL (which must be a lot since I've been using it for over 20 years now), I've never been one to upgrade major releases until I've had plenty of time to see what kinds of issues other people are having. Even when there are large performance improvements to be had, I still wait since my database is obviously already working well enough.

That said, I'm not going to second guess other people's reasoning as long as there is some reasoning behind their choice and it doesn't fly directly in the face of reality. I'd question the decision if it was for security updates because security updates for PostgreSQL are available for an almost ridiculously long time, but that doesn't seem to be the case here.
posted by wierdo at 8:50 PM on November 15, 2021 [2 favorites]


Interesting writeup, the lead dev is a talented bloke, but... I tried using the covid dashboard a while back and it was utterly unusable. Very slow to do some operations and often I gave up waiting to see if it was going to respond. A major usability problem is there was no "I'm working" indicator, a spinning hourglass or whatever. The issues were so severe I have not been curious enough to go back and see if it has been improved
posted by epo at 12:34 AM on November 16, 2021


> Yep. "Don't change anything until you have to" is best understood as a flavour of "don't pay for anything until you have to".

That's why I included the "but it has a cost as well" which was omitted from the pull quote. I started my career at a large investment company that was largely dependent on DB2 and COBOL at the time, and is probably now using some of that same legacy code long past its 50th birthday. While I was there, we were constantly making difficult decisions about devoting resources to rewrite old code vs. let it do its job until we're all dead, and it's not as simple as "technical debt: bad!"

Sometimes you can put a nice boundary around a well-understood pile of legacy code that lets you design to the interface instead of the code that may or may not have gremlins living in it, but at least has the benefit of having done its job well for longer than you've been alive. Sometimes it's worth paying those grizzled COBOL vets many times what you'd pay a Python programmer to do the same thing just to not have to rush a project to replace legacy code.

The costs pile up, which I acknowledged, but you have to tally those costs, and you have to estimate the benefit from using the newest (or newer) versions of things as well, or you're just trading one set of problems for another. Just because there's some bad half-century-old code lurking in financial firms that we can't get rid of doesn't mean that you should rush upgrades of systems where the oldest component might have been written in the 1990s.

Timely patch and minor version upgrades are good, but moving to a new major Postgres version within a month of its release date based on reasoning that didn't make it into a ~4,000 word blog post is at the very least questionable.
posted by tonycpsu at 3:20 AM on November 16, 2021 [1 favorite]


how much is REALLY, 'don't change anything until you have to because the internet/IT is a straw hut in a fireworks factory and if it's working don't touch a motherfnthing'?

A lot more than you probably would be comfortable with. Like, there's a pretty crucial system where I work that's still running on top of (God help us all) an Access 97 database, and the people responsible for it are just now maybe starting to think about possibly having meetings to try to decide how to replace it. In the mean time, we all just kind of don't look in its direction and hope it doesn't fall over.
posted by Mr. Bad Example at 4:24 AM on November 16, 2021 [1 favorite]


I tried using the covid dashboard a while back and it was utterly unusable. Very slow to do some operations and often I gave up waiting to see if it was going to respond. A major usability problem is there was no "I'm working" indicator, a spinning hourglass or whatever. The issues were so severe I have not been curious enough to go back and see if it has been improved

It's pretty speedy in my experience; could you have been caught up in the problems described in the post?
posted by knapah at 4:25 AM on November 16, 2021


IT is a straw hut in a fireworks factory

but don't you worry about a thing, because the hut gets replaced on a 6-week release cycle and sometimes its doors and windows even end up facing the same way. Which doesn't usually leave a fireworks-laden forklift glitched through any of the walls. Usually. Well, except for that legacy one they always have to make sure stays glitched through the laundry wall, because the accounting system had to get emergency patched to expect that twelve years ago and now nobody gets paid if it doesn't. But it's OK, because the laundry room has fire retardant paint now.
posted by flabdablet at 7:41 AM on November 16, 2021 [3 favorites]


Speaking from experience, one reason to upgrade a system which is working fine and doesn't actually need the upgrade is so you don't wind up in an "upgrade hole", where you can't upgrade anymore because the version you're on is so old that no upgrade path exists to the latest version. That bit me in the ass when some security bug was revealed a few years ago.
posted by ob1quixote at 9:28 AM on November 16, 2021 [2 favorites]


And that right there is where this opinionated ideologue spots what I'm convinced is the real problem here. No way should any critical national service ever be set up in a way that relies on having anybody take on an average 16-hour daily workload.
Have you met Her Majesty's Government?
posted by fullerine at 9:59 AM on November 16, 2021 [2 favorites]


I tried using the covid dashboard a while back and it was utterly unusable.

Hmmm, I retract that, I was obviously looking at a different dashboard which I had misremembered as the UK govt one :-(
posted by epo at 11:10 AM on November 16, 2021


And that right there is where this opinionated ideologue spots what I'm convinced is the real problem here. No way should any critical national service ever be set up in a way that relies on having anybody take on an average 16-hour daily workload.

Even if, as in this instance, they found a lead dev competent enough to track and fix this kind of cascading failure as fast as they did, it's unsustainable. Your best people are the last ones you want to be burning out.


You'd think that when there is a raging pandemic you are tracking, you would build in some redundancy.
posted by srboisvert at 12:17 PM on November 16, 2021 [2 favorites]


Have you met Her Majesty's Government?

No, but I did watch in horror from afar as the worldwide gullibility pandemic sweeping through the US and my own country brought the UK to its knees as well.

You'd think that when there is a raging pandemic that you are tracking you would build in some redundancy.

Any such thought would have to rest on believing that most of the public is clueful enough never to allow the highest offices in the land to be occupied by a malignant waffling clown car of Brexit spivs and chancers like that responsible for managing today's unfolding catastrophes.
posted by flabdablet at 12:52 AM on November 17, 2021 [2 favorites]




This thread has been archived and is closed to new comments