Know When To Run
January 25, 2015 1:17 AM

Over Christmas, engineering works on train lines into London failed. This is a review of the report into that failure. It's a fascinating read about cascading failure and errors in project planning. And, for once, read the comments.
posted by Gilgongo (50 comments total) 63 users marked this as a favorite
 
I travelled from Edinburgh to Finsbury Park on the 27th. Since we boarded our scheduled service, with seat reservations, at the start of its route, we were spectacularly lucky. There were a lot of people standing or sitting in the aisles further down the route.
posted by psolo at 1:44 AM on January 25, 2015


Great example of the planning fallacy in action. It's the first-year university student who creates a big timetable mapping out when he is going to do revision for all of his exams and leaves Sunday as a "day off".
posted by Another Fine Product From The Nonsense Factory at 1:46 AM on January 25, 2015 [4 favorites]


Site's down for me
posted by clorox at 2:15 AM on January 25, 2015


There is a Google cache of the text
Know When To Run: The Story Behind The Xmas Kings Cross Problems
posted by rubber duck at 2:17 AM on January 25, 2015 [2 favorites]


The comments are full of train geeks nattering about track redesign. They kept referencing "carto metro", which I looked up. Gosh. Train geeks, I love you and will never say a harsh word about you.
posted by Joe in Australia at 2:24 AM on January 25, 2015 [18 favorites]


This was a great read, and from the comments - an article on decision making and the 'Rule of Three': The proposal is that three marginal conditions should be considered as equivalent to a single exceeded limit when deciding to halt operations. I can see that being a useful tool in all sorts of project management.
posted by Gin and Broadband at 2:28 AM on January 25, 2015 [28 favorites]
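
As a rough illustration of that Rule of Three heuristic, here is a minimal sketch; the status labels, threshold values, and example checks are invented for illustration, not taken from the article or the report.

```python
# A minimal, hypothetical sketch of the "Rule of Three" heuristic described
# above: treat three marginal ("amber") conditions as equivalent to one
# exceeded ("red") limit when deciding whether to halt and reassess.
# Status names and the example checks are illustrative, not from the report.

def should_halt(statuses):
    """statuses: list of 'green', 'amber' or 'red' condition assessments."""
    reds = statuses.count("red")
    ambers = statuses.count("amber")
    # One hard limit exceeded, or three marginal conditions, triggers a halt.
    return reds >= 1 or ambers >= 3

if __name__ == "__main__":
    checks = ["green", "amber", "amber", "amber"]   # three marginal conditions
    print(should_halt(checks))                      # True: stop and reassess
```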


Incidentally, my rule of thumb is that I allow timetables one degree of slippage. If a train or plane is delayed by an hour, fair enough. If that one hour becomes an hour and a half, it's odds-on that it will continue to be delayed and may even be cancelled altogether. This rule has correctly guided me to seek accommodation or alternative transport many times, and has saved me a good deal of trouble.
posted by Joe in Australia at 2:28 AM on January 25, 2015 [13 favorites]
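
A toy sketch of that rule of thumb, reading "one degree of slippage" as the initially announced delay; the interpretation and numbers are illustrative only.

```python
# Hypothetical sketch of the "one degree of slippage" rule of thumb above:
# accept the first announced delay, but if the delay then grows beyond it
# (e.g. one hour becomes an hour and a half), assume it will keep slipping
# and start looking for alternatives. Thresholds are illustrative.

def keep_waiting(initial_delay_min, current_delay_min):
    return current_delay_min <= initial_delay_min

print(keep_waiting(60, 60))   # True: still within the first degree of slippage
print(keep_waiting(60, 90))   # False: time to seek accommodation or another route
```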


"The new log grabs had never been used on these particular, or indeed any, RRVs before and it soon became apparent that – for reasons as yet unclear – the fittings didn’t fully match."

Typical. That is absolutely typical of the modern world.
posted by rubber duck at 2:30 AM on January 25, 2015 [1 favorite]


I planned to travel through King's Cross on the 27th, but thankfully in the evening when the load had subsided. The main impact on my journey was that I was standing all the way between Bristol and London, despite having reserved seats. Irritating, but it's what I've come to expect when doing the Christmas run.

It's interesting to read the report, and I'm surprised how quickly it's come out. That said, relying entirely on brand-new, untested equipment for a mission-critical job is just insane. Someone needs to tell them about bathtub failure curves.
posted by YAMWAK at 3:00 AM on January 25, 2015
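
For anyone unfamiliar with the bathtub curve, here is a rough sketch of its early "infant mortality" portion, using a Weibull hazard with shape below one as a common stand-in; the parameters are invented and nothing here is specific to Network Rail's kit.

```python
# Illustrative sketch only: the early, falling part of a bathtub failure curve
# is often modelled with a Weibull hazard whose shape parameter is < 1, i.e.
# brand-new kit fails more often per hour of use than kit that has been
# "burned in". Parameters below are invented for illustration.

def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (k/lambda) * (t/lambda)**(k-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

for hours in (1, 10, 100, 1000):
    rate = weibull_hazard(hours, shape=0.5, scale=500.0)
    print(f"{hours:>5} h in service: failure rate ~ {rate:.4f} per hour")
# The rate drops as hours accumulate -- which is exactly why you don't want
# to rely on never-used equipment for a job with a hard deadline.
```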


The basic problem is that they just don't care enough about disruption. They should have allowed far more margin for error in order to ensure that travellers weren't put through hours of misery. Of course they tried to avoid it, but they were prepared to take ridiculous levels of risk, and do staggeringly little by way of mitigation. Shunting hundreds of people in a disorganised way through a distant and unsuitable station with reduced service and no information would have amounted to acceptable, capable coping in their eyes if not for the media attention. In fact it still does, really.

Nothing will change until some of the people who make these careless, disengaged decisions with no human empathy at all are sacked, or jailed. Alright, maybe not jailed, but when you've been treated like this for many years on London trains it's hard to think about this stuff temperately.
posted by Segundus at 3:13 AM on January 25, 2015 [4 favorites]


Mind the gap.
posted by breadbox at 3:25 AM on January 25, 2015


Thanks for this post; the article explains a lot. This snafu affected car travel, too. My partner and I were stuck in traffic for 7 hours, trying to head up the A1 (just north of London) to the North (York) on the 27th. The traffic problems cascaded the whole way, getting worse in the parts where the roads couldn't handle the overflow. It seemed worse heading into London. We didn't have a radio in the car, but the TV monitors in the service station displayed an utter snarl of transportation hell. We both had massive colds and filled our car with snotty tissues as we attempted to entertain ourselves for hours with stupid puns about congestion.
posted by iamkimiam at 3:36 AM on January 25, 2015 [2 favorites]


That is the most civilized comment section I've ever seen.
posted by MexicanYenta at 3:41 AM on January 25, 2015 [13 favorites]


The recommendations seem so obvious that I'm not surprised that no one thought about implementing them. There seems to be a fundamental lack of quality thinking that manifests itself in all kinds of quality issues. It's like we still haven't figured out the nature of quality at a fundamental level that would enable us to prevent these errors from happening in the first place. Heuristics like the Rule of Three are fascinating because they suggest that practicing quality doesn't necessarily require complex tools and frameworks. Seems significant that the Rule of Three approaches quality from a psychological perspective and not a technological or financial one.
posted by Foci for Analysis at 4:16 AM on January 25, 2015 [2 favorites]


"That said, relying entirely on brand new, untested, equipment for a mission-critical job is just insane. Someone needs to tell them about bathtub failure curves."

I was going to post almost exactly this. It's a surprising lapse for NR, who most certainly know about bathtub curves, or at least they did when I worked on reliability modelling for them.
posted by Just this guy, y'know at 4:28 AM on January 25, 2015 [1 favorite]


As a software tester, I'm both amused and saddened by the fact that "test the equipment before it's put into production" is point 2 on the list of recommendations. I have a feeling it was recommended on this project as well but ignored because Artificial Deadline "had" to be met.
posted by Sheydem-tants at 4:40 AM on January 25, 2015 [3 favorites]


Great read(s). In a wider context, many of the problems described in the article and report are cascading errors that took place in the context of a wider system that had been hollowed out from within (in the name of efficiency), reducing staffing and skill levels to what someone somewhere had previously decided to be an acceptable minimum. For instance, there were not enough drivers on site, and not enough platform staff once the problems had arisen. While minimal staffing may work for day-to-day ops, for non-standard ops with problems arising, this quickly falls apart. But that's a choice you make. This does not seem to have made its way into the report's recs, however, which are mainly of the typical 'mistakes were made' variety.
posted by carter at 5:43 AM on January 25, 2015 [6 favorites]


Sheydem-tants, I have said, in progress meetings, "I am writing tests for the piece of functionality that broke last week, that went into production without tests, that broke our entire service" (paraphrased) and immediately had a project manager say "Tests? It's working now, why do we need tests? Can't you work on New Feature X?" Even now, nearly two decades after automated testing started to come into its own and be a respectable software discipline, respected(?) members of the community still argue against the lessons we've learned.

The moral? It is easy to argue about best practices that "cost too much" until you see what it costs you not to follow them. The same is true for me, of course—I didn't value a fast test suite until I worked on projects without one.
posted by sonic meat machine at 5:43 AM on January 25, 2015 [4 favorites]


In a wider context, many of the problems described in the article and report are cascading errors that took place in the context of a wider system that had been hollowed out from within (in the name of efficiency), reducing staffing and skill levels to what someone somewhere had previously decided to be an acceptable minimum.

We have a lot of systems (both institutional and infrastructural) that we have chosen to hollow out, gambling that we won't need the resiliency of a more robust system. That works right up until it doesn't, and large-scale failures are probably the cost of choosing not to invest in those systems.
posted by Dip Flash at 6:10 AM on January 25, 2015 [3 favorites]


Excellent advice and a Dickens reference to boot:

[T]he problems with the people on the ground declaring an overrun (or not when they should have) are well known. Whilst I can forgive almost everything else as bad luck or just overlooked and not fully understood and appreciated, it is well known that you should NEVER allow the people on the ground to decide when to announce such things as they will inevitably adopt a Mr Micawber approach and hope and believe that “something will turn up” that will save the day.

What should have happened is that for each period there should be fixed instructions detailing when, at the latest, an overrun MUST be declared, e.g. if more than an hour behind schedule on Christmas Day (even accounting for all contingency time being used) then an overrun MUST be declared immediately.

posted by CheeseDigestsAll at 6:20 AM on January 25, 2015 [1 favorite]
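
A minimal sketch of the kind of fixed "must declare" instruction quoted above; only the one-hour-behind example comes from the comment, and the function name and numbers are invented for illustration.

```python
# Sketch of a fixed, non-negotiable overrun declaration rule as described
# above: if the work is more than an hour behind schedule on the key day,
# even after all remaining contingency is accounted for, an overrun MUST be
# declared. Only the one-hour example comes from the comment; the rest is
# illustrative.

def must_declare_overrun(hours_behind, contingency_hours_remaining):
    effective_delay = hours_behind - contingency_hours_remaining
    return effective_delay > 1.0   # the fixed, pre-agreed threshold

print(must_declare_overrun(hours_behind=3.0, contingency_hours_remaining=1.5))
# True -> declare now, rather than hoping "something will turn up"
```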


Reliability is invisible, and who wants to spend resources on invisible things? That's why insurance is so often legally mandated.

Reliability isn't invisible to engineers, of course, but they don't get to spend the money. They can help plan and advise, of course, but only those who are full-time management get to decide. And even if you were once an engineer, once you're in full-time management you are drifting away from engineering. You start to have other concerns. Reliability starts to become invisible - until the lack of it becomes very visible indeed.

Nothing new here, sadly. The cultural driver behind it is very deep. It is getting better - the UK, at least, has become much better at managing very large capital projects - but it's still a long way away from being part of established managerial thinking.
posted by Devonian at 6:21 AM on January 25, 2015 [6 favorites]


Why isn't there a way for someone or some committee way up in authority to declare an emergency, and let people go past shift limits up to 50% for 3 days max or something? This would have stopped a lot of stuff from going wrong.
posted by MikeWarot at 6:42 AM on January 25, 2015


Article author chipping in. (As I've mentioned on here before, John Bull is my pen-name).

There's an awful lot of overlap between software/web project management and engineering and rail project management and engineering. Ultimately infrastructure is infrastructure and systems are systems, whether digital or physical. It's always fascinating to see how the same problems appear in both worlds - normally a complete failure to appreciate the need for redundancy and careful failure management until it happens.

The moment the Christmas delays happened I remember thinking that the "what went wrong" was going to be fascinating and once the report came out, and from the feedback we got behind the scenes from those who were involved in some way, it was a no-brainer to write it up.

We have a rule against editorialising on LR, so I kept my personal feelings out of the post. But what I found most interesting was that 90% of their issues really come back to no one in either local or senior management being brave enough, or having enough authority, to step back and go:

"No. Wait. This is no longer the project we planned to do. We need to step back and reassess it all."

As was pointed out in the comments, this is the Rule of Three problem. But I've always colloquially known it as the "Kenny Rogers Problem" (which influenced the article title):

"You've got to know when to hold them, know when to fold them, know when to walk away..."

That is the most civilized comment section I've ever seen

Thanks. It is in part modelled on Metafilter in terms of what we expect/tolerate in tone and discussion.

It takes an awful lot of work to keep it that way. We have two volunteer mods in different time zones quietly but politely killing anything we don't feel fits the site tone. It's worth it though. I'm genuinely proud of the level of commentary we get - and that you're just as likely to see one of the senior execs of TfL or Network Rail pop up in the comments (sometimes anonymously, sometimes not) as a whole raft of people who are either current or ex rail staff and really know what they're talking about. It not only keeps us honest, but means the discussion can get very interesting.

The first comment on this post about Thatcher's plan to get rid of Marylebone remains one of my favourite things on LR, hell the internet, ever, and this one from further down runs it a close second.

Also this is my second fping in as many weeks. My head is totally not going to fit through the door now.
posted by garius at 6:45 AM on January 25, 2015 [84 favorites]


> I have a feeling it was recommended on this project as well but ignored because
> Artificial Deadline "had" to be met.

You can expect that first enormous climate-change geoengineering project to have one of those too. Count on it.
posted by jfuller at 7:05 AM on January 25, 2015


garius, I think "...90% of their issues really come back to no one in either local or senior management being brave enough, or having enough authority..." is a little bit wrong, although "having enough authority" is close to it. The problem is more fundamental than that: all it takes is one link in the chain of command to destroy the possibility of a "brave" action.

Consider this management hierarchy:

Programmer → Manager → Director → VP → CIO → CEO

This was my management chain at a previous company through several project iterations (with different people in the different roles at different times). I experienced, at different times, a manager who ignored any of my concerns; a manager who listened, but a director who didn't; a manager and director who listened, but a VP who didn't; and a manager, director, and VP who listened, but a CIO who didn't. At one point, I even had the ear of a VP but not my Manager. This project ended up being a massive money-sink for the company as long as I was there (I left for greener pastures as soon as it became clear even the CIO was incompetent).

All it takes is one link in the chain to think the engineer or programmer is being alarmist, or to think that his great vision will Save The Day, and to Hell with the guys who say we need to rewrite the (comparatively tiny) core module that is causing us horrible problems but is politically squirreled away in the domain of our most incompetent programmer... Bruce Webster calls this the Thermocline of Truth, and I think part of the reason flatter, more agile companies tend to do better in technology is simply because there are fewer opportunities for an idiot to sink the ship.
posted by sonic meat machine at 7:05 AM on January 25, 2015 [15 favorites]
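
A back-of-the-envelope way to see the point: a concern only reaches the top if every layer passes it on, so the survival chance is the product of per-link probabilities, and a single unwilling link takes it to zero. The numbers below are invented.

```python
# Toy model of the "one link in the chain" problem described above: a concern
# only reaches the top if every layer passes it on, so the survival chance is
# the product of per-link probabilities. All numbers are invented.
from math import prod

chain = {
    "Manager": 0.9,
    "Director": 0.9,
    "VP": 0.9,
    "CIO": 0.0,   # one link that thinks the engineer is being alarmist
    "CEO": 1.0,
}

print(prod(chain.values()))          # 0.0 -- the concern never arrives
print(prod(0.9 for _ in range(5)))   # ~0.59 even when every link is "mostly" receptive
```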


Very true - you're right.

That Thermocline of Truth link is absolutely on the money and again could absolutely be applied to rail engineering as much as to IT.
posted by garius at 7:13 AM on January 25, 2015 [1 favorite]


Why isn't there a way for someone or some committee way up in authority to declare an emergency, and let people go past shift limits up to 50% for 3 days max or something? This would have stopped a lot of stuff from going wrong.

Because the railways have "safety culture" ingrained into their very bones: it's simply unthinkable. Nobody in management would order it & the train drivers would refuse to drive and their union would back them up - the train drivers' union (the RMT) is one of the most powerful unions in the UK.

Asking train drivers to work over their shift hours is inviting an accident involving hundreds of tons of fast moving heavy machinery that could easily kill hundreds of people. It's just not going to happen.
posted by pharm at 7:23 AM on January 25, 2015 [19 favorites]


Why isn't there a way for someone or some committee way up in authority to declare an emergency, and let people go past shift limits up to 50% for 3 days max or something? This would have stopped a lot of stuff from going wrong.

The extension would be planned into the contingency and become standard. So basically there would be an emergency declared every time there was any problem, with attendant safety issues arising from overworked staff, who would effectively be endangered by cost engineering. The unions wouldn't take it, as pharm says, and why should they?
posted by biffa at 8:08 AM on January 25, 2015 [7 favorites]


Even now, nearly two decades after automated testing started to come into its own and be a respectable software discipline, respected(?) members of the community still argue against the lessons we've learned.

Seems to me he's arguing against fundamentalist forms of TDD programming, not automated testing in itself.
posted by effbot at 8:25 AM on January 25, 2015


effbot, the problem is that it amounts to the same thing. Another article he wrote (and speeches he's given) argues against "perverting your design for testing," which actually just means... making your design testable (which itself has other benefits). Rails (DHH's baby) is also notorious for making it hard to test vast swaths of functionality, and next to impossible to avoid database coupling.
posted by sonic meat machine at 8:38 AM on January 25, 2015
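
A small, generic illustration of what "making your design testable" can look like in practice; this is not Rails-specific, and the repository name is invented.

```python
# Generic illustration of the "design for testability" point above: pass the
# data source in as a dependency instead of reaching for the database inside
# the function, so a test can substitute a plain in-memory stand-in.
# The repository name and data are invented for illustration.

def overdue_invoices(repository, today):
    """Pure logic: no database coupling, trivially testable."""
    return [inv for inv in repository.all_invoices() if inv["due"] < today]

class FakeRepository:                 # what a test would supply
    def all_invoices(self):
        return [{"id": 1, "due": 5}, {"id": 2, "due": 20}]

print(overdue_invoices(FakeRepository(), today=10))   # [{'id': 1, 'due': 5}]
```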


The comments are good, especially this one from Fandroid.

The media response to the Kings X problems reminded me strongly of the old joke about the 6 stages of a project:

Enthusiasm
Disillusionment
Panic
Search for the Guilty
Punishment of the Innocent
Rewarding those who took no part

posted by selfmedicating at 8:56 AM on January 25, 2015 [15 favorites]


Reading the Thermocline of Truth I keep thinking of the Columbia report.

This is why the "they just don't care enough about disruption" understanding is off-base. Have a look at any major NASA failure reports. Simply caring more or throwing more resources at the problem or doing more testing doesn't significantly improve an organization's ability to accomplish projects that involve significant risks. Obviously these components are necessary, but they're not sufficient. It's way harder and way more elusive to set up your organization to properly transmit and handle bad news and negative results, but these are key elements of actually managing risk. Failures will happen regardless of testing or planning or caring or schedule buffer; recovering is the crucial skill.
posted by kiltedtaco at 9:05 AM on January 25, 2015 [2 favorites]


Why isn't there a way for someone or some committee way up in authority to declare an emergency

You'd have to design a definition of 'emergency' to begin with, and then you'd also have to design a process for figuring out if an event actually was an emergency. Then, once the framework was in place, you'd have to assess whether or not any event qualified. Plus in this case you'd have to get a bunch of people together over the Xmas break to carry out this assessment.

It's not that it's not doable, but it's probably easier to design the work plan so this doesn't need to happen. Hopefully the right 'lessons learned' will be drawn from this for future work.
posted by carter at 9:13 AM on January 25, 2015


I'm an engineer myself, and honestly, while it was certainly a screw-up, I don't think it's the screw-up of colossal proportions that people are making it out to be. Logistics are hard, proper engineering is hard, and the results of the screw-up were that thousands of people were inconvenienced - no one died and nothing was destroyed.

Do note that there were several places in the story where they could have sacrificed safety for speed and they did not. And if they had, we would not have known about it - perhaps forever, if the culprits were lucky, or not for years regardless. Compare and contrast to the Challenger...

In ten years, when this year's work still has another 10+ years of use left in it instead of needing to be dug up and redone, those "same" commuters are going to get a break.

Again, a screw-up, but in the real world, there are screw-ups, and what matters is that you get through them in one piece. I'd definitely have a beer with the people responsible, whereas I couldn't be in the same room with the managers whose spinelessness orchestrated the Challenger disaster.
posted by lupus_yonderboy at 9:38 AM on January 25, 2015 [9 favorites]


“Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.” — Douglas Hofstadter
posted by ob1quixote at 9:48 AM on January 25, 2015 [3 favorites]


Again, a screw-up, but in the real world, there are screw-ups, and what matters is that you get through them in one piece.

I think you have to separate out the engineering from the operational failures here. On the engineering side, yes the failures have a certain element of "these things happen" about them. Ultimately the work got done, nobody died and it was only a day late.

What shouldn't have happened, though, is the level of disruption that resulted on the operational side. It shouldn't have been the case that a full 24hr delay wasn't even considered a risk at all. If it had been, the experience of passengers and indeed staff would have been sub-par, but it wouldn't have been chaotic and awful.

Indeed although I don't make an enormous thing of it in the article, reading between the lines of the actual report it's pretty clear that Network Rail (who ultimately are only responsible for the infrastructure, not the trains) are kicking themselves for being somewhat complacent in their dealings with the Train Operating Companies about how services would be handled if stuff went wrong.

They should have pushed and led more on making sure a proper worst case alternate timetable was planned, because the simple fact is that when it all did go wrong the TOCs just weren't able to cope because they had neither the station staff nor senior planners ready (i.e. the people who do timetables) to handle the consequences of a full 24hr delay.
posted by garius at 10:03 AM on January 25, 2015


Having been hit by weather systems caused by the Thermocline of Truth more than once in the past, I'm delighted to make its acquaintance by name. That alone makes this thread worth the price of admission - but on top of that, there's the LR write-up of the LKX debacle and Graham H's response... I know the LR website and try to avoid it, because it's a guaranteed multi-hour diversion from whatever it is I should be doing, but as a regular ECML long-distance punter, this was a treat. (Dreading the Dawn of Beardie, though...)

I do wonder whether the move to devops thinking - and the bigger move to BPR along the same lines - might help with this. Once you've got the sort of automated metrics and cross-silo visibility that you have to have for that kind of pipeline, and once you get away from the Big Bang mentality, it becomes hard if not impossible to hide bad news from anyone who wants to look - and upper layers of management will find it harder to abdicate responsibility in the guise of "Just get it DONE and don't tell me bad news that's YOUR fault" underling spankage. Infrastructure projects might seem harder to instrument than pure software dev, but that's a diminishing concern because of a whole range of trends linked (but not exclusively) to supply chain management and IoT changes. I suppose you could call it EngOps.

(Disclosure: I used to be a devops cynic, but now have something of the dangerous enthusiasm of a convert)
posted by Devonian at 10:11 AM on January 25, 2015


but on top of that, there's the LR write-up of the LKX debacle and Graham H's response

Which is even better once you realise that the Minister pointing randomly at bits of infrastructure to close in his A-Z is George Osborne's Father-In-Law.
posted by garius at 10:50 AM on January 25, 2015 [1 favorite]


Typical. That is absolutely typical of the modern world.

New tool mistake. Whenever you are embarking on a long, complicated thing, like a rail rebuild, a multi-day hike, running a marathon, whatever, you never bring new gear. Ever. You bring gear that you know works.

New gear gets used first on easy, unimportant things. Once it's proven out, then you use it for the hard jobs.
posted by eriko at 10:54 AM on January 25, 2015 [10 favorites]


It seems to me there should be no "good news" or "bad news", just "news". Obviously there is going to be some that people want to hear and some they don't, but there should be no penalty for "bad" news and no bonus for "good news".

In addition to the hollowing out of things in the name of "efficiency" we also have a culture where "just get it done" is an OK management edict even if the resources are not then allocated to "just get it done."

The rule of three is an excellent concept. It would have to be an absolute, though. I can easily see the colors being changed on the fly because you cannot be a hero if you don't take risks, and only heroes get bonuses and good reviews.
posted by maxwelton at 11:08 AM on January 25, 2015


I've seen a lot of people say engineering is hard. And while that is true, I think people don't always get that operations is just as hard. Garius is right, you need to separate the operational side from the engineering to see the issues more clearly.
As someone who spends my time in both worlds in IT, it amazes me how most management acknowledges the former but does not comprehend the latter. I can't count how many times I've been in meetings where I hear, "but that's just normal operations" or "that's what these guys do every day."
posted by herda05 at 12:21 PM on January 25, 2015 [2 favorites]


I'm an engineer myself, and honestly, while it was certainly a screw-up, I don't think it's the screw-up of colossal proportions that people are making it out to be. Logistics are hard, proper engineering is hard, and the results of the screw-up were that thousands of people were inconvenienced - no one died and nothing was destroyed.

I had this thought too as I read the article. I'm not familiar with the recent history of London's rail services, but how many operations like this do they pull off every year without major faults? Are they largely successful, and is that why this particular incident was so shocking? Or are disruptions like this par for the course?
posted by sbutler at 2:06 PM on January 25, 2015


I liked the way the article identifies the supply of train drivers as a major bottleneck. Drivers are a classic example of an essential resource that simply can't be increased in the short term. "Idle" workers are usually the first thing trimmed from budgets, but events like this show how important it is to have reserve capacity.

Driver time is special in another way. Once a driver's shift starts, it starts, which means that using a driver to (e.g.) shift an engine out of the way is effectively consuming that driver's entire shift. It looks as though this was the accelerator pedal to the project slippage: once they changed their plans they had to juggle trains to keep the lines clear, which meant putting drivers on shift to fix minor resource conflicts. This helped in the short term, but it was consuming shifts that they'd need to do the actual work. So the actions they were taking to reduce slippage actually caused massive slippage later.
posted by Joe in Australia at 3:16 PM on January 25, 2015 [2 favorites]
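
A crude sketch of that shift-accounting effect; all the numbers are invented, but the point is that each unplanned move quietly consumes a whole shift that the planned work will need later.

```python
# Crude illustration of the driver-shift bottleneck described above: once a
# driver's shift starts, it's spent, so even a short engine move consumes a
# whole shift that can no longer be used for the planned work later.
# All numbers are invented for illustration.

shifts_available = 10
planned_work_shifts_needed = 9

unplanned_moves = 2          # e.g. shunting engines out of the way
shifts_left = shifts_available - unplanned_moves

shortfall = planned_work_shifts_needed - shifts_left
print(f"Shifts left for planned work: {shifts_left}, shortfall: {shortfall}")
# Shifts left for planned work: 8, shortfall: 1 -- the "quick fixes" have
# already made the original plan impossible.
```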


Always use the Estimator's Rule: It always takes longer than estimated, even using the Estimator's Rule.

this is the only recursive joke I know
posted by charlie don't surf at 3:22 PM on January 25, 2015


ob1quixote: “Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.” — Douglas Hofstadter

What a horrible saying; I heard it once and I've never been able to finish anything since.
posted by traveler_ at 6:08 PM on January 25, 2015 [1 favorite]


Howell, who had clearly had his fortune told by No 10

What does this delightful turn of phrase mean? I presume it's not a compliment.
posted by Standard Orange at 9:13 PM on January 25, 2015


I think the implication is that Thatcher called him into No. 10 Downing St (the British Prime Minister's official residence), sat down, peered into a crystal ball and predicted "you are shortly going to have a meeting with the press. You will announce plans to convert railways to roads. You will be thoroughly persuasive, and will subsequently receive a promotion. Or so the spirits tell me."
posted by Joe in Australia at 10:13 PM on January 25, 2015 [1 favorite]


That makes sense, thanks. Wikipedia: 10 Downing Street, colloquially known in the United Kingdom as "Number 10"
posted by Standard Orange at 10:33 PM on January 25, 2015


no one died and nothing was destroyed.

Only Christmas, dude.

'East Coast Main Line - you won't actually die'.
posted by Segundus at 1:33 AM on January 26, 2015 [2 favorites]


Hofstadter's Law:

That is probably where I heard this. I was just recently thinking I should read GEB again.
posted by charlie don't surf at 7:23 AM on January 26, 2015 [1 favorite]




This thread has been archived and is closed to new comments