If You Have Raw Feelings Related to Recent Fires, This Could Be Rough
June 9, 2023 8:00 AM

Fire escapes are a hacky bit of afterthought tacked on to the outside of a building after the building is finished. If you're using fire escapes, it's worth making them as good as possible, but you’ll prevent more fires if you build better buildings. Similarly, incident response is often a hacky bit of afterthought tacked on long after software is released. Again, great incident response can help you recover faster than if you don’t have it but… you’ll prevent more outages if you build better software. Finally, buildings have an extremely detailed fire code, but we don't really have an extremely detailed systems engineering code for software, and I think we should have. from The History of Fire Escapes
posted by chavenet (15 comments total) 19 users marked this as a favorite
 
Oh, this is definitely right in my interest zone. Thanks for posting it. [off to read]
posted by rmd1023 at 8:30 AM on June 9, 2023 [2 favorites]


i can't even look at this without wanting to burst into tears because it has been A Hard Week and in two days i make a 30min presentation on disaster recovery to the top that is a bit like standing outside a crater suggesting perhaps we could have a bucket of water. while on fire. please.

i am going to share this around work.
posted by dorothyisunderwood at 9:06 AM on June 9, 2023 [8 favorites]


Pretty cool - but unfortunately (IMO, of course) he doesn't realize that the fire improvements we've made have actually gone too far: they don't seem to provide more protection, and they make lots of other things worse. Every level of regulation you add brings more complexity, requires more experts to both create the rules and verify they continue to work, and requires auditors to verify they are used correctly and regularly.

All of these have downsides and costs, in addition to upsides.

In the fire area, fire departments absolutely refuse to use smaller trucks, so your road size in most urban areas is determined by fire trucks. Also there is real debate in the US about loosening the code to allow single-stair buildings vs the US requirement for dual-stair fire exit points, which is another reason why so many large buildings look the same. The US method doesn't lead to fewer fire deaths than single stair, at least at the statistical level, vs Europe.


All that would be fine, but the engineer should realize that wider streets mean less room for sidewalks/bikes/pedestrians, so whether car accidents kill more people than fires, and what the acceptable death rate for each is, is a real tradeoff.

Also, my personal opinion is fire escapes are cool, and are worth it to keep around just for the songs about them and movies featuring them.


Software engineering is no different. There are real tradeoffs between security, scalability, design cost, and maintenance, and being the guy that says more is better is cool - but realize that real tradeoffs exist. Multiple data centers for disaster recovery have costs, as does running regular failover tests on top of that to make sure it will work when it's necessary. Are those costs worth it vs funding new development? Is increasing your operational budget worth it, or is running a bit leaner, with more manual or "sucks if it is lost but not the end of the world" failover models, actually acceptable? It really depends on the product.
posted by The_Vegetables at 9:09 AM on June 9, 2023 [9 favorites]


She
posted by toodleydoodley at 9:26 AM on June 9, 2023 [14 favorites]


I'm coming off a week-long security assessment for our product. This article resonated as I work with the architecture folks and software dev team, as well as incident response and business continuity sides. I batch it all together and we get assessed yearly.

On one hand, we're a smaller shop and getting buy-in on security, redundancy, and so on is easier to mandate and implement. Conversely, I can also see the over-requirements to which we're beholden as well as our minimum meeting of requirements.

This article resonated less for the laws and regulations that were put in place and more as a reminder that, as an organization, we benefit from, and better protect our customers and users when, protections are inbuilt and not tacked on or added just for show.
posted by bacalao_y_betun at 10:14 AM on June 9, 2023 [2 favorites]


> Also there is real debate in the US about loosening the code to allow single-stair buildings vs the US requirement for dual-stair fire exit points, which is another reason why so many large buildings look the same. The US method doesn't lead to fewer fire deaths than single stair, at least at the statistical level, vs Europe.

The trendy single-stair discourse is a pet peeve of mine, especially since it tries to hand-wave the idea of safety precautions in the interest of aesthetics. This article by Kate Wagner does a better job of addressing it than I ever will, but here goes:

The idea that we don't need multiple points of egress in the US because Europeans get by just fine without them glosses over the fact that all other things are not equal. Not when it comes to building construction, or safety cultures, or financing, or emerging problems like lithium battery fires. The Bronx isn't a borough of Vienna.

The article in the FPP mentions a fire in December 2017 that killed 12 people and was, at the time, the deadliest in NYC in 25 years. That record stood for just four years until January 2022, when a fire in the Bronx killed 17 people. That fire took place in a high-rise building with a single scissor staircase that quickly turned into a chimney full of thick black smoke. All of the deaths in the 2022 fire were the result of smoke inhalation.

The stairwell wasn't the only problem but that's kind of the point. There was the busted heating system that caused residents to rely on space heaters, the self-closing doors that jammed open, a fireproof building designed for people to shelter in place despite the human instinct to flee danger, novel skip-stop architecture that made it difficult for firefighters to navigate the structure, architects who didn't bother to revisit the site for 40 years to see how their pet theories had played out in practice, a buildings department that lacks resources for robust enforcement, an immigrant community with relatively little political or economic power, the privatization of public goods, and of course perverse financial incentives and a widespread culture of corruption and impunity among building developers and landlords in NYC.

Getting back to the FPP, that's why redundancy is important. And just as it's important to understand the costs associated with implementing a safety feature, it's also important to understand the environment in which that feature is situated and the historical reasons for its addition. Less isn't always more and redundancies can save lives.

Speaking of which: if the presentation in the FPP had been made just a year or two later, it almost certainly would have mentioned the 737 MAX under the section about software-related deaths, and perhaps used it as an example of what can happen when software is used to jury rig a bunch of non-software corners that have been cut. Because sometimes a single patch is a poor replacement for overlapping safety systems.
posted by evidenceofabsence at 1:36 PM on June 9, 2023 [12 favorites]


I agree with the gist of the article, but every time I read something in this vein about software the author ignores that the metaphor between physical engineering and software only goes so far.

Some of the gaps in using construction as a metaphor:
- they don't develop new raw construction materials every year
- they don't develop new ways of using those materials to build structures every few days
- there's a limited number of types of structures that have failure modes to examine
- the construction of a building can generally be inspected and its condition understood fairly quickly by a handful of people
- same for whether said building meets existing code requirements

It might be nice if we could put a moratorium on new software constructs until those issues get sorted out (like they do on certain types of construction from time to time), but that ship has not only sailed, it's already sunk, been raised, scrapped, recycled and turned into a new ship several times over.

As much software as there is in the world, I think we're still far from it being mature enough that the forms are basically understood in the way construction is. Not sure we'll ever get there with the pace of change.

I don't know what the answer is, but I do know that there's no fire code for software that could be written that wouldn't be out of date by the time the final draft was published.
posted by Ickster at 6:41 PM on June 9, 2023 [1 favorite]


That argument goes both ways though Ickster. One of the things slowing down novel construction is the difficulty in getting it accepted into code (lenders and insurance companies care).

If there were fewer new libraries and languages in commercial software but the ones we had were more reliable, I would be happy with the trade off. It would be easy to play with novel constructs or work on getting them certified, compared to doing the same with building construction.
posted by clew at 10:05 PM on June 9, 2023 [2 favorites]


Slight derail but the Copan Building in São Paulo, Brazil has the most amazing and insane fire escape I've seen added to the outside of a building. I shudder every time I see it. Imgur links: 1 2 3.

I suppose having it on the outside would mean less chance of it filling up with smoke, but man, I can't imagine a few hundred people per staircase trying to push their way down in a panic.
posted by xdvesper at 10:23 PM on June 9, 2023 [4 favorites]


I once worked in a museum in London where the fire escape was made of wood, and as a bonus it was supporting the lightning conductor from the roof. Oh, and my office was a temporary structure on the roof, reached by a single narrow spiral staircase. And those weren't necessarily the worst things about the building.
posted by 43rdAnd9th at 2:28 AM on June 10, 2023 [3 favorites]


I am not a Site Reliability Engineer & merit no TLA suffix, but this topic is very much in or at least adjacent to my wheelhouse. The wheelhouse isn't a very useful place when you perforce must drift, but at least i know where i am trying to go, even if the observed path is a bit spaghettified.
I'm working on the outline of a "book" tentatively titled What Does It Mean To Trust A Machine? because i like rhymes. This book will serve as the introduction to the documentation of a software library that implements ideas about measuring the "trustworthiness" or reliability of software components, dependency chains and the like.

What surprises me about this talk is that, while the speaker claims that "we" (SREs or the software industry in general) lack such code(s) or authorities to promulgate and enforce them, I can confidently state that such authorities do exist, and are about as navigable and effective as the ones that apply to construction. There's lots of room for improvement in the regulation of both industries, though in my opinion the major barrier to such improvements isn't the Wicked Problems inherent in codifying "better buildings" or "better software", but the immunity of large operators, market-making vendors and regulatory captors from any regulation.
posted by Rev. Irreverent Revenant at 11:51 AM on June 10, 2023 [4 favorites]


So I'm probably forwarding this to a colleague or two in spite of some of the shortcomings mentioned above.

I used to live in some German student housing that had 10 floors and only an escalator that was built before I was born and a concrete stairwell. I only lived on the 5th floor, but I looked at the failure modes and found a window I could open and bought myself some rappelling rope. I never needed to use it and I agree that it would only have worked for me, not panicky or less rappelling-experienced others (maybe not even panicky me).

Similarly, a few years back some critical system at work died and we only found out because users came complaining to us. We had monitoring in place, but I found out that they were looking at a LOT more systems: in fact, if our whole subdivision had failed all their checks, apparently, it would have been a 10x10 pixel red sector somewhere on their screen. And I still find out that some of our changes, like new systems, were never reflected there.

So I bought code rope. There's now a very silly powershell *) script monitoring several hundred URLs and hard disks and even system cpu and memory loads that'll send email and update a website. It's SPECIFICALLY crafted to send mail as little as possible but I'm fairly sure almost everybody but me has those redirected directly to trash.
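For anyone wondering what "send mail as little as possible" can look like: this is not the actual script, just a minimal Python sketch of one way to do it, assuming you only notify when a check changes state (healthy to failed, or back) and batch everything from one run into a single message. `QuietMonitor`, `checks` and `notify` are made-up names for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class QuietMonitor:
    """Runs named checks and notifies only when a check changes state."""
    checks: Dict[str, Callable[[], bool]]   # name -> function returning True if healthy
    notify: Callable[[str], None]           # hypothetical hook, e.g. something that sends mail
    last_ok: Dict[str, bool] = field(default_factory=dict)

    def run_once(self) -> List[str]:
        messages = []
        for name, check in self.checks.items():
            try:
                ok = bool(check())
            except Exception:
                ok = False  # a check that blows up counts as a failure
            previous = self.last_ok.get(name, True)  # assume healthy before the first run
            if ok != previous:
                messages.append(f"{name}: {'RECOVERED' if ok else 'FAILED'}")
            self.last_ok[name] = ok
        if messages:
            # One batched notification per run, not one per check.
            self.notify("\n".join(messages))
        return messages
```

A check that stays broken produces one mail when it breaks and one when it recovers, nothing in between. Whether that is quiet enough to keep people from filtering it to trash is a separate problem.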

But my real goal is to make it so software will fail more elegantly.

There's a school of thought that goes: if something goes wrong, fail as hard and early as possible, so software engineers will fix it and keep it from ever happening again.

Opposed to that, my own school of thought is that if you throw my software out of a plane, it'll raise screaming hell all the way down, but if it survives the landing it'll climb right into the next plane and go again. And we have somewhat decent build pipelines with somewhat decent automated tests and peer review.

Split things into parts that may or may not fail. If they fail, report, then see whether you can still do the next thing or not. Sometimes it's even still fine not to climb into that plane at all, to stop working entirely, e.g. when during startup a parachute or config is missing or wrong.
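A rough Python sketch of that idea, under my own assumptions: each part is a step that can fail on its own, every failure gets reported, and only steps marked critical (the missing parachute or config) stop the whole run. `Step`, `run_steps` and `report` are hypothetical names, not from any real library.

```python
from typing import Callable, List, NamedTuple


class Step(NamedTuple):
    name: str
    action: Callable[[], None]
    critical: bool = False   # if True, a failure stops the run


def run_steps(steps: List[Step], report: Callable[[str], None]) -> List[str]:
    """Run each step in isolation; return the names of the steps that completed."""
    completed = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            report(f"{step.name} failed: {exc!r}")
            if step.critical:
                report(f"stopping: {step.name} is required")
                break
            continue   # non-critical: carry on with the next step
        completed.append(step.name)
    return completed
```

The screaming happens in `report`; the climbing back into the plane is the `continue`.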

Now, I think I'm doing ok on 1 prevention and 2 detection but I'll be thinking about 3 isolation and 4 response (esp. since I'm not a dedicated firefighter, much unlike some of my colleagues). And find out whether we have SREs.

*) I kinda regret that but ok. Someone challenged me and I have trouble resisting challenges. Oh well, I hear you can run powershell on linux now.
posted by flamewise at 12:15 PM on June 10, 2023 [3 favorites]


I dove into this talk when I realized it was another from the author of the fantastic talk "Being Glue", which I think everyone—especially in tech—should read.

The most resonant thing to me about this—and I've seen this endlessly in my career in software—is that people who are good at (and willing to do) firefighting are highly valued by organizations, whereas people who take the time to build fireproof software are more likely to get grief for not moving fast enough. "Hey, remember all those outages we didn't have?" is a tough sell in a performance review.

In 2017 I was at a job where SVB was a single point of failure for a business critical function and I exhorted our leadership to sign off on a project to add automatic fail-over, but was told that SVB would never fail, and that if it did, it wouldn't be our business getting blamed. Two jobs later, there was national news coverage of that specific company's peril in the face of the SVB collapse and I saw all the #hugops posts praising the heroic effort of the team to implement what I had suggested 5 years prior over the course of the weekend before the Fed bailed SVB out. Had we done what I suggested and I had stayed at that company, maybe I'd have gotten some kudos, but I doubt it, since everyone who told me "no" at the time is long gone. IBGYBG may not be an explicit mantra in software, but it's sure the reality in my experience.

Honestly, it's a tough problem. People are naturally attuned to salient and exciting factors, but when it comes to creating reliable software, my lodestar is Prop Joe: "keep it dead fucking boring".
posted by Cogito at 2:49 PM on June 10, 2023 [3 favorites]


> The idea that we don't need multiple points of egress in the US because Europeans get by just fine without them

My vague understanding is that Europe still requires a second point of egress, it's just that the code allows for the window to count as that second point as long as it's within reach of your local fire department's ladders. In which case, anyone saying "we don't need multiple points of egress" has badly misunderstood their own argument and shouldn't be trusted.
posted by vibratory manner of working at 11:26 PM on June 10, 2023 [2 favorites]


Cogito: The most resonant thing to me about this ... is that people who are good at (and willing to do) firefighting are highly valued by organizations, whereas people who take the time to build fireproof software are more likely to get grief for not moving fast enough. "Hey, remember all those outages we didn't have?" is a tough sell in a performance review.

I'm against firefighting; self-recovery is the goal. We have metrics for failed deployments and rollbacks plus cadence of new releases (e.g. DORA metrics) which you can bring to a performance review: a low incident count and high rate of releases should be something your organisation can value ...or they don't value reliable systems.
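If it helps to have numbers in hand for that review, here's a minimal Python sketch of two of those metrics, change failure rate and deployment frequency, computed from a plain list of deployments. The `Deployment` record and the definition of "failed" (caused an incident or was rolled back) are my assumptions; real tooling will define these in its own way.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Deployment:
    day: date
    failed: bool   # assumption: True if it caused an incident or had to be rolled back


def change_failure_rate(deploys: List[Deployment]) -> float:
    """Fraction of deployments that failed (0.0 when there are none)."""
    if not deploys:
        return 0.0
    return sum(d.failed for d in deploys) / len(deploys)


def deploys_per_week(deploys: List[Deployment]) -> float:
    """Average deployments per week over the span from first to last deployment."""
    if not deploys:
        return 0.0
    days = [d.day for d in deploys]
    span_days = (max(days) - min(days)).days + 1
    return len(deploys) / (span_days / 7)
```

Four deployments spread over one week with one rollback would come out as 4.0 per week and a 0.25 failure rate.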

Some places need you to look busy, but I've got a list as long as my arm for engineering improvements that I'm implementing while we make the system stable and release at high cadence. If I'm not busy, I'll get bored and take my institutional knowledge out of the organisation.

To address the thread's link -- good content, how do we publish a checklist of patterns and paradigms that make it cheap to get cybersecurity insurance if your build pipelines record evidence that a given release ticks boxes on the checklist (because doing this manually is a bullsh_t job)..?
posted by k3ninho at 2:14 PM on June 11, 2023 [4 favorites]

