A Junior Software Developer has a really bad first day on a new job
June 3, 2017 3:37 PM

Fortunately, r/cscareerquestions has their back. (reddit)
posted by chaoticgood (75 comments total) 32 users marked this as a favorite
 
If your newest hire can accidentally kill prod on their first day, that's not the hire's fault, that's the company's.

This kid accidentally dodged a serious career bullet. Getting fired from a job where (1) you can do that at all, and (2) they reacted the way they did was the best thing that could have happened to him.
posted by mhoye at 3:41 PM on June 3, 2017 [53 favorites]


That is a *bad* first day ...

I had a vendor's tech once remote in and delete the database. (He was supposed to be helping me figure something out, and I wasn't quick enough to disconnect him when I saw what he was doing.) Fortunately it just meant all the users had to reset their passwords on the next access.
posted by oheso at 3:43 PM on June 3, 2017 [1 favorite]


Why would you use production values instead of dummy values in your docs?!?!?!? Also, the CTO is a colossal asshole.
posted by Foci for Analysis at 3:44 PM on June 3, 2017 [13 favorites]


From the thread: Humans are just apes with bigger computers.
posted by oheso at 3:45 PM on June 3, 2017 [5 favorites]


Small businesses and startups are like little bastions of crazy.

The number of companies that put credentials like this in Google Docs is shocking. Putting these creds in an onboarding doc is lunacy.

The interesting thing here is how hard this can be to explain to a certain type of business person. People who aren't technical often don't see the process issue in situations like this. They think "if the employee wasn't careless this wouldn't have happened, it hasn't happened up to now, and we're a big success!" These are typically people who are good at some other field that doesn't depend on process for success.

Obviously, in an ideal world the CTO of that company would be fired ASAP, but if it's already this bad that's doubtful.

Fortunately for the poor sap this happened to, any reputable employer would hear this story, laugh, and not hold it against them as an applicant.
posted by sp160n at 3:48 PM on June 3, 2017 [9 favorites]


"instead of copying the values outputted by the tool, i instead for whatever reason used the values the document had"

I have witnessed a seasoned software architect do the same sort of thing. Oh, and at least four senior MTS types. Funny thing was that the document had, in big bold letters, DO NOT USE THESE VALUES. EXAMPLE ONLY.
posted by Jessica Savitch's Coke Spoon at 3:54 PM on June 3, 2017 [2 favorites]


I don't know if anyone else uses Terraform, but when you tear down everything, it prints these log messages about destroying "data resources", i.e. things it didn't create, like the AWS security groups that let pieces of your production infrastructure talk to each other. And since it's irritatingly easy to get Terraform into a state where your best option is to tear everything down and start over, I've made more than one trip to the AWS console to reassure myself that I hadn't just brought down all of production.
posted by hoyland at 3:54 PM on June 3, 2017 [2 favorites]


Just recently started my first software engineering internship and have been beating myself up over small things (mishearing someone's name, pushing a door that's supposed to slide, being generally too quiet and awkward and reserved...)

Nice to remember that it really could be worse.
posted by perplexion at 4:07 PM on June 3, 2017 [13 favorites]


I am not a computer engineer or software magician, but what you're telling me is that this big important computer thing was put into the hands of an entirely new person with a big ol SELF DESTRUCT button on it?
posted by FirstMateKate at 4:13 PM on June 3, 2017 [24 favorites]


Not only that, they essentially verbally told him "don't push this button" but the instruction document they handed him said, "PUSH THE BUTTON."
posted by Justinian at 4:14 PM on June 3, 2017 [46 favorites]


It didn't even need a big, red, candylike button.
posted by rhizome at 4:14 PM on June 3, 2017 [6 favorites]


These people are idiots who deserve to have their fucking database wiped by a newbie. How the hell did he have access rights to a production database?! I've been on my current job for four years as a test engineer and I don't have the keys to the production servers and I don't want them.
posted by octothorpe at 4:26 PM on June 3, 2017 [23 favorites]


Wow. I wonder if this will somehow end up being a benefit to this poor guy.
posted by bz at 4:37 PM on June 3, 2017 [1 favorite]


I wonder if this will somehow end up being a benefit to this poor guy.
It definitely is in the sense that the junior developer won't have to work for a terrible, terrible company. Because that level of incompetent and obnoxious can't happen in a vacuum. Plus, it will be a funny story some day.
posted by ArbitraryAndCapricious at 4:47 PM on June 3, 2017 [3 favorites]


It definitely is in the sense that the junior developer won't have to work for a terrible, terrible company.

Also, junior developer is only junior developer for now. Decades pass--they really do. Junior developer has a good chance of someday being a lead type who--here's hoping--will have the sensitivity to read documentation and say, 'Yo, how about we *don't* include production credentials in the documentation that goes to everyone who has been here 24 hours' and will have the decency to be kind to whatever mistake someone else makes, new and raw, trying to do their best.

There are flagrantly dumb, I-don't-give-a-fuck mistakes, and mistakes made out of hubris, and then there are mistakes where you can see how someone got there, and jeez, at some point you need to know the difference. Devs have a lot of responsibility and a lot of pressure and are answerable to people who don't necessarily even know entirely what they do. At some point, you have to recognize your forest is built of humans and you'd better have a decent idea of what guides the actions of individuals, i.e., don't give the kid your prod DB passwords and access, ya dummy.
posted by A Terrible Llama at 5:02 PM on June 3, 2017 [28 favorites]


r/w/d rights are a privilege, not a right. They need to be earned.
posted by Samizdata at 5:16 PM on June 3, 2017 [6 favorites]


Welcome to NORAD. Your keys will look just like these keys.
posted by adept256 at 5:30 PM on June 3, 2017 [7 favorites]


Ah, yes, do not hit this button. Reminds me of a fun story from this Wired article about the amazing Margaret Hamilton.

(The computer they are talking about is the command module computer on Apollo. You know, the important one.)

"As she toyed with the keyboard, an error message popped up. Lauren had crashed the simulator by somehow launching a prelaunch program called P01 while the simulator was in midflight. There was no reason an astronaut would ever do this, but nonetheless, Hamilton wanted to add code to prevent the crash. That idea was overruled by NASA. “We had been told many times that astronauts would not make any mistakes,” she says. “They were trained to be perfect.” So instead, Hamilton created a program note—an add-on to the program’s documentation that would be available to NASA engineers and the astronauts: “Do not select P01 during flight,” it said."

Ahhhhh, can you guess what happened?

"Right around Christmas 1968—five days into the historic Apollo 8 flight, which brought astronauts to the moon for the first-ever manned orbit—the astronaut Jim Lovell inadvertently selected P01 during flight."

Whoops. But what did this actually do?

"Launching the P01 program had wiped out all the navigation data Lovell had been collecting. That was a problem. Without that data, the Apollo computer wouldn’t be able to figure out how to get the astronauts home. "

Eek! Fortunately, Hamilton and the MIT team managed, after a very tense nine hours reading the code, to work out how to upload the data again and everyone came home.

Long story short, if some of the most highly trained steely-eyed missile-men in the world can inadvertently fat-finger a command that risks stranding them away from their home planet, then maybe you don't hand the production DB to a first-day dev.
posted by nfalkner at 6:03 PM on June 3, 2017 [157 favorites]


The number of companies that put credentials like this in google docs are shocking. Putting these creds in an onboarding doc is lunacy.

Yeah, I wasn't clear on the size of the company here, but having been a junior dev at some really small, seat-of-the-pants ones, hearing people say "I've been at my job six years and they only just let me smell the production password" makes me chuckle sadly. But even at those places we had some kind of backups. And I haven't seen new people handed access to the real-deal server without being told what it is - more like new people being handed access to the real-deal server and told "it's all yours, don't do anything dumb or we'll kill you."
posted by atoxyl at 6:06 PM on June 3, 2017 [2 favorites]


Long story short, if some of the most highly trained steely-eyed missile-men in the world can inadvertently fat-finger a command that risks stranding them away from their home planet, then maybe you don't hand the production DB to a first-day dev.

Yup. The problem with developing idiot-proof tools is they keep making better idiots.
posted by Samizdata at 6:06 PM on June 3, 2017 [3 favorites]


I have been a geek for 20 some odd years now and I STILL don't want (or need) prod access. Serves them right for being jerks with no backups.
posted by LuckyMonkey21 at 6:08 PM on June 3, 2017 [4 favorites]


Letting the new guy drop the prod database is dumb but it's not the worst thing about this story. Firing him is.

Reacting like this is how you guaran-damn-tee that the next newest person isn't going to tell anyone when they fuck up; they'll just panic and scramble to fix it quietly. They might even do something risky to fix it since, hey, they're already fucked if someone finds out. Besides multiplying the risk of one of those mistakes or fixes mushrooming into a truly colossal fuckup, you end up fostering a culture of hiding problems.

This means problems don't get fixed. It means the senior devs don't find out how risky their stack is, management doesn't get any signal about how precarious everything is, and for sure every little panicked fix is making the whole thing more dangerous.

And then you realize, as you're wondering why all those stern looking folks with guns are rushing around in fashionable windbreakers, that hiding technical fuckups isn't too far from hiding financial fuckups. Or ethical fuckups.

If you're working with real things instead of just computers, this is even more important. Let me quote from the U Bristol 'accidental explosive' report (emph mine):
"...having realised what had happened, the graduate student immediately took the action needed to mitigate a potentially dangerous situation, rather than delaying or, worse, trying to cover it up. This was highly responsible — the most important thing done — and shows the value of investing in developing and fostering a culture in which colleagues recognise errors and misjudgements, and they are supported to report near misses."
posted by Skorgu at 6:21 PM on June 3, 2017 [73 favorites]


Yeah I wasn't clear on the size of the company here

The dev posted a comment that said it was "not really a small company, dev team is around 40+ people."
posted by ndfine at 6:30 PM on June 3, 2017 [3 favorites]


I find it amazing that their production database was configured such that deleting rows also deleted them from their rotating, off-site backups.
posted by bigbigdog at 6:45 PM on June 3, 2017 [6 favorites]


Holy christ, if this had happened at my last job, multiple people would have been fired. I seriously don't understand how this level of incompetence is possible. I get why the OP can't name and shame but I wish they could hint so others could avoid them.
posted by AFABulous at 7:00 PM on June 3, 2017 [3 favorites]


I do not understand people's cavalier attitudes with backups, and so really don't understand companies. I am self-employed, not a programmer of any kind, and I have multiple backups of every drive, and critical stuff on a cloud backup. I have drives all over the room. Right now I'm worried because when I update my OS I pull the drives and buy a new set to install, and that, along with getting a new computer, means I only have 1 backup of my current system drive until I go to the store and get another drive. 1 backup! That's nothing!

I am totally jinxing myself right now.
posted by bongo_x at 7:03 PM on June 3, 2017 [5 favorites]


Early in my career at cisco I accidentally disabled all security across all platforms in the code. There was immediate talk about firing me but after I explained how it happened that changed to "That was sloppy but understandable. Don't do it again."

I feel like it was a pretty big plus to my career. It resulted in a close code review that put me on the radar of some very senior engineers, and less than a year later people had forgotten the incident but vaguely remembered I knew something about security.
posted by Tell Me No Lies at 7:35 PM on June 3, 2017 [17 favorites]


As one of the greybeards in my team, I frequently mentor new hires and get them up to speed.

At first I thought my job was to be a technical mentor, but no matter how thorough I tried to be, new hires kept making mistakes, trying to fix them or cover them up, and we all were worse for it.

I remember how it was for me at first. Impostor syndrome, fear of embarrassment every time I did not understand an explanation, fear of asking questions. Constantly in fear of making a mistake and being fired or, even worse, humiliated.

Hell, I spent three months trying to fix a mistake I made, sure I was going to get fired if I asked for help. I succeeded, but it sucked.

So now my focus is to make sure that new hires understand that it is good to ask questions. It is good to ask for clarification. It is good to ask for help as soon as possible. There are no stupid questions and there are no stupid ideas to discuss.

It is hard to get people to feel safe asking for help, so I try to lead by example. Even if I think I know the answers to questions, I say 'I am not sure, but I'll help you figure it out or find the right person to ask'. In meetings and presentations I play the fool (sometimes I am the fool), asking for clarifications and examples. I am the first one to talk about my own screw-ups in meetings. I tell the new hires that mentorship is one of my performance metrics: the more questions I can answer and the more fuckups I can help fix, the better my bonus.

At first it was bad for me in terms of reputation and peer reviews, but I think people are starting to see the benefits, and I get good reviews from the newer hires.

So to confirm for the millionth time, the company here was stupid and the CTO an asshole.

If any junior dev is reading this, PLEASE don't hesitate to ask for help and admit mistakes. If it's a place worth working for, you won't get in trouble* and you will make friends and allies (being able to show vulnerability seems to attract the right kind of people).

* Unless it was not an innocent mistake and you abused privileges, intentionally faked credentials, leaked confidential information, etc...
posted by Dr. Curare at 8:01 PM on June 3, 2017 [40 favorites]


I'm partly responsible for mentoring a junior dev just now and this thread has been great. Agree completely you can't afford to live in fear of mistakes. I might send him to this thread at some point just so he gets the benefit of all these great experiences and reflections. Thanks again, MeFites!
posted by saulgoodman at 8:08 PM on June 3, 2017 [3 favorites]


With regard to Skorgu’s comment, the term of art is “blameless postmortem”. The goal of the exercise is to understand the processes that led to the failure, mitigate any fallout, and prevent the problem from happening in the future. As the name would indicate, the objective is not to point fingers or punish anybody.

The way I see it, the main axioms of a healthy postmortem culture are
  • failures are the result of systemic problems; individual errors are only the proximate cause
  • people are prone to making errors; this is not an “if” but a “when”
  • because people will always make errors, we should learn from them
As a result, the idea becomes that you should design systems to prevent these errors from happening, and when mistakes do cause failures, you can learn from them to prevent the same error from causing a failure in the future. Because there’s no blame, an individual can say “I ran Command A rather than Command B” and in the future the situation can be made safe rather than that person fearing reprisal.

(Another important prerequisite is, essentially, “We’re All On The Same Team”—that everyone involved was operating in good faith and without malice; for example, nobody wanted the system to go down. In situations where the incident was the result of sabotage, this doesn’t work as well, but you can still look at the postmortem from the perspective of the rest of the team, with the goal changing to “how can we ensure that one person cannot cause this much damage”. I’m less familiar with this type of situation.)
posted by reluctant early bird at 8:13 PM on June 3, 2017 [23 favorites]
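To make the "design systems to prevent these errors" axiom concrete: the cheapest guardrail for the scenario in the original post is a destructive tool that refuses to point at anything production-shaped unless the operator insists. A minimal sketch in Python; every name here (the hostname heuristic, the reset function, the flag) is illustrative, not anyone's real tooling:

```python
#!/usr/bin/env python3
"""Guardrail sketch: refuse to run a destructive reset against anything
that looks like production unless the operator explicitly opts in.
All names here (PROD_HINTS, reset_database, the flag) are illustrative."""

import argparse
import sys

# Hostname fragments that suggest a production system (illustrative).
PROD_HINTS = ("prod", "production", "live")


def looks_like_prod(host: str) -> bool:
    """Cheap heuristic: does the hostname smell like production?"""
    return any(hint in host.lower() for hint in PROD_HINTS)


def reset_database(host: str) -> None:
    """Placeholder for the actual destructive work (drop/recreate schema)."""
    print(f"resetting database on {host} ...")


def main() -> int:
    parser = argparse.ArgumentParser(description="Reset a dev database.")
    parser.add_argument("--host", required=True, help="database host to reset")
    parser.add_argument(
        "--i-know-this-is-prod",
        action="store_true",
        help="required to touch anything that looks like production",
    )
    args = parser.parse_args()

    if looks_like_prod(args.host) and not args.i_know_this_is_prod:
        print(
            f"refusing to reset '{args.host}': hostname looks like production.\n"
            "Re-run with --i-know-this-is-prod if you really mean it.",
            file=sys.stderr,
        )
        return 1

    reset_database(args.host)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

A check like this is crude, but it turns "ran Command A with the wrong credentials" into an error message instead of an outage, which is exactly the kind of fix a blameless postmortem tends to produce.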


Where the fuck was the backup? Oh, that's why he got fired - CTO realized his team didn't have functional backups so it became the JD's fault.

* I once deleted the transaction logs from an exchange server, then changed the date accidentally. That was fun.
posted by disclaimer at 8:15 PM on June 3, 2017 [1 favorite]


Blameless postmortems are great. Another feature of them is discussing "what went right." Like it really sucks when something breaks, but it's nice to know that your monitoring/alerting/paging infrastructure all works great and you can point out that response times to the problem were measured in minutes.

I've been programming for 20 years and obviously I still make mistakes. I've had a few bad ones just this year. I think the biggest change as you mature is that your mistakes require a cascade of failures to trigger them. You get good at catching the obvious potential problems, and then pretty good at catching problems that require another system to fail before they can be a problem and so forth, but there are always going to be problems that are triggered by N mistakes where N is still finite.

It's sort of a measure of pride now to see how many concurrent fuckups are required to cause a systemic failure.
posted by RustyBrooks at 8:42 PM on June 3, 2017 [7 favorites]


Oh yes. In mumble mumble years of programming I still have not learned how to do it the right way, but I've become very very good at telling when we are doing it the wrong way.

I make huge mistakes at a slower rate now, so it takes me longer to learn :)
posted by Dr. Curare at 8:54 PM on June 3, 2017 [1 favorite]


In my first summer internship during university, I was hired into a project under the federal government within a ministry that isn't really relevant, involving distributed computing and what were fundamentally FORTRAN-based models that scaled well across multiple machines. Pedestrian now, but it was pretty radical at the time, especially in a federal organization. They needed a student who knew Perl, because simple scripts were what was running their job queue and merging of results.

For the most part, it was a great few months, but I did manage to wipe out a week's worth of simulations by accidentally running a test of a script using the settings for the production cluster. Everyone was very understanding, but I did bring in coffee for the team for a week. I also wrote the safeguards to make sure it couldn't happen twice.
posted by figurant at 9:11 PM on June 3, 2017 [3 favorites]


> r/w/d rights are a privilege, not a right. They need to be earned.

They are a hazardous substance and they need to be contained to avoid accidental spills or explosions. When a hazardous substance causes damage, it's no solace to think that the people involved were more senior.

The culture of macho privilege and survivorship bias in companies like the OP's does nothing to prevent future accidents.
posted by Phssthpok at 9:20 PM on June 3, 2017 [8 favorites]




AFABulous: Imagine being the guy who set up BA's version of freaking Landru and didn't manage to put a decent battery back up system or redundant power supply in the thing.
posted by Grimgrin at 9:35 PM on June 3, 2017 [7 favorites]


I accidentally killed production once, many years ago - partly due to my own inexperience and partly because I was acutely aware that *we had no redundancy*, and was trying to fix that. Of course, 1 instance is better than 0, and that was a trying week.

This company is really stupid though. You don't let anyone except the steadiest, most senior people near production, let alone a new hire with no experience - and you sure as shit don't put production credentials in widely distributed docs. Bah. Everything about this is wrong. Bullet dodged for the new guy though - he doesn't actually want to work there.

I think the company gets double black marks as well, for not having a culture where it's safe to admit mistakes or confess ignorance. As others have pointed out that hurts performance over the long term as more and more things remain unaddressed.
posted by iffthen at 10:42 PM on June 3, 2017 [5 favorites]


Letting the new guy drop the prod database is dumb but it's not the worst thing about this story. Firing him is.

Yep, very much agreed with that.

Also, from the reddit thread (by, err, "kqgumby"):

That's a knee-jerk reaction on the CTOs part. In all honesty, it sounds like he's covering his own ass, especially with respect to blocking you on slack. He wants you out of the building so he can tell his story--not yours.

Exactly that, I think. And if the CTO was a normal mid-level manager he'd be less culpable, because he'd be partially bound by the company's dysfunctional culture as well, but he's a C-level executive, and sits at the top of the food chain.
posted by iffthen at 11:08 PM on June 3, 2017 [6 favorites]


First of all, my bad, should have used a more inclusive term.

Second, what makes you think only white males can have grey beards?
posted by Dr. Curare at 11:41 PM on June 3, 2017 [10 favorites]


I think "greybeards" is an illustrative term to make the point of seniority.

I am neither gray nor male.
posted by maurreen at 12:24 AM on June 4, 2017 [4 favorites]


They are a hazardous substance and they need to be contained to avoid accidental spills or explosions. When a hazardous substance causes damage, it's no solace to think that the people involved were more senior.

The culture of macho privilege and survivorship bias in companies like the OP's does nothing to prevent future accidents.


No, but seniority helps guarantee that the people granted the rights understand the systems involved and how they are used.

Yes, mistakes happen. But training and experience can help mitigate some of those.
posted by Samizdata at 12:29 AM on June 4, 2017


Where the fuck was the backup? Oh, that's why he got fired - CTO realized his team didn't have functional backups so it became the JD's fault.

Agreed. Investors are going to be asking some pointed questions about engineering practices at the company. Maybe a sacrificial lamb will satisfy them.
posted by Tell Me No Lies at 12:32 AM on June 4, 2017


his team didn't have functional backups

Functional backups. This is an important point a lot of people are missing. The quote from the original post was "I kept an eye on slack, and from what i can tell the backups were not restoring". They thought they did have backups. But restoring backups is a giant pain in the ass,* so they didn't bother testing it. And maybe it didn't occur to them until now that if you don't test your recovery process, you don't know for sure whether you actually have backups.

They thought their whole backup situation was a solved problem, just like their new hire documentation.

* I'm only an IT enthusiast, not a professional. I have no idea how you test restoring production backups. Do you need, like, an entire spare production-grade environment to restore it to? Aren't those huge and expensive?
posted by reprise the theme song and roll the credits at 1:00 AM on June 4, 2017 [1 favorite]


Thanks, that's pretty interesting, even - maybe especially - to someone who doesn't work in that field.
posted by nicolin at 1:28 AM on June 4, 2017


> And maybe it didn't occur to them until now,
> that if you don't test your recovery process, you don't
> know for sure whether you actually have backups.

That's bad too. It's axiomatic that anything not tested is broken, and regular testing of backups should be SOP.

> Do you need, like, an entire spare production-grade environment to restore it to?

Pretty much

> Aren't those huge and expensive?

Yes. That's part of your operating cost. One of the giant advantages of cloud operations is that you can create these for just as long as you need them rather than having to keep one going all the time.
posted by merlynkline at 1:57 AM on June 4, 2017 [5 favorites]
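To make the restore-testing point concrete, the cheap end of the spectrum doesn't even need a full production-grade clone. A minimal sketch, assuming a PostgreSQL setup (the thread never says what this company actually ran), a scratch server you're allowed to trash, and made-up paths and table names:

```python
#!/usr/bin/env python3
"""Restore-test sketch: load last night's dump into a throwaway database
and run a sanity query. Assumes PostgreSQL client tools on PATH; the
scratch database name, dump path, and sanity query are all illustrative.
Connection details (host, user) come from the usual PG* environment
variables."""

import subprocess
import sys

SCRATCH_DB = "restore_test"
DUMP_FILE = "/backups/nightly/latest.dump"   # illustrative path
SANITY_SQL = "SELECT count(*) FROM users;"   # illustrative table


def run(cmd: list[str]) -> str:
    """Run a command, fail loudly, return stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def main() -> int:
    # Start from a clean slate on the scratch server.
    subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
    run(["createdb", SCRATCH_DB])

    # The actual test: does the backup restore at all?
    run(["pg_restore", "--no-owner", "--dbname", SCRATCH_DB, DUMP_FILE])

    # And does it contain something plausible once restored?
    rows = run(["psql", "-d", SCRATCH_DB, "-At", "-c", SANITY_SQL])
    if int(rows) == 0:
        print("restore 'succeeded' but the users table is empty", file=sys.stderr)
        return 1

    print(f"restore OK: {rows} rows in users")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Running something like this nightly, and alerting when it fails, is nowhere near a full DR exercise, but it would have surfaced "the backups were not restoring" long before a first-day developer did.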


Veteran production DBA here. Number one WTF: production credentials in a newbie doc, and no access controls. Number two nightmare: bad backups. That poor person - talk about a disaster waiting to happen, they innocently trotted right into it.

reprise the theme song and roll the credits: ideally, yes, and not necessarily - databases don't have to be big to be critical. Depends if you want to test recovering the whole system end to end - which might involve filesystem backups as well as multiple databases, all synchronised - or whether you just want to make sure your backups will restore. At minimum, many databases have a quick utility "is this a good backup file" check. I'm pretty old school and nowhere I've worked has allowed cloud access (security worries) but these days for most people there's really no excuse, as merlynkline points out.

It's astonishing how hard it is to make non-tech management understand why DR (disaster recovery) testing is important. I've found an effective message when facing "but it costs so much why are we buying a box that 'does nothing' / couriers to take backups offsite / time and resource in DR testing" is "how much is your business worth to you? If it's down an hour? A day? Six weeks?"

If your umbrella is important to you, you buy commensurate umbrella insurance. Same thing. And you don't give it to your brand new puppy as a chew toy.
posted by Ilira at 2:12 AM on June 4, 2017 [2 favorites]


I'm surprised nobody is calling hoax, as in the guy who "accidentally deleted his entire company" a few months ago that turned out to be BS....
posted by mrbill at 3:05 AM on June 4, 2017 [1 favorite]


mrbill:
I'm surprised nobody is calling hoax, as in the guy who "accidentally deleted his entire company" a few months ago that turned out to be BS....
This is someone who was following documentation, and copying things from the documentation into their tools. They then did a full reset, believing they were resetting their work environment only. These are actions people go through every day, whether they are new to a company or just need to set up a clean environment. No reason to call hoax at all.

All of us who work or have worked in IT know that this entire thing is more than possible. I would even bet that most of us have seen companies mess up database backups and/or credential management. It is not about taking down a company, it is about how easy it is to get the simple things wrong.
posted by Martijn at 3:59 AM on June 4, 2017 [4 favorites]


It's so reassuring to read all the recognition of the importance of a healthy work culture. Psychological safety in teams makes a huge difference.

It's astonishing how hard it is to make non-tech management understand why DR (disaster recovery) testing is important.

FTFY. I go to companies and audit/analyze (whichever word they prefer) their existing test systems, then draw up and implement testing process improvements (aka test policies/strategies). The latest fad for reducing the cost of testing on the part of management is automation. There's automation, and then there's well-done automation. If you don't have a lot of time and need a quick feel for a company's testing culture, and they've mentioned automation (say automated unit tests, so we can assume the general coverage is already known), the questions to ask are: "Great, automated unit tests. What do they test, in more detail? Are the automated tests maintained?" Then watch 80% of faces first go blank and then go ash white. Clicking a button to run a test is cheap, yes. Unfortunately, if no one knows what it's testing, you might as well be flying paper airplanes around the office.

Anyway, even besides the automation question, managers who recognize testing as a need are still pretty rare, and beyond that, it takes more and more time to bring them to an understanding of how money spent on testing is an investment in quality. All the test analysts and experts I know have as much knowledge of business analysis as we do of testing, from the sheer need to show clients as quickly as possible that they're saving money by improving their testing. Design and tech teams understand it implicitly, at least.
posted by fraula at 5:13 AM on June 4, 2017 [5 favorites]


I have no way to know if it IS a hoax, but as I'm working for a company that is scaling up, it's entirely plausible. One of the things that happens as companies scale is they fail to review things that were (almost unavoidably) seat-of-pants when they were smaller, and sometimes they fail to check whether the people responsible have grown with the responsibilities.

This document the New Guy was reading from may have been written during a Doritos-and-Dr-Pepper fueled all-nighter, back when the dev team was 2 people and there were <10 paying customers. And not paid much attention to since; it's just the document that newbies use when setting up. 30+ newbies read it and DIDN'T make the mistake, and didn't feel like mentioning the security problem to their new boss on their first day or two of the job, and never looked at the document again, not even when they were handing it to the next new guy...

I'm betting the CTO was one of the 2 people. And he should have been reviewing policies and documentation, and somewhere along the way he should have hired someone to handle on-boarding of new people, or delegated that responsibility to someone. But there's usually something that seems more important to do.

Process management/systems thinking is what I do, and it's so boring. Until it's terrifying.
posted by randomkeystrike at 5:22 AM on June 4, 2017 [3 favorites]


I've been in screaming matches with colleagues over documentation. I think we each would have drawn different conclusions from that Reddit post. I would have used it as proof that budgeting time for correct and timely documentation should be mandatory. They would have used it as proof that anybody who needed an onboarding manual was too stupid or lazy to deserve to be paid to write code.
posted by at by at 6:01 AM on June 4, 2017 [4 favorites]


At the company where I work as a software developer, we're large enough that we have different teams that do different functions. In the past, the software dev VP, for example, would have just assumed that the other group responsible for doing backups of tools that the software dev organization uses was doing its job correctly.

I'm on the software infrastructure team, and we've been working hard to break down some of those walls a little. We succeeded in getting a new tool in place, and, importantly, a group of us software devs manage it ourselves, not the tools team. This arrangement provides us all kinds of advantages. My coworker left the company recently, and at least until we replace him, I'm responsible for making sure that this volunteer tool management team continues to meet and do its work.

A couple of weeks ago I asked our VP if he knew whether the mission-critical data in this tool was being backed up, and he answered like he always has: "I assume so." My blood went cold. (It turns out we manage the tool itself, but another team is still responsible for backups, which I wasn't sure about at the time.) So, at our first tool management meeting this past week, I made sure that the first item on our agenda was to verify that backups are being made, that members of the management group understand what's being backed up and how, and that we start the process of doing a practice restore to verify the backups. My coworkers on the team all concurred that this was high priority.
posted by tippiedog at 6:47 AM on June 4, 2017 [2 favorites]


The concept of structurally avoiding scapegoating is also a big part of Toyota's Five Whys, which was adopted into buzzword-bingo Six Sigma and Kaizen (though of course it's pretty easy to follow the letter and still end up with "did you follow the docs perfectly y/n?"-type questions and avoid all the actual hard parts).
posted by Skorgu at 7:02 AM on June 4, 2017 [5 favorites]


(On further thought)

Bad response: Firing the poor dude.

OK response: Fixing the docs, locking down the prod db, fixing backups.

Good response: "Hey, why don't we make 'restore a backup into a new dev environment' be part of the newb setup docs?" Now every time you start a new person you've validated (to some extent) your backups.
posted by Skorgu at 7:05 AM on June 4, 2017 [4 favorites]


Even without having worked in IT (not counting very simple stuff on the order of installing software and "have you tried rebooting"-level support), I can relate to this, as I've run into the combination of inadequate training and supervision and the responsible management covering their asses more than once. Hell, I've been at my current job for nearly fifteen years, and I'm still dealing with the aftereffects of the behavior of my first boss, who was extraordinarily toxic.
posted by Halloween Jack at 7:30 AM on June 4, 2017 [2 favorites]


Jessica Savitch's Coke Spoon: I have witnessed a seasoned software architect do the same sort of thing. Oh, and at least four senior MTS types. Funny thing was that the document had, in big bold letters, DO NOT USE THESE VALUES. EXAMPLE ONLY.

I've seen that happen at an old job too. Thanks to the miracle of Word formatting, the example and the warning were separated by a page break. A co-worker found out very quickly that page 21 said never to do the thing that page 20 just told you to do.

When I had a fuckup of my very own that could have cost our client a substantial sum, I asked a trusted colleague if I should offer my resignation, hoping that it would signal how responsible and serious I was about owning and fixing the error. I will not forget what she told me next: "Absolutely not. They may accept it, and where will that leave you? Let management write you up or do whatever they're going to do, but never offer yourself up like that."
posted by dr_dank at 7:45 AM on June 4, 2017 [6 favorites]


I'm surprised nobody is calling hoax

The difference between this scenario and the "deleted the whole company" thing from a while back is that this one is so plausible. This seems 100% like something that could easily happen in a sloppy workplace without the right safeguards.

I mean, I have scripts on my work machine right now that—if you somehow obtained and put in all the credentials and resource names for production instead of localhost—would basically blow everything away and cause a similar situation. And if for some reason the docs that you handed a new developer had the prod addresses etc. in it, and they had write access to prod... it's an accident waiting to happen. So, given that we're all dogs on the internet, it doesn't really matter if it's a hoax because we probably can't really tell; the important part is that it totally could happen as stated, and thus the lessons from it are probably pretty valid.

(I also have some nice scripts that recursively run git reset --hard HEAD && git checkout on every repo it can find starting from the current directory, which I think of as being sort of like a table saw with all the blade guards removed. Very convenient, very lazy, arguably very dangerous. You won't hurt anyone but yourself, but you'll hurt yourself real bad... As Neal Stephenson once wrote while making his own software / power tool analogy, "The danger lies not in the machine itself but in the user's failure to envision the full consequences of the instructions he gives to it.")
posted by Kadin2048 at 7:50 AM on June 4, 2017 [4 favorites]
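For anyone who hasn't seen the genre, the "table saw with the blade guards removed" script described above looks roughly like this. This is a hedged sketch, not Kadin2048's actual tooling, and it adds back one blade guard (a confirmation prompt) that the convenient versions usually omit:

```python
#!/usr/bin/env python3
"""Sketch of the 'nuke every working copy under here' convenience script
described above (not anyone's actual tooling). It walks the current
directory, finds git repositories, and hard-resets each one, with a
single confirmation prompt as the only remaining blade guard."""

import pathlib
import subprocess


def find_repos(root: pathlib.Path) -> list[pathlib.Path]:
    """Every directory under root that contains a .git directory."""
    return sorted(p.parent for p in root.rglob(".git") if p.is_dir())


def hard_reset(repo: pathlib.Path) -> None:
    """Throw away uncommitted changes, then discard untracked files too."""
    subprocess.run(["git", "-C", str(repo), "reset", "--hard", "HEAD"], check=True)
    subprocess.run(["git", "-C", str(repo), "clean", "-fd"], check=True)


def main() -> None:
    repos = find_repos(pathlib.Path.cwd())
    if not repos:
        print("no git repositories found under the current directory")
        return

    print("About to hard-reset (and discard uncommitted work in):")
    for repo in repos:
        print(f"  {repo}")

    if input("Type 'yes' to continue: ").strip() != "yes":
        print("aborted")
        return

    for repo in repos:
        hard_reset(repo)
        print(f"reset {repo}")


if __name__ == "__main__":
    main()
```

The convenience and the danger come from the same place: one command, applied recursively, with no notion of which working copies you actually meant.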


I DoSed the PDC on the first day of my first corporate job. I was told to 'get familiar' with the tools and the environment, so I ran an eEye Retina scan on the local subnet. It tried to do a little brute-force password guessing when it found the Domain Controller, and the account lockout policies (for ALL accounts, including administrator) took effect after 5 failed logins. I still laugh about that today, more than 15 years later.

I was not fired. The company bought another company less than a year later and went bankrupt because some C-level asshat (who I hear has a university building named after him now) didn't do his due diligence properly prior to the closing of the buy-out.

/anecdata
posted by some loser at 7:57 AM on June 4, 2017 [1 favorite]


Having a production setup that a new hire can destroy during their onboarding unless they're savvy enough to avoid doing so? Surely that merits some sort of institutional Darwin award for the company.
posted by acb at 7:59 AM on June 4, 2017 [4 favorites]


I'm surprised nobody is calling hoax, as in the guy who "accidentally deleted his entire company" a few months ago that turned out to be BS....

Could turn out to be a hoax, but that's true of everything you read on the internet. If someone crafted this little story then... good for them, I guess?
posted by Tell Me No Lies at 9:05 AM on June 4, 2017 [1 favorite]


posted by Tell Me No Lies

Eponysterical.
posted by radwolf76 at 9:09 AM on June 4, 2017 [1 favorite]


I arrived at a very large company once to find out that the small team I was joining had been working for over a year and had never done a single backup. Even after I got a basic backup mechanism in place they never took it seriously. Actually testing a recovery? Yeah right.

Good times.
posted by Tell Me No Lies at 9:10 AM on June 4, 2017 [2 favorites]


This was interesting to read for me as someone in a field which is not tech, but which does involve crisis moments of error-checking and then manipulating lots of data into closely-specified formats under time pressure. Each time, the end product is semi-unique, and so the process involves a lot of ad hoc elements. Things blow up a lot, even when you have tons of resources. My current job involves many fewer resources and, despite my having been rather careful in designing a process recently, things blew up in an ugly way for which the repercussions are still resounding. I failed to anticipate how incredibly bad someone could be at a very basic Excel task, managing not just to screw it up, but screw it up in a way that specifically served to mask the fact she'd screwed it up when I did my spot checks on her work. Like, if she had planned to sabotage the project, she could hardly have done better.

Moron-proofing is fucking expensive, man. It requires so much time, energy, and patience to envision every possible way a person could perform a task wrong. Especially in an environment where, frankly, you aren't working with the best of the best, or even the middle of the best. (It doesn't help that I have a character flaw in common with many tech people, which is to get frustrated with stupidity very easily. It's like my brain has accepted the unquestionable fact that errors are ubiquitous and processes must be designed around that rather than around the assumption that people can be made good at their work, but only down to a certain depth, and then it goes, "Are you freaking kidding me with this? I have to assume that if I tell you to walk down the hall, you will instead set fire to it?") It must be nice to be in a field where you can literally prevent many problems just by not giving permissions to people.
posted by praemunire at 10:07 AM on June 4, 2017 [7 favorites]


I worked for ten years for a huge Wall Street firm as part of the "security administration" group, which involves account creation, deletion, modification and validation. One quarter I drew the short straw in the group and had to sit in on the quarterly disaster recovery practice session. So there I am at 7 PM, still in the office (no skin off my teeth, I was getting flex time), on the conference call, when suddenly one of the guys says, "Uh. The restore isn't validating."

Dead. Silent.

"What?"

"The backups. They're not validating. So they won't restore."

"Shit."

Lots of yelling starts, people start throwing accusations around. I stay silent, because my part of it is for after the restore in the DR environment. Then I hear someone mention my department. I pick up the phone and say "Mephron from Sec Admin here." The bigwig in charge of this asks me to check something; I do, and see that the guy doing the restores has a different set of permissions in the disaster recovery environment than the production environment, in a way that would cause this exact issue. More squawking, and I go through logs and find out that it was specifically requested that way...

...by the bigwig in question. I sent him an IM through the internal system rather than calling him out on it on the call, and he proceeds to say, on the call, that I just IM'd him and told him very politely - his words - that he was the jackass causing this. He took responsibility for this in front of everyone when I carefully gave him an out for it, and that got him massive respect from everyone. It took five minutes for us to get the ticket pushed through to me - Sarbanes-Oxley compliance required it - and once I fixed permissions, everything worked fine. But that experience taught me two things:

1) this is why we do DR tests, to find these issues
2) someone who takes responsibility for these kinds of things and admits his mistakes, then works to get them fixed, is going to get a lot of respect from his people compared to the ones who try to cover them up until the shitstorm blows over.

(epilogue: the bigwig got replaced by someone whose entire way of doing things was based on the CFS method - Cover, Find Scapegoat. This resulted in a lot of people keeping parallel backups to cover their own asses when it appeared the scapegoating was coming to them, and a culture of paranoia coming into being. It made me happy to get laid off.)
posted by mephron at 11:13 AM on June 4, 2017 [26 favorites]


That's an unfortunately common outcome, mephron, in my experience. The non-technical management doesn't necessarily know what traits/virtues to value and ends up canning the more responsible people on the team, not realizing, I guess, that they're selecting for ass-covering and throwing teammates under the bus, which in the long run rots a dev team from the inside out. It's sort of the ugly workplace effect of the gotcha culture, as being seen as fallible at all becomes unacceptable.
posted by saulgoodman at 11:31 AM on June 4, 2017 [3 favorites]


If this story seems so far fetched as to be more likely to be a hoax, I have some bad news about the "real world"...

Others have covered the CTO's actions as an example of a toxic workplace, but I want to call out anyone at the company who noticed the footgun lurking in the documentation and didn't change it. Good internal documentation is a team effort, and the team failed (for whatever reason; toxic workplace culture seems quite possible).

Netflix allows all engineers production SSH access (via tools they built and open sourced) but that only works because company culture allows it - not the other way around. Blameless post-mortems with actionable remediation to fix tooling is a large part of it.

Snarking that if you make things idiot-proof the world will just produce a better idiot is counter-productive cynical punditry and should also be discouraged.
posted by fragmede at 12:24 PM on June 4, 2017 [4 favorites]


The longer I work in this field, the more I realize that mitigating production-level fuck-ups is a race you can't win. You can only make sure that you don't lose catastrophically.

So for every audit, security restriction, and test that you create, you have to know there are a handful you're missing and you need to be robust enough (and aware enough) that you're going to need to come up with new, novel responses on the fly. And if you don't have any of those things to begin with, along with a tested backup or failover strategy, you've kneecapped yourself before you even started to react.

He was clowned mercilessly, but to paraphrase Donald Rumsfeld, it's those unknown unknowns that are going to get you every time, man. At least know how to fix your known knowns and do a little digging on your known unknowns.
posted by mikeh at 2:26 PM on June 4, 2017 [2 favorites]


You could even argue this was a case of the elusive fourth quadrant, the unknown knowns. If you told someone at this company that you were handing the knowledge of the production database location and login to someone on their first day, would they have been cool with that? It sounds like the answer is a resounding no, but they did exactly that. They didn't even know what knowledge they were sharing.
posted by mikeh at 2:30 PM on June 4, 2017 [4 favorites]


Back in the early 90s, I worked for a small magazine with 20,000+ subscribers. One day one of the senior staff (well, technically we were all senior staff because this was a lesbian magazine run by a collective—so, say, one of the longest-standing members of the collective) got on our single computer to format a floppy. Of course, she made the classic mistake and told it to format drive C instead of drive A. There was no prompt saying, "Are you sure you want to do this?" The computer just did it.

It took out our entire subscriber database. There was no backup; getting around to setting up routine backups (which someone had to do by sitting at the computer feeding it endless floppies) had been on the general to-do list for ages.

We ended up having to re-enter the entire database from index cards on which each subscriber's info was written, a holdover practice from before personal computers were routinely available. And then we had to proof-read the whole thing.

While we were engaged in this tedious work, we chatted a lot about how mistakes are inevitable and how glad we were that we were in a profession where even a major mistake like this meant a whole lot of typing, and a magazine going out a bit late, rather than, say, a patient dying under our hand or a spaceship exploding.
posted by Orlop at 3:47 PM on June 4, 2017 [3 favorites]


Haha yeah before version control was a common thing, our release process was "rsync the dev code to prod"

How many times did I do that before I accidentally did the reverse? Well, a lot, but not enough.

A conscientious coworker had actually printed out our dev code for some reason the day before. I don't know why. But we used that "paper backup" to restore a lot of overwritten code. We even had our poor secretary transcribing code.

We started using CVS not long after that.
posted by RustyBrooks at 4:24 PM on June 4, 2017
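The lesson most of us took from that era, short of adopting version control, was to take the direction out of your fingers entirely: a release command with no arguments can't be typed backwards. A minimal sketch, with the paths and host invented for illustration:

```python
#!/usr/bin/env python3
"""Deploy-wrapper sketch: make 'rsync dev to prod' a zero-argument command
so the direction can never be typed backwards. Paths and host are
illustrative, not any real release process."""

import subprocess
import sys

DEV_DIR = "/srv/dev/site/"                         # trailing slash: sync contents
PROD_TARGET = "deploy@prod.example.com:/srv/www/site/"


def deploy(dry_run: bool) -> None:
    # Source and destination are hard-coded; only the dry-run flag varies.
    cmd = ["rsync", "-av", "--delete", DEV_DIR, PROD_TARGET]
    if dry_run:
        cmd.insert(1, "--dry-run")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Default to a dry run; pass --really to actually push.
    deploy(dry_run="--really" not in sys.argv[1:])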


Windows Remote Desktop was super laggy, back in the early 2000's. Like, so laggy your mouse click to the web server on the opposite side of the continent might get delayed long enough to expand the prod folder instead of the staging folder. Which is right below it in the folder tree. With exactly the same file structure. Like when you're testing a small change, and get frustrated because the file changes you're dragging over don't show up on the staging homepage.

And then you say fuck it and go for lunch.

And when you get back you see pants on fire.

But they're not yours.

Not just yet.
posted by CynicalKnight at 4:31 PM on June 4, 2017 [2 favorites]


As a DBA of 10+ years, I was more and more horrified reading that tale on reddit. But not a single bit of that horror came from what the OP had done.

First-day junior dev failed to follow instructions. *thinks "well, I hope this was first-day jitters and not the first in a trend."*

They put prod connection info in a plaintext doc available to first day employees? *shudders*

Those credentials had WRITE PERMISSION?! *crosses self*

They didn't have functional backups? *feels heart skip a beat, chest feels tight*

They told an employee to leave and never come back, but let the employee walk off premises with company equipment that apparently had a doc with prod connection info on it at a minimum, and failed to disable the employee's credentials for some time afterwards? *I need to go lay down, I'm feeling faint*
posted by jermsplan at 8:00 PM on June 4, 2017 [12 favorites]


My previous position was IT for a small manufacturer. They were years behind on updates to their critical ERP system, and the server hardware was old. We couldn't upgrade workstations until we upgraded the ERP, since it only ran on Windows XP and the new systems would all come with Windows 7. I was able to get a tape jukebox for backup (instead of the old single-tape drive), but it failed after the warranty was up and I wasn't given more money to replace the system.

By the time I left for a much better (and better-paid) job I was doing backups by Robocopy from server to USB drive, rotating the drives by taking them home each night.

Management didn't care, and didn't understand the need to care. I left, and they're still in business, so maybe someone wised up.
posted by lhauser at 9:17 PM on June 5, 2017


Ouch. Here's my boring story:

I was an in-over-my-head Network Administrator for a company of about 50 people back in the early oughties. I fell into it by default because I was the only person there who knew anything at all about IT or about Windows networking (which in turn I only knew from reading about it and tinkering with stuff at home; I had never actually done it professionally). This was an IT recruitment company, by the way. I was actually hired to write ad copy and maintain/clean up their database, which I also did, despite having never done those things either.

Anyway, I learned pretty quick and have to say I did a pretty good job, most of the time (though I let my ego get in the way a lot and so was pretty shitty at customer service). There was an external guy who had come in and set everything up originally, and who popped in once every couple of months, so I shadowed him whenever I could, and he was cool with answering simple questions via email (stuff I couldn't Google). If it got more complicated than that he charged his hourly rate, which was fair enough.

After a year or so the owner/CEO decided, or had been convinced, that his hardware was shit (it was) and forked out many, many thousands of dollars on a new dual-processor Small Business Server running Windows SBS 2003, and a new tape backup system. The external guy came in and he and I stayed back until 3, 5, 7 in the morning setting everything up over a long weekend. We set the old server up to handle overload or be a print server or something, and to act as an emergency backup for the all-important client database and MYOB.

The new SBS was beastly and at the time was the most impressive piece of hardware I had ever seen. Pretty sure it was an HP machine. Squat and chunky and on its own set of castors. It also had a RAID array, which I was dimly aware of as a "thing" from PC Authority or whatever, and the external guy said it was a mirrored array, not a striped array, across three disks. I was like: OK, and we left it at that. He showed me how to manage Active Directory, set up new users, manage Exchange, and run the backups; it was great and I picked up some neat skills that impressed everybody (at the company, anyway).

Not much later a Terrible Thing happened. Some dimwit got a virus on their workstation, and the virus got on the server. It was running Norton whatever-it-was and I had been very diligent with applying updates and rolling them out, but I was still essentially an ad writer and database maintainer who provided desktop support, and still didn't really "know" in the sense that somebody who has been doing a network admin job for more than a couple of years "knows", or who has received proper training and guidance "knows".

Also, to be honest, I wasn't getting paid very much, the pay was monthly and I was always broke, and I had just separated from my girlfriend and was horrifically depressed and very rarely thinking straight, and was in my 20s, when everybody on Earth who ever lived is the biggest idiot they will ever be. So while there was some level of motivation to learn things just to make my job easier, and because I was still nerdy and interested in stuff, there was very little motivation to "know".

First I heard of the Terrible Thing was when users reported that files were going missing. I was like, well, maybe somebody else deleted it? They were adamant that no, that wasn't the case, and they told me where it was happening and I jumped onto the server and had a look and saw that files were literally disappearing before my eyes, from multiple directories. Other files were being renamed and the RAID array was going absolutely mental with activity.

Did I mention these were "hot-swappable" drives? I don't think I did. Well, hot-swappable to me meant something that, at the time, it didn't actually mean. Like ammo mags on a pulse rifle, you clicked a little latch to one side, then a bigger latch got released, and then you pulled out the hot disk (because, hey, you needed to slam another one in there! "Swappable" meaning you can swap them and "hot" meaning while they're running, surely!).

And so I "hot-removed" one of the mirrored RAID disks, thinking, in my dumb panic, that I could at least save that one disk and we could later put it in some kind of magic forensic computer in a controlled lab and get the data off it. I'd yank the drive and then I could send out a network message asking people to log off because I was shutting the server down. So I pulled that drive out while the machine was still running.

Long story still long, that was a very expensive mistake for me to make, because it turned out backups weren't running properly, because most of the tapes were dodgy (we had never done a test restore). The SBS was fucked and needed to be rebuilt, and we lost about three days' worth of data, and the external guy had to call in another external guy to pull everything back together.

I wasn't fired, though I probably deserved to be. The owner was (it later transpired) an incredibly dodgy bastard but always had a soft spot for me and the way the external IT guy explained what had happened cast me in the best light. In retrospect, despite the shitty pay, I actually had it pretty good there. But I only stuck around for another three or four years and have, thankfully, never again made a mistake that big (or even half as big) in my working career.

TL;DR - Pulled a "hot-swappable" drive out of a running mirror RAID array because the machine had a virus on it and I wanted to save the data before following standard server shutdown procedures.
posted by turbid dahlia at 9:40 PM on June 5, 2017 [3 favorites]




This thread has been archived and is closed to new comments