otsukare Thoughts after a day of work

Velocity 2011 - Advanced Postmortem Fu and Human Error 101

Etsy is hiring. A Web site is always a complex, dynamic system, and there are always fundamental surprises. A love affair is a fundamental surprise for the one who discovers they have been cheated on. There is always a balance between efficiency and thoroughness. Organizations are the blunt end: they provide the scaffolding for the people at the sharp end, the operators. A post-mortem is about telling the past. We do them to understand the system we are working on: networks, servers, applications, processes, people. The people are part of the system.

The Conversation, the film with Gene Hackman

To check that, you need data and graphs: IRC logs, annotated traces. It is important to inject the history of events into the graphs.

  • When did things start?
  • When were they detected? (TTD, time to detect)
  • When were they resolved? (TTR, time to resolve)

Basically: how do you monitor the system, and what is your ability to resolve a bad situation?
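The TTD/TTR idea above can be sketched in a few lines. This is a minimal illustration, not anything from the talk: the incident timestamps are invented, and I assume both metrics are measured from the moment things started.

```python
from datetime import datetime

# Hypothetical incident timeline; the timestamps are illustrative.
incident = {
    "start": datetime(2011, 6, 14, 9, 0),       # when things started
    "detected": datetime(2011, 6, 14, 9, 12),   # when the problem was detected
    "resolved": datetime(2011, 6, 14, 10, 30),  # when it was resolved
}

def ttd(incident):
    """Time to detect: delay between start and detection."""
    return incident["detected"] - incident["start"]

def ttr(incident):
    """Time to resolve: delay between start and resolution.
    (Assumption: measured from start, not from detection.)"""
    return incident["resolved"] - incident["start"]

print("TTD:", ttd(incident))  # 0:12:00
print("TTR:", ttr(incident))  # 1:30:00
```

Annotating graphs then amounts to plotting these three timestamps as vertical markers over the metrics.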


We have to define what is important to our own system; the same severity levels do not fit every Web site. John has defined levels of severity for Etsy. Because a post-mortem is about the past, it is filled with a lot of stress and misrepresentation of what actually happened.
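A severity scale is just an explicit, shared mapping from impact to response. Etsy's actual levels were not detailed in the talk, so the levels and wording below are purely illustrative:

```python
# Hypothetical severity scale; level names and rules are invented
# for illustration, not Etsy's actual definitions.
SEVERITIES = {
    1: "site down or data loss: page everyone immediately",
    2: "major feature broken for many users: page on-call",
    3: "degraded performance or partial outage: fix during business hours",
    4: "cosmetic issue: file a ticket",
}

def severity_label(level):
    """Return the response rule for a severity level, if defined."""
    return SEVERITIES.get(level, "unknown")

print(severity_label(1))
```

The point is less the exact levels than that they are written down before the incident, not argued about during it.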

Ignoring how people feel is a huge mistake; it has to be taken into account. Some crisis patterns involve things which have not been fixed for a long time, and that breeds shame.

When the response takes a very long time, the complex system enters a phase where stress grows exponentially until things are stable again. What happens during this time has a strong influence on how the company and its technology evolve. It challenges everything: people's roles, their capabilities.

There are situations where you are not stuck enough and some where you are too stuck. Heroism is wrong: if the hero doesn't fix it, we lose time; if the hero does fix it, we come to rely on singular people.

Improvisation is key to success. It helps to fix unexpected events.

Hindsight bias

Knowledge of the outcome influences the analysis of the process. Reality is a lot more complex than the scenario you had imagined. Before the accident, the future seems implausible; after the accident, it seems obviously clear: "how could they not see the mistakes?" People have a tendency to want to be right instead of being objective.

Root Cause Analysis

There are many methods. A popular one is the "five whys": by repeating "why?" a few times, we finally get to the root of the issue. It is a good method for getting people to talk about the incident, but it is a sequential model, while systems are more complex. Sequence-of-events models are easy to explain and easy to understand, so they are reassuring, but they do not take the surrounding circumstances into account. They are not helpful.

The domino model is better but still unhelpful.

There are always multiple contributors. A systemic model is a lot closer to reality, but we have to include in it sickness, absent people, budget, etc., with a lot of interdependencies with the computing systems. The blunt end always needs to be included. Functional resonance matters: people in a bad mood, the finance department.

Causes are constructed, not found.

We have preconceived notions about "causes" and behaviors. We have to talk about contributors, not causes. There is no single root cause of your failure.

Human error

Nobody comes to work to do a bad job. The work always makes sense to the person doing it; it is not bad will. "Human error" is useless as a label and as an end point for discussion. "Be more careful" doesn't fix anything. Human error isn't a cause, it's an effect. Ask instead why it made sense at the time for the person. Error categories? Slips, lapses, mismatches, violations. Useless: they do not help to prevent errors.

What goes right, and why?

You have to look at what you do right; there will be more things to look at. Why don't we fail all the time? Be open and clear about your mistakes: "don't do what I just did." Examine how changes (at all layers) will produce new vulnerabilities. Use technology to support and enhance human expertise. Automation is not a holy grail; there will always be a machine-human boundary. Explore new forms of feedback. The systems are not inherently safe. Somehow pre-mortems are better than post-mortems. It is what I call optimization.

Adjust culture

No blame, no name, no shame: blaming is not helpful. Punishment does not lead to good results, and punishing as a deterrent is a losing proposition. Firing people, cutting pay, etc., create more stress and will certainly lead people to hide future failures, which is bad.

Making errors is not a choice, so why do they happen? There are discretionary spaces where humans have control over situations at all levels, and then there is the idea of accountability. There is a blurry line between acceptable and unacceptable. The question is not necessarily where the line is, but who draws it, and whether people are aware of it. You need to make people aware of the position of the line.

By supporting learning you increase accountability: you make people responsible for their failures. "I screwed up and I will fix it." People can own their own stories, which allows them to educate the organization; everyone benefits. It creates anticipation of failures. Accountability also means authority: people must have the possibility to fix things.


How do we include people outside of the organization in the process? Different vendors have different ways; if they do not align with yours, then you are out of luck.

It might be difficult to engage managers in this new culture.

The search for causality in our cultures has to be blamed on René Descartes.

You might want to test your infrastructure by creating failures on purpose. But you can also put the team in a disorienting scenario that looks like a failure, to find out what the challenges would be if that scenario ever happened.
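The first half of that idea, creating failures on purpose, can be sketched with a tiny fault-injection wrapper. Everything here is hypothetical: the `fetch_user` backend, the failure rate, and the exception are invented for illustration, not from the talk.

```python
import random

class FaultInjector:
    """Wrap a dependency so that some calls fail on purpose,
    as in a 'game day' exercise."""

    def __init__(self, func, failure_rate=0.5, rng=None):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise RuntimeError("injected failure (game day)")
        return self.func(*args, **kwargs)

def fetch_user(user_id):
    # Hypothetical backend call standing in for a real dependency.
    return {"id": user_id, "name": "example"}

# Seeded RNG so the exercise is repeatable.
flaky_fetch = FaultInjector(fetch_user, failure_rate=0.3, rng=random.Random(42))

failures = 0
for i in range(100):
    try:
        flaky_fetch(i)
    except RuntimeError:
        failures += 1
print("injected failures out of 100 calls:", failures)
```

Watching how the team detects and works around the injected failures is the point of the exercise; the wrapper itself is trivial on purpose.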