Or The Agency Problem Within Engineering Teams.
A while back I decided it’s time for a change and went to find my second engineering adventure. As I’ve been enjoying a month off in between the
jobs and subjecting myself to the brain dump of Nassim Taleb’s Skin In
I realized there’s one main learning I got both out of the job hunt and my
TL;DR — Engineering work suffers if technical decision makers
don’t participate in experiencing the consequences of their decisions.
Let me illustrate with these examples :
This team is responsible for helping developers deploy and operate their
software. When asked about the biggest challenge they are facing they do
not mention Kubernetes scaling issues (be honest, that’s what you
expected to read here, right?) but instead explain being stuck with the
architects. Apparently it’s not uncommon for the architects to make
design decisions which Team ABC knows upfront will lead to operational
issues down the (pretty short) line.
The Team ABC, not being able to convince the architects to amend their
designs, watches the changes being implemented, bake for a while, and
take down the production. Team ABC helps mitigate and hotfix the issues,
goes back for round two of discussions with the architects and spends a
lot of energy trying to get a systems design that will actually work.
Team XYZ is in charge of developing and operating a cloud native
infrastructure service Alpha. They get temporary backup from engineers
from another team to deliver new functionalities. Those engineers come
back from the project and share their surprise of finding out that
service Alpha does not have local development environments or a
comprehensive end-to-end testing suite. This recently resulted in
downtime or service impairment caused by the deployment of the new
functionality developed by the temporary backup engineers.
To explore the potential for relieving some engineering pain and
improving production stability Team XYZ is asked about the lack of the
local environments. One of the voices mentions those were not
prioritized in the roadmap due to the belief of some of the decision
makers that all engineers deploying to production should understand
their changes well enough to avoid any deployment issues .
The follow up question to establish the agency and power dynamic comes
naturally — “When was the last time said decision makers deployed code
to production?” — and earns a good round of chuckles.
Now let’s take a step back and look at some references.
Skin In The Game
The aforementioned Taleb uses the following distinction to explain the
asymmetry of risk taking (I selected my favourite one out of a list,
which I believe —with a grain of salt— to be least Taleb-style
biased and most representative of the idea):
No skin in the game: Keeps upside, transfers downside to others, owns a hidden option at someone else’s expense — e.g. bureaucrats, policy wonks.
Skin in the game: Keeps his own downside, takes his or her own risk — e.g. citizens.
Skin in the game of others, or soul in the game: Takes the downside on behalf of others, or for universal values — e.g. saints, knights, warriors, soldiers.
Taleb then follows with a host of examples and anecdotes about finance
executives taking their bonuses after tanking the very banks they were
leading in 2008, consultants who don’t get to experience the
consequences of their own advice, or politicians who cut education
spending while sending their kids to private schools.
The main point Taleb is making is as follows: asymmetries in the risk
taking within a society or organisation lead to creating rotten systems,
in which people don’t suffer the negative consequences of their
decisions, and therefore are not properly motivated to care about those
consequences, i.e. they don’t have the “skin in the game”.
You may feel like saying “Well, we’ve got a lot of work to do and we’re
short-staffed, so there’s no time for philosophizing about motivations
and incentives”. If you’re tempted to say this I’d urge you to pause for
a second and consider these:
1. What is the cost of downtime of service impairment for your business?
Say you’re business-critical systems have been at 99.9% availability
last year and half of that downtime (roughly 4.3 hours) was due to a bad
design decision. How much did it cost your company?
2. You’ve got 10 engineers on your team. Your engineers can’t spin up a
local development environment and instead have to deploy their work in
progress to a shared environment, spending half a day instead of 5
minutes to inspect the effect of their code changes on the whole system.
How much of their salary is wasted coordinating lengthy deploys into
test environments weekly, monthly, yearly? How much is that costing your
3. You have your architect, senior SRE and a senior Dev struggle to come
to an agreement on system design in a two hour meeting. This happens
twice a month, and still does not prevent the downtime. How much does
that affect your bottom line?
All this is to say that taking the time to reflect on and improve the
alignment of your team on high development velocity with minimal risk of
downtime will pay for itself. It is not just affecting how much people
like working with each other (which isn’t necessarily
business-critical), it’s directly affecting the very job they were
hired to do — to serve the customers and bring value to the business.
Consider The Alternatives
Bring the skin to the game of decision makers by having them deploy code
regularly, or by giving them a pager and adding them to the oncall
rotation. Watch how their decisions change when they know they may be
the ones waking up at 2AM to resolve a production incident.
If for some reason it’s not possible for them to participate in the
regular development work or on call rotation, then at least make sure
that “removing engineering pain” is a part of their performance
Beware of handing the bulk of the decision making power to people who
are not participating in the everyday engineering work. Their
incentives, motivations, and goals may hinder your whole team’s