Engineering Empathy in Post-Growth Startups
The "service discovery" of human connection
The first time I watched a senior engineer give their notice, I was sitting in an open-plan office surrounded by monitors. Half of them were displaying dashboards: service uptime, error rates, p99 latency, memory consumption across a dozen microservices. We knew, at any given moment, the precise health of every system we had built. We had alerts configured to wake someone up at 3 a.m. if a single service degraded past a defined threshold.
We had no equivalent visibility into the people running those systems. The engineer who resigned had been running at critical load for six months. We found this out in the exit interview.
I have thought about that moment for years, and what I keep returning to is not the failure of management, exactly, though there was that. What I keep returning to is the infrastructure gap. We were a team that cared deeply about observability. We had Prometheus scraping metrics, Grafana rendering them into dashboards, PagerDuty routing alerts to the right people at the right time. We had runbooks for every failure mode we could imagine. We had blameless postmortems when things went wrong. We had, in other words, a rigorous and sophisticated framework for understanding the state of our technical systems, and we applied approximately none of that rigor to understanding the state of our team.
This is the central contradiction of the high-performing startup: the same engineering culture that demands observability, redundancy, and graceful degradation from its software routinely builds teams with none of those properties.
In distributed systems, service discovery is the mechanism by which services locate each other on a network. The naive approach is to hardcode an address: Service A always lives at this IP, this port. It works until it doesn’t, which is usually at the worst possible time. Services scale up and down, get replaced, move across infrastructure. The hardcoded address becomes a lie the moment anything changes. Proper service discovery replaces static assumptions with dynamic queries: before communicating, a service checks a registry to find out where the thing it needs actually is right now.
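To make the contrast concrete, here is a minimal sketch in Python, with an in-memory dictionary standing in for a real registry like Consul or etcd. The service names and addresses are invented for illustration:

```python
# The hardcoded approach: a static assumption baked in at deploy time.
PAYMENTS_ADDR = "10.0.3.17:8080"  # becomes a lie the moment the service moves

# The service-discovery approach: ask a registry at call time.
# (Hypothetical in-memory registry; a real one would be Consul, etcd, etc.)
registry = {
    "payments": ["10.0.5.2:8080", "10.0.5.9:8080"],  # currently healthy instances
}

def discover(service_name: str) -> str:
    """Return a live address for the service as of right now."""
    instances = registry.get(service_name)
    if not instances:
        raise LookupError(f"no healthy instances of {service_name!r}")
    return instances[0]  # a real client would load-balance across instances

# Every call re-checks reality instead of trusting a stale model.
addr = discover("payments")
```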
The way we relate to our colleagues is almost always hardcoded. We know where to find them in the sense that we know their Slack handle and their calendar. We have a model of who they are that was formed during onboarding or a handful of 1:1s and has been gently calcifying ever since. We treat that model as an address: stable, reliable, findable at the same location it has always been. We do not run service discovery against the actual person. We do not check the registry to find out where they actually are right now, what load they are running under, whether the endpoint that used to respond in milliseconds has started timing out.
The consequences of this in a post-growth startup are specific and severe. The post-growth environment, which is to say the environment most startups have been operating in since the funding climate tightened and the easy money stopped, is characterized by reduced headcount doing increased work under elevated uncertainty. The team that once had twelve engineers now has seven. The roadmap has not shrunk proportionally. Everyone is, by any reasonable measure, running at high utilization. High utilization, in infrastructure terms, is when you want your observability to be sharpest, because a system running at 90% capacity has almost no margin before it starts degrading, and degradation under load tends to be nonlinear. A small additional pressure produces a disproportionately large failure.
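That nonlinearity is not hand-waving; it falls out of basic queueing theory. A toy sketch using the textbook M/M/1 result, where average time in the system scales with 1/(1 − utilization). This is my illustration, not a measurement from any team in this essay:

```python
# Why degradation under load is nonlinear: in a simple M/M/1 queue,
# average time in the system grows as 1 / (1 - utilization).

def slowdown(utilization: float) -> float:
    """Average time in system, as a multiple of the unloaded service time."""
    assert 0 <= utilization < 1
    return 1 / (1 - utilization)

for u in (0.50, 0.80, 0.90, 0.95):
    print(f"{u:.0%} utilized -> {slowdown(u):.0f}x slower")
# 50% -> 2x, 80% -> 5x, 90% -> 10x, 95% -> 20x.
# The last five points of utilization cost as much as the first ninety.
```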
The technical term for designing against this is graceful degradation: when load exceeds capacity, the system sheds work in a controlled way rather than collapsing entirely. You implement circuit breakers, a pattern popularized in distributed systems by teams like Netflix, which detect when a downstream service is struggling and stop sending it requests before it fails completely. You shed the less critical work first. You protect the core.
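A minimal sketch of the pattern, pared down from what a production library like Hystrix or resilience4j actually does; the thresholds here are arbitrary illustrations:

```python
import time

class CircuitBreaker:
    """A toy circuit breaker: stop calling a struggling dependency
    before it fails completely. Thresholds are illustrative."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds before retrying ("half-open")
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: shedding load, not calling downstream")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker: protect the core
            raise
        self.failures = 0  # success resets the count
        return result
```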
The human equivalent of a circuit breaker is a manager who notices that someone is at capacity and removes work from their plate before the system fails. This sounds obvious stated plainly. It almost never happens in practice, because the observability layer does not exist. You cannot react to a signal you cannot see. And the signals human beings emit when they are approaching failure are often counterintuitive: the overloaded engineer who gets quieter in meetings, more efficient in communication, less likely to push back on scope. From the outside, this can look like performance. It is often the last stage before the resignation letter.
What would it mean to actually apply engineering discipline to team health? It would mean defining what “healthy” looks like before you need the definition. Not in the vague language of culture decks (“we are a team that supports each other”) but in the operational language of SLOs: we do not assign more than two concurrent project threads to a single engineer; a person who has flagged overload gets scope removed within five business days; managers conduct genuine load assessments in 1:1s on a defined cadence rather than asking “how are you doing?” as a social ritual that everyone understands to mean nothing.
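To show what writing that down could look like, here is one hedged sketch: the example thresholds above expressed as checkable definitions. Every field name and number is illustrative, not a standard.

```python
# The essay's example team SLOs as checkable definitions.
# All names and thresholds are illustrative, not a standard.

from dataclasses import dataclass

@dataclass
class TeamSLO:
    max_concurrent_threads: int = 2      # project threads per engineer
    overload_response_days: int = 5      # business days to remove scope
    load_check_cadence_days: int = 14    # genuine 1:1 load assessment

@dataclass
class EngineerState:
    name: str
    concurrent_threads: int
    days_since_overload_flag: int | None  # None if no flag raised
    days_since_load_check: int

def violations(e: EngineerState, slo: TeamSLO) -> list[str]:
    out = []
    if e.concurrent_threads > slo.max_concurrent_threads:
        out.append(f"{e.name}: {e.concurrent_threads} concurrent project threads")
    if (e.days_since_overload_flag is not None
            and e.days_since_overload_flag > slo.overload_response_days):
        out.append(f"{e.name}: overload flag unresolved for {e.days_since_overload_flag} days")
    if e.days_since_load_check > slo.load_check_cadence_days:
        out.append(f"{e.name}: no load assessment in {e.days_since_load_check} days")
    return out
```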
It would mean building runbooks for human failure modes the same way you build them for infrastructure failure modes. When a team member is showing early signs of burnout, the runbook says: first, do this. Then this. This is who you escalate to if those steps don’t resolve the situation. The runbook exists not because leadership is indifferent but because without documentation, the response to a human incident is improvised by whoever is closest to it, which produces inconsistent outcomes and guarantees that whatever happened will happen again.
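One possible shape for such a runbook, kept in version control the way infrastructure runbooks are; the specifics are invented for illustration:

```python
# A hypothetical human-incident runbook, structured like an
# infrastructure runbook. Entirely illustrative.

BURNOUT_EARLY_SIGNS = {
    "trigger": "engineer quieter in meetings, no pushback on scope, for 2+ weeks",
    "steps": [
        "1:1 within 48 hours: ask about load directly, not 'how are you doing?'",
        "audit their concurrent project threads against the team SLO",
        "remove or reassign scope, and announce the change so it sticks",
    ],
    "escalate_to": "skip-level manager if no improvement in 10 business days",
    "postmortem": "required if the same person flags overload again within a quarter",
}
```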
It would mean treating the exit interview as the postmortem it actually is. Not as a formality to be completed before offboarding paperwork but as a serious attempt to understand what the system failed to detect and when. And it would mean, as with any good postmortem, making the process blameless. Not “what did this manager do wrong” but “what did this team’s observability layer fail to surface.”
I am autistic, which means I have spent most of my career building explicit internal models of social systems that other people navigate intuitively. It also means I have a reasonably high tolerance for making the implicit explicit, for saying the thing that everyone in the room already knows but no one has put in writing. What I know from two decades of watching teams work and fail is that the most expensive operational mistake a startup can make is not a database migration gone wrong or an API design you have to reverse. It is treating your people as infrastructure that requires no monitoring until it goes down.
Your Kubernetes cluster does not wait until it is failing to tell you it needs attention. You designed it that way on purpose. You can design your team that way too, if you decide that the people running the cluster are worth the same engineering rigor as the cluster itself.
Most organizations have not yet decided that. The exit interviews will keep being surprises until they do.


