It Was the Logs All Along

A service started getting slow. So it logged more. Verbose error context on every failed request, full stack traces, the entire request payload, all of it written to disk so that we could debug the slowness later. This seemed not just reasonable but responsible.

The spiral

The extra logging filled the log volume faster than the log shipping agent could drain it. The agent, unable to ship fast enough, began buffering log lines in memory so as not to lose them. The growing memory buffer added memory pressure to a host that was already under load.

The memory pressure triggered the kernel out-of-memory killer. The OOM killer, looking for the largest, hungriest process to sacrifice, chose the service itself. The service died. Orchestration restarted it. The fresh instance came up into the same flood of traffic, failed the same requests, and logged the same verbose errors, only now faster, because a cold cache fails more.

More failures meant more error logs. More error logs filled the volume faster. The volume filling faster meant more buffering. More buffering meant more memory pressure. More memory pressure meant another visit from the OOM killer.

The realization

We sat watching the dashboards as the cycle tightened, and someone said it out loud. We had built a machine whose only purpose, in its final hours, was to describe in exquisite detail the manner of its own death. Every log line was both a clue and a cause. The harder it tried to explain itself, the faster it died.

What we changed

We capped log volume per instance and made the logging path drop lines rather than buffer without limit when shipping falls behind. We moved log shipping off the host’s critical resources. And we rate-limited error logging so that a service drowning in failures does not also drown in its own description of them.