The Silent Sampler

We turned on trace sampling to save money. One percent. It seemed sensible. We were paying for a hundred times the traces we ever actually looked at, and the ones we looked at were almost always the normal, boring, successful requests anyway. The bill dropped. Finance was happy. We were happy.

The quiet months

For three months nothing went wrong, which is exactly the kind of stretch that makes you forget a decision was ever a tradeoff. The sampled traces were plenty for our weekly performance reviews. We optimized a few slow endpoints. We felt good about the system and good about the savings.

The bug

Then a rare bug began corrupting orders. Not often. Maybe one request in two thousand. Enough to matter, enough to generate angry tickets, not enough to show up as a meaningful blip on any aggregate graph.

We went to the traces to find it. This is what tracing is for. This is the whole promise. Follow the broken request through every service and see where it goes wrong.

There were no traces. There were never going to be any traces. The one-in-two-thousand bug and the one-in-a-hundred sample rate had, statistically, almost never been in the same room together: a sampled trace of a corrupted order would turn up about once in every 200,000 requests. We had perfect, detailed, cost-effective visibility into all the requests that were completely fine, and total darkness over the only requests we needed to see.
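For anyone who wants to check the odds against their own numbers, here is a rough sketch of the arithmetic. It assumes the sampler chose requests independently of whether they hit the bug, which is what a flat percentage sample does. The two rates come from the story above; the function names and any traffic volume you plug in are hypothetical.

```python
from fractions import Fraction

# Rates taken from the story; everything else here is illustrative.
BUG_RATE = Fraction(1, 2000)    # roughly one corrupted order per 2,000 requests
SAMPLE_RATE = Fraction(1, 100)  # 1% of requests keep their trace


def expected_bad_traces(total_requests: int) -> Fraction:
    """Expected number of corrupted requests that were also sampled.

    Assumes the sampling decision is independent of the request's outcome.
    """
    return total_requests * BUG_RATE * SAMPLE_RATE


def chance_of_no_bad_trace(total_requests: int) -> float:
    """Probability that not a single corrupted request was sampled."""
    p_both = float(BUG_RATE * SAMPLE_RATE)
    return (1.0 - p_both) ** total_requests
```

Under these assumptions, even after 200,000 requests, where you would expect one such trace on average, the chance of having none at all is still a bit over one in three.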

What we changed

We switched to tail-based sampling, so the decision to keep a trace happens after the request finishes, when we already know whether it errored or ran slow. Errors and high-latency requests are always kept. We still sample the boring successes. We no longer let a cost knob quietly decide that the rare and the broken are the first things worth throwing away.
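The keep-or-drop rule itself is small. What follows is a minimal sketch of the idea, not our production configuration: the latency threshold, the baseline rate, and the names FinishedTrace and should_keep are all made up for illustration.

```python
import random
from dataclasses import dataclass

# Hypothetical values; the post does not state the real thresholds.
LATENCY_THRESHOLD_MS = 1000.0
BASELINE_SAMPLE_RATE = 0.01  # still sample 1% of the boring successes


@dataclass
class FinishedTrace:
    """A completed request, summarized after all of its spans are in."""
    trace_id: str
    had_error: bool
    duration_ms: float


def should_keep(trace: FinishedTrace, rng: random.Random) -> bool:
    """Tail-based decision: runs only once the request's outcome is known."""
    if trace.had_error:
        return True  # errors are always kept
    if trace.duration_ms >= LATENCY_THRESHOLD_MS:
        return True  # slow requests are always kept
    # Everything else is an ordinary success, sampled at the baseline rate.
    return rng.random() < BASELINE_SAMPLE_RATE
```

In practice this kind of rule usually lives in the tracing pipeline rather than in application code, because something has to hold on to a trace's spans until the request finishes before the decision can be made.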