It was a change to the database that is used to generate a bot management config file; that file was the proximate cause of the panics. The kind of observability that would have helped here is “panics are elevated, and here are the binary and config changes that preceded them,” along with a rollback runbook covering both.
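
A minimal sketch of what that might look like, assuming a hypothetical change log that records both binary deploys and config pushes (the identifiers below are made up, not Cloudflare's): when a "panics elevated" alert fires, list the changes that landed in the preceding window, newest first, as rollback candidates.

```go
// Sketch only: correlate an elevated-panic alert with recent changes.
package main

import (
	"fmt"
	"sort"
	"time"
)

// Change is one entry in a hypothetical fleet-wide change log.
type Change struct {
	ID      string    // e.g. a deploy tag or config version (illustrative)
	Kind    string    // "binary" or "config"
	Applied time.Time // when it reached the fleet
}

// changesBefore returns the changes applied within `window` before the alert,
// newest first, since the most recent change is the usual rollback candidate.
func changesBefore(alertAt time.Time, window time.Duration, changes []Change) []Change {
	var out []Change
	for _, c := range changes {
		if c.Applied.Before(alertAt) && alertAt.Sub(c.Applied) <= window {
			out = append(out, c)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Applied.After(out[j].Applied) })
	return out
}

func main() {
	now := time.Now()
	history := []Change{
		{ID: "edge-proxy 2025.11.1", Kind: "binary", Applied: now.Add(-6 * time.Hour)},
		{ID: "bot-mgmt feature file v1234", Kind: "config", Applied: now.Add(-20 * time.Minute)},
	}
	// Pretend a "panics elevated" alert just fired; look back one hour.
	for _, c := range changesBefore(now, time.Hour, history) {
		fmt.Printf("candidate rollback: %-6s %s (applied %v ago)\n",
			c.Kind, c.ID, now.Sub(c.Applied).Round(time.Minute))
	}
}
```

The point is that the alert and the change log live in the same system, so the rollback target is one query away rather than a cross-team investigation.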

Generally, I would say we as an industry are more nonchalant about config changes than about binary changes. Where an org might have great processes and systems in place for binary rollouts, the whole fleet could be reading config straight from a database in a much more lax fashion. Those setups are actually quite risky.
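
As a rough illustration of treating a generated config like a binary release rather than a live database read: validate the artifact against the invariants its consumers assume, canary it on a small slice of the fleet, and only then promote it. This is a hedged sketch under those assumptions, not Cloudflare's actual pipeline, and the 200-feature cap is a placeholder rather than a documented limit.

```go
// Sketch only: gate a generated config behind the same checks a binary gets.
package main

import (
	"errors"
	"fmt"
)

// BotConfig stands in for a generated bot management feature file.
type BotConfig struct {
	Features []string
}

// validate enforces the invariants the consuming edge code assumes,
// rejecting the artifact before it ever leaves the pipeline.
func validate(cfg BotConfig) error {
	const maxFeatures = 200 // hypothetical cap, for illustration
	if len(cfg.Features) == 0 {
		return errors.New("config has no features")
	}
	if len(cfg.Features) > maxFeatures {
		return fmt.Errorf("config has %d features, limit is %d", len(cfg.Features), maxFeatures)
	}
	return nil
}

// rollout promotes the config in stages instead of letting the whole fleet
// read it from the database at once. canaryHealthy stands in for real health
// checks (panic rate, error rate) on the canary slice.
func rollout(cfg BotConfig, canaryHealthy func() bool) error {
	if err := validate(cfg); err != nil {
		return fmt.Errorf("rejected before rollout: %w", err)
	}
	fmt.Println("pushed to canary (1% of fleet)")
	if !canaryHealthy() {
		return errors.New("canary unhealthy, rolling back config")
	}
	fmt.Println("promoted config to full fleet")
	return nil
}

func main() {
	cfg := BotConfig{Features: make([]string, 60)}
	if err := rollout(cfg, func() bool { return true }); err != nil {
		fmt.Println("rollout failed:", err)
	}
}
```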



I am genuinely curious (albeit skeptical!) how an operator like Cloudflare could make that kind of feedback loop work at scale.

Even within CF’s “critical path” alone, there must be dozens of interconnected services and systems. How do you close the loop between an observed panic at the edge and a database configuration change N systems upstream?



