Kubernetes itself is built around mostly solid distributed system principles.
It's the ecosystem around it that makes things needlessly complex.
Just because you have Kubernetes doesn't mean you need Istio, Helm, Argo CD, Cilium, and whatever half-baked thing the CNCF pushed yesterday.
Take Helm, for example. Its templating is atrocious, and unless this has changed, it has no way to order resources properly other than hooks. Sometimes resource A (a Deployment) depends on resource B (some CRD).
The culture around Kubernetes dictates that you bring in everything the CNCF pushes, and most of it is half-baked MVPs.
---
The word "DevOps" has created the expectation that backend developers should be the ones fighting Kubernetes when something goes wrong.
---
Containerization is done poorly by many orgs, with no care for security or image size. That's a rant for another day; I suspect it isn't a big reason for the Kubernetes hate here.
I'm going to touch a nerve and say most orgs overengineer observability. There's the whole topology of OTel tools, Prometheus tools, and a bunch of long-term storage/querying solutions, plus very complicated tracing setups. All of that is fine if you have a team dedicated solely to maintaining observability. But your average product-development org can sacrifice most of it and do fine with proper logging with a request context, plus some important service-level metrics + Grafana + alarms.
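A minimal sketch of what that logging baseline could look like (Python stdlib only; the names and log format are made up, and in a real web app the request id would be set by middleware):

```python
import logging
import uuid
from contextvars import ContextVar

# Hypothetical per-request context; middleware would set this once per request.
request_id: ContextVar[str] = ContextVar("request_id", default="-")

class RequestContextFilter(logging.Filter):
    """Stamp every log record with the current request id."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s request_id=%(request_id)s %(name)s %(message)s"))
handler.addFilter(RequestContextFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

log = logging.getLogger("checkout")

def handle_request():
    request_id.set(uuid.uuid4().hex)
    log.info("payment accepted")  # every line in this request now carries the same id
```

That, plus a handful of service-level metrics in Grafana and a couple of alarms, covers a surprising amount of ground.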
The problem with all of those tools is that each one seems like an essential feature to have, but once you have the whole topology of 50 half-baked CNCF containers set up in "production", shit starts to break in very mysterious ways, and these observability products tend to cost a lot on top of that.
The ratio of 'metadata' to data is often hundreds or thousands to one, which translates to cost, especially if you're using a licensed service. I've been at companies where the analytics and observability costs were 20x the actual cloud-hosting cost of the application. Datadog seems to have switched to revenue extraction in a way that would make Oracle proud.
Is that 20x cost... actually bad though? (I mean, I know Datadog is bad. I used to use it and I hated its cost structure.)
But maybe it's worth it. Or at least, the good ones would be worth it. I can imagine great metadata (and platforms to query and explore it) saving more engineering time than it costs in server time. So to me this ratio isn't that material, even though it looks a little weird.
The trouble is that o11y costs developer time too. I've seen both traps:
Trap 1: "We MUST have PERFECT information about EVERY request and how it was serviced, in REALTIME!"
This is bad because it ends up being hella expensive, both in engineering time and in actual server (or vendor) bills. Yes, this is what we'd want if cost were no object, but it sometimes actually is an object, even for very important or profitable systems.
Trap 2: "We can give customer support our pager number so they can call us if somebody complains."
This is bad because you're letting your users suffer errors that you could have easily caught and fixed for relatively cheap.
There are diminishing returns with this stuff, and a lot of the calculus depends on the nature of your application, your relationship with its consumers, your business model, and a million other factors.
Family in pharma had a good counter-question to rationally scope this:
"What are we going to do with this, if we store it?"
A surprising amount of the time, no one has a plausible answer to that.
Sure, sometimes you throw away something that would have been useful, but that posture also saves you from storing 10x things that should never have been stored, because they never would have been used.
And for the things you wish you'd stored... you can re-enable that after you start looking closely at a specific subsystem.
I agree that this is the way, but the problem with this math is that you can't, like, prove that the one thing in ten you could have saved but didn't wouldn't have been 100x as valuable as the nine you didn't end up needing. So what if you saved $1000/yr in storage if you also had to throw out a million-dollar feature you didn't have the data for? There's no way to actually calculate this stuff, so ultimately you have to go by feel, and if the people writing the checks have a different feel, they will get their way.
For what it's worth, I found it almost trivial to set up OpenTelemetry and point it at Honeycomb. It took me an afternoon about a month ago for a medium-sized Python web app, and I've found it replaces a lot of the tooling and manual work I needed in the past. At previous startups it's usually been something like:
1. Set up basic logging (now I just use otel events)
2. Make it structured logging (Get that for free with otel events)
3. Add a request context that's sent along with each log (Also free with otel)
4. Manually set up tracing ids in my codebase and configure it in my tooling (all free with otel spans)
Really, I was expecting to have to buy deep into the new observability philosophy to get value out of it, but I found myself really loving this setup with minimal work and minimal Kool-Aid drinking. I'll probably do something like this (sketched below) over "logs, request context, metrics, and alarms" at future startups.
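For scale, a minimal version of that setup looks roughly like this (a sketch; the service name and API key are placeholders, and the endpoint/header scheme is Honeycomb's OTLP ingest, so check their docs for exact values):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Placeholder service name and API key.
provider = TracerProvider(resource=Resource.create({"service.name": "my-web-app"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(
    endpoint="https://api.honeycomb.io",
    headers={"x-honeycomb-team": "YOUR_API_KEY"},
)))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def charge_card(order_id: str):
    # One span per unit of work; attributes replace hand-rolled structured-log
    # fields, and events replace one-off log lines inside the request.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        span.add_event("card charged", {"processor": "stripe"})
```

The framework auto-instrumentation packages handle the per-request root spans, so most of the codebase never touches the tracer directly.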
I've currently done this, and I'm seriously considering undoing it in favor of some other logging solution. My biggest reason: OpenTelemetry fundamentally doesn't handle events that aren't part of a span, and doesn't handle spans that don't close. So, if you crash, you don't get telemetry to help you debug the crash.
I wish "span start" and "span end" were just independent events, and OTel tools handled and presented unfinished spans or events that don't appear within a span.
Logging solves this problem. If OTel and observability is attempting to position itself as a better alternative to logging, it needs to solve the problems that logging already solves. I'm not going to use completely separate tools for logging and observability.
Also, "crash" here doesn't necessarily mean "segfault" or equivalent. It can also mean "hang and not finish (and thus not end the span)", or "have a network issue that breaks the ability to submit observability data" (but after an event occurred, which could have been submitted if OTel didn't wait for spans to end first). There are any number of reasons why a span might start but not finish, most of which are bugs, and OTel and tools built upon it provide zero help when debugging those.
OTel logs are just your existing logs, though. If you have a way to say "whoopsie it hung" then this doesn't need to be tied to a trace at all. The only tying to a trace that occurs is when there's an active span/trace in context, at which point the SDK or agent you use will stamp the log record with that span/trace ID. Export of logs is independent of trace export and happens in separate batches.
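Roughly, in Python (a sketch; the logs SDK still lives under an experimental `_logs` module at the time of writing, so import paths may have moved):

```python
import logging

from opentelemetry import trace
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor, ConsoleLogExporter

# Logs get their own provider and their own export pipeline, separate from traces.
logger_provider = LoggerProvider()
logger_provider.add_log_record_processor(BatchLogRecordProcessor(ConsoleLogExporter()))
logging.getLogger().addHandler(LoggingHandler(logger_provider=logger_provider))

log = logging.getLogger("worker")
tracer = trace.get_tracer(__name__)

log.error("whoopsie, it hung")          # exported with no trace/span ID attached

with tracer.start_as_current_span("job"):
    log.error("failed inside the job")  # same call, now stamped with the active IDs
```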
Edit: I see you're a major Rust user! That perhaps changes things. Most users of OTel are in Java, .NET, Node, Python, and Go. OTel is nowhere near as developed in Rust as it is for these languages. So I don't doubt you've run into issues with OTel for your purposes.
Unhandled exceptions are a pretty normal one. You get kicked back out to your app's topmost level and you've lost your span. My wishlist for solving this (and I actually wrote an implementation in Python that leans heavily on reflection) is to be able to attach arbitrary data to stack frames and exceptions as they occur, merge all the data top-down, and send it up to your handler.
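The shape of the idea, greatly simplified (not the reflection-based version; this one just rides the exception object with context managers, and all names are made up):

```python
import contextlib

@contextlib.contextmanager
def error_context(**data):
    """Attach this scope's debug data to any exception passing through it."""
    try:
        yield
    except BaseException as exc:
        merged = getattr(exc, "debug_context", {})
        for key, value in data.items():
            merged.setdefault(key, value)  # the innermost scope wins on conflicts
        exc.debug_context = merged
        raise

def handle_request(user_id):
    with error_context(handler="handle_request", user_id=user_id):
        charge(user_id)

def charge(user_id):
    with error_context(step="charge"):
        raise RuntimeError("card declined")

try:
    handle_request("u-42")
except RuntimeError as exc:
    # The topmost handler sees everything attached on the way up, merged into one dict.
    print(exc.debug_context)  # {'step': 'charge', 'handler': 'handle_request', 'user_id': 'u-42'}
```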
Signal handlers are another one and are a whole other beast simply because they're completely devoid of context.
They're icky (as language design / practices) to me precisely because you end up executing context-free code. But I'd probably also just start a new trace in my signal handler / exception handler tagged with "shrug"...
How would you underengineer it? What would be a barebones setup for observability at the scale of one person with a few servers running at most a dozen different scripts?
I would like to make sure that a few recurrent jobs run fine, but by pushing a status instead of polling it.
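To be concrete, by "pushing a status" I'm imagining something as small as each job hitting a heartbeat URL when it finishes, with an alert on missed pings (the URL below is just a placeholder for whatever heartbeat endpoint you run or rent):

```python
import urllib.request

HEARTBEAT_URL = "https://hc.example.com/ping/nightly-backup"  # placeholder endpoint

def run_job():
    ...  # the actual work

if __name__ == "__main__":
    run_job()  # if this raises, no ping goes out and the missed ping becomes the alert
    urllib.request.urlopen(HEARTBEAT_URL, timeout=10)
```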
I just find logs to be infuriatingly inconsistent & poorly done. What gets logged is arbitrary as hell & has such poor odds of showing what happened.
Whereas tracing instrumentation is fantastically good at showing where response time is being spent and what's getting hit. And it comes with no developer cost; the automatic instrumentation runs & does it all.
Ideally you also throw in some additional tags onto the root/entry span or the current span. That takes some effort. But then it's consistently & widely available & visible.
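The "additional tags" part is usually just a line or two wherever the code already knows something interesting (OTel Python shown; the attribute names are made up):

```python
from opentelemetry import trace

def apply_coupon(user, coupon_code):
    # Auto-instrumentation already opened a span for this request;
    # we only enrich whichever span is currently active.
    span = trace.get_current_span()
    span.set_attribute("user.plan", user.plan)
    span.set_attribute("coupon.code", coupon_code)
```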
Tracing is very hard to get orgs culturally on board with. And there are some operational challenges, but personally I think you are way over-selling how hard it is... There is a colossal collection of software that serves as really good export destinations, and your team can probably already operate one or two of them quite well.
It does get a lot more complex if you want longer-term storage. Personally I'm pro mixing systems observability with product performance tracking, so yes, you do need to keep some data indefinitely. And that can be hugely problematic: either you juggle storage & querying for infinitely growing data, or you build systems to aggregate & persist the data & derived metrics you need while getting rid of the base data.
But I just can't emphasize enough how bad most orgs are at logs, and how not worth anyone's time it is to invest in something manual like that, which offers so much less than the alternative (traces).
HUGE +1 to mixing systems observability with product data. this is an oft-missed aspect of observability 2.0 that is increasingly critical. all of the interesting questions in software are some combination and conjunction of systems, app, and business data.
also big agree that most places are so, so, so messy and bad at doing logs. :( for years, i refused to even use the term "logs" because all the assumptions i wanted people to make were the opposite of the assumptions people bring to logs: unstructured, messy, spray and pray, etc.