Hacker News | karmarepellent's comments

This is why I've become a fan of StrictYAML [0]. Of course it is not supported by many projects, but at least you are given the option to dispense with all the unnecessary features and their associated pitfalls in the context of your own projects.

Most notably, it offers only three base types (scalar string, array, object) and pushes the work of parsing values into stronger types (such as int8 or boolean) into your codebase, where you tend to wrap values parsed from YAML in richer types anyway.

Fewer surprises and headaches, but very niche, unfortunately.
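
For illustration, a minimal sketch of what that looks like in Python, assuming the strictyaml package with schema-less parsing (every scalar comes back as a string, and the conversion to stronger types lives in your own code):

    # Schema-less strictyaml parsing: scalars come back as plain strings.
    from strictyaml import load

    doc = load("retries: 3\nverbose: yes\n").data  # {"retries": "3", "verbose": "yes"}

    # The int/bool parsing happens in your codebase, where you would
    # typically wrap config values in dedicated types anyway.
    retries = int(doc["retries"])
    verbose = doc["verbose"] in ("yes", "true", "on")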

[0] https://hitchdev.com/strictyaml/


A service that lets you sign up by uploading an SSH public key could just as well let you upload multiple public keys in your profile so you can connect from other devices.


Amazing, just like passkeys!


Biggest difference is that SSH keys allow you to store and submit the public key without the private key being present.

With passkeys, the private key must be present and usable (at least with current implementations) at the time of enrolment.

This raises a major problem: with SSH keys you can keep a backup key in a secure location (bank vault, etc.) and still be able to register it. With passkeys your backup key must be present and connected when registering it, so you can't keep it in a secure location: you always need it at registration time. This exposes both keys to risks such as hardware failure (say a faulty USB port that spikes anything plugged in with 12V: you connect your main key, it doesn't work, then you connect your backup key and the same thing happens, and by the time you realize what's going on, both your primary and backup keys are toast).


With SSH, "registering" your key on a server means having out-of-band access to copy your public key. There is no such facility if you're registering a never-before-seen user with a new key, so it makes a whole heap of sense to ensure that the credential you're registering has a working private key that exists.


The sarcasm is duly noted. But I simply answered the question. I don't have any strong opinion regarding passkeys.


This is incorrect. SSH certificates work just like x509 certificates in that regard. Also, with PubkeyAuthentication, there are all kinds of ways to collect host keys before connecting to a host for the first time, thus avoiding the trust-on-first-use problem. Especially in private networks where you control all the nodes.
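
As a rough sketch of that idea (the hostnames and keys below are made up; the known_hosts format is simply "hostname keytype base64-key" per line), host keys gathered out of band can be distributed like this:

    # Pre-populate known_hosts from host keys collected out of band,
    # e.g. from configuration management or provisioning records.
    from pathlib import Path

    # Hypothetical inventory of public host keys from a trusted source.
    inventory = {
        "build01.internal": "ssh-ed25519 AAAAC3Nza...",
        "build02.internal": "ssh-ed25519 AAAAC3Nza...",
    }

    known_hosts = Path.home() / ".ssh" / "known_hosts"
    with known_hosts.open("a") as f:
        for host, key in inventory.items():
            f.write(f"{host} {key}\n")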


SSH does have certificates, but in practice most people using SSH don't use SSH certificates and don't check the fingerprints.

Not sure if we can say it's solved if nobody wants to use it by choice (certificates are probably mostly used in enterprise setups, but in my experience it's not even that common there).


If you have a small, stable number of hosts, an SSH PKI doesn't make a lot of sense. With a large fleet, and/or if you want to tie your fleet into an OIDC IdP, certificates are pretty common; the most common way of solving this problem, I think?


I think it's the case in big companies. But most companies are not big :-), which means that a lot of people are using SSH without ever checking the fingerprint. That would be my intuition.


SSH has always relied on key continuity for this problem; you're exposed when you're first introduced to a host (on a particular client) but then fine from that point on.

This of course breaks down with cattle fleets where ~most logins are to hosts you've never hit before, which is why cattle fleets tend to use SSH PKI.


Over the years I have seen - repeatedly - colleagues just removing ~/.ssh/known_hosts when SSH showed the warning that says something like "YOU MAY HAVE BEEN HACKED!!!".

I think passkeys resolve that, even though it's more of a human issue than a technical issue :-).


When I connect to GitHub using SSH, I have to google the GitHub page with the SSH fingerprints and verify the key by hand. Imagine how many people actually do that, instead of blindly accepting the key.

If GitHub can't get it right, nobody can.


> Signing in is cryptographically signing a commitment to the current ephemeral tunnel.

I can see how SSH could be used for authentication on the web. And I have no doubt that it would be sound out-of-the-box. But I am not sure what you mean by your last sentence. Do you mean that authentication targets are gated and only reachable by establishing a tunnel via some kind of forwarding?

Aside from the wonderful possibilities that are offered by using port forwarding of some kind, you could also simply use OpenSSH's ForceCommand to let users authenticate via SSH and then return a short-lived token that can be used to log into an application (or even an SSO service).
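
A rough sketch of that ForceCommand idea (the token scheme and paths below are entirely hypothetical): point sshd_config at a script via something like ForceCommand /usr/local/bin/issue-token, and have that script mint a short-lived token once the key-based login has succeeded:

    #!/usr/bin/env python3
    # Hypothetical ForceCommand target: mint a short-lived HMAC token
    # after a successful SSH login; the application that later verifies
    # the token shares the same secret.
    import base64, hashlib, hmac, os, time

    SECRET = b"shared-with-the-web-app"  # assumption: distributed out of band
    TTL = 300                            # seconds the token stays valid

    user = os.environ.get("USER", "unknown")
    expiry = int(time.time()) + TTL
    payload = f"{user}:{expiry}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    token = base64.urlsafe_b64encode(payload + b"." + sig).decode()

    print(f"Log in within {TTL} seconds using this token: {token}")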

I guess no one uses SSH for authentication in this way because it is non-standard and kind of shuts out non-technical people.


> authentication targets are gated and only reachable by establishing a tunnel via some kind of forwarding?

No, it's just how you authenticate with signing keys. Given that a secure channel has been set up with ephemeral keys, you can sign a commitment to the channel (like the hash of the shared secret key) to prove who you are to the other party.
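
For illustration, a minimal sketch of that kind of channel binding with an Ed25519 identity key (using the pyca/cryptography package; the shared secret below is a stand-in for whatever ephemeral material the tunnel actually negotiated):

    # Prove possession of a long-lived identity key by signing a
    # commitment to the ephemeral channel.
    import hashlib
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    identity_key = Ed25519PrivateKey.generate()  # normally loaded from disk or a token
    shared_secret = b"ephemeral-material-from-the-key-exchange"  # hypothetical

    # Commit to this particular channel, not just to "being logged in".
    channel_binding = hashlib.sha256(shared_secret).digest()
    signature = identity_key.sign(channel_binding)

    # The other party verifies with the enrolled public key; replaying the
    # signature on a different channel fails because the binding differs.
    identity_key.public_key().verify(signature, channel_binding)  # raises on mismatch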

> let users authenticate via SSH and then return a short-lived token that can then be used to log into an application (or even a SSO service)

This is exactly what I recommend. If everyone did this, then eventually the browsers or 1password could support it.


The thing is, if you want to use SSH with a secure element, suddenly you're using FIDO2, right? OpenSSH already supports it.

And WebAuthn is using FIDO2, it's not that different, it's just that WebAuthn adds some stuff like a relying party.


It's the stuff it adds that most people object to.


It feels more like people object for the sake of objecting. I often have this feeling: people first don't care (about security, about renewable energy, ...), and the moment they start caring, there is a high risk that they will just object to everything, sometimes without good reasons.

Sure, we are being abused by TooBigTech and surveillance capitalism. It doesn't mean that all security is bad. Security is a compromise. Yet many people go "this added security comes from a government/TooBigTech so it proves that it is a lie". Which is wrong: it doesn't prove it. Sometimes there are good things coming from governments/TooBigTech.

The world is more nuanced than people seem to realise.


Not just non-technical people, but a lot of Windows developers I've worked with over the years can't seem to grasp the asymmetric key concept well enough to use it for git (and then complain about git over https).

Being in charge of the strength and security of your private key is something most people don't want to do, so with passkeys we get multiple identities made "easy" by walled gardens, and that is what gets popular.


I'm curious to know if people see this as a viable alternative to a PXE installation, especially when it comes to the deployment of large-ish (possibly air-gapped) clusters.


The answer file approach is common with automated Windows installations: https://learn.microsoft.com/en-us/windows-hardware/manufactu...

I can see the value for both air-gapped deployments and testing Proxmox itself using nested virtualization.


It's a matter of evaluating what kind of infrastructure your application needs to run on. There are certainly mission-critical systems where even a sliver of downtime causes real damage, like lost revenue. If you come to the conclusion that this application and everything it involves had better run on k8s for availability reasons, you should probably focus on that and code your application in a k8s-friendly manner.

But there are tons of applications that run on over-engineered cloud environments that may or may not involve k8s and probably cost more to operate than they need to. I use some tools every day where a daily 15 min downtime would not affect me or my work in the slightest. I am not saying this would be desirable per se. It's just that a lot of people (myself included) are happy to spend an hour of their work day talking to colleagues and drinking coffee, but a 15 min downtime of some tool is seen as an absolute catastrophe.


Agreed. The best thing we did back when we ran k8s clusters was moving a few stateful services to dedicated VMs and keeping the clusters for stateless services (the bulk) only. Running k8s for stateless services was absolute bliss.

At that time stateful services were somewhat harder to operate on k8s because statefulness (and all that it encapsulates) was kinda full of bugs. That may certainly have changed over the last few years. Maybe we just did it wrong. In any case if you focused on the core parts of k8s that were mature back then, k8s was (and is) a good thing.


I think the value proposition holds when you are just getting started with your company and you happen to employ people that know their way around the hyperscaler cloud ecosystems.

But I agree that moving your own infra or outsourcing operations when you have managed to do it on your own for a while is most likely misguided. Speaking from experience, it introduces costs that cannot possibly be calculated before the fact, and thus it always ends up more complicated and costlier than the suits imagined.

In the past, when similar decisions were made, I always thought to myself: you could have just hired one more person to bring their own, fresh perspective on what we are doing in order to improve our ops game.


Oh, I've seen this before and it's true in an anecdotal sense for me. One reason why is that they always think of hiring an additional developer as a cost, never savings.


We ran only two (very small) clusters for some time in the past and even then it introduced some unnecessary overhead on the ops side and some headaches on the dev side. Maybe they were just growing pains, but if I have to run Kubernetes again I will definitely opt for a single large cluster.

After all Kubernetes provides all the primitives you need to enforce separation. You wouldn't create separate VMWare production and test clusters either unless you have a good reason.
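
As a rough sketch of what that separation can look like in a single cluster (assuming the official kubernetes Python client and a working kubeconfig; the quota numbers are made up):

    # Per-environment separation inside one cluster: a namespace plus a
    # resource quota per environment, so staging cannot starve prod.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    for env in ("prod", "staging"):
        core.create_namespace(
            client.V1Namespace(metadata=client.V1ObjectMeta(name=env))
        )
        core.create_namespaced_resource_quota(
            namespace=env,
            body=client.V1ResourceQuota(
                metadata=client.V1ObjectMeta(name=f"{env}-quota"),
                spec=client.V1ResourceQuotaSpec(
                    hard={"requests.cpu": "8" if env == "prod" else "2",
                          "requests.memory": "16Gi" if env == "prod" else "4Gi"},
                ),
            ),
        )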


You need a separate cluster for production because there are operations you'd do in your staging/QA environments that might accidentally knock out your cluster. I did that once and it was not fun.

I completely agree with keeping everything as simple as possible though. No extra clusters if not absolutely necessary, and also no extra namespaces if not absolutely necessary.

The thing with Kubernetes is that it was designed to support every complex situation imaginable. All these features make you feel as though you should make use of them, but you shouldn't. This complexity leaked into systems like Helm, which is why, in my opinion, it's better to roll your own deployment scripts rather than use Helm.
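
As a rough sketch of what "roll your own deployment scripts" can mean (hypothetical layout: plain manifest templates with $-placeholders plus a couple of per-environment values):

    # Render plain manifest templates and pipe them into kubectl apply,
    # instead of maintaining a Helm chart.
    import pathlib, string, subprocess, sys

    env = sys.argv[1]        # e.g. "staging" or "prod"
    values = {"image_tag": sys.argv[2],
              "replicas": "2" if env == "prod" else "1"}

    for manifest in sorted(pathlib.Path("manifests").glob("*.yaml")):
        rendered = string.Template(manifest.read_text()).substitute(values)
        # kubectl apply -f - reads the rendered manifest from stdin.
        subprocess.run(["kubectl", "apply", "-f", "-"],
                       input=rendered.encode(), check=True)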


Do you mind sharing what these operations were? I can think of a few things that may very well brick your control plane. But at the very least existing workloads continue to function in this case as far as I know. Same with e.g. misconfigured network policies. Those might cause downtimes, but at least you can roll them back easily. This was some time ago though. There may be more footguns now. Curious to know how you bricked your cluster, if you don't mind.

I agree that k8s offers many features that most users probably don't need and may not even know of. I found that I liked k8s best when we used only a few, stable features (only daemonsets and deployments for workloads, no statefulsets) and simple helm charts. Although we could have probably ditched helm altogether.


You can’t roll back an AWS EKS control plane version upgrade. “Measure twice, cut once” kinda thing.

And operators/helm charts/CRDs use APIs which can be and are deprecated, which can cause outages. It pays to make sure your infrastructure is automated with GitOps, CI/CD, and thorough testing so you can identify the potential hurdles before your cluster upgrade causes unplanned service downtime.

It is a huge effort just to "run in place" with the current EKS LTS versions if your company has lots of 3rd-party tooling (like K8s operators) installed and there isn't sufficient CI/CD and testing to validate potential upgrades soon after they are released.

3rd-party tooling is frequently run by open source teams, so they don't always have the resources or desire/alignment to stay compatible with the newest version of K8s. Also, when a project goes idle, disbands, or fractures into rival projects, it can cost infra/ops teams time to evaluate the replacement/substitute projects that will be a better solution going forward. We recently ran into this with the operator we had originally installed to run Cassandra.


In my case, it was the ingress running out of subdomains because each staging environment would get its own subdomain, and our system had a bug that caused them to not be cleaned up. So the CI/CD was leaking subdomains, eventually the list became too long and it bumped the production domain off the list.


Kubernetes upgrades? Don't those risk bricking everything with just 1 environment?


In theory: absolutely. This is just anecdata and you are welcome to challenge me on it, but I have never had a problem upgrading Kubernetes itself. As long as you trail one version behind the latest to ensure critical bugs are fixed before you risk running into them yourself, I think you are good.

Edit: To expand on it a little bit. I think there is always a real, if theoretical, risk that must be taken into account when you design your infrastructure. But when experience tells you that accounting for this potential risk may not be worth it in practice, you might get away with discarding it and keeping your infra lean. (Yes, I am starting to sweat just writing this.)


"I am cutting this corner because I absolutely cannot make a business case I believe in for doing it the hard (but more correct) way but believe me I am still going to be low key paranoid about it indefinitely" is an experience that I think a lot of us can relate to.

I've actually asked for a task to be reassigned to somebody else before now on the grounds that I knew it deserved to be done the simple way but could not for the life of me bring myself to implement that.

(the trick is to find a colleague with a task you *can* do that they hate more and arrange a mutually beneficial swap)


Actually I think the trick is to change one's own perspective on these things. Regardless of how many redundancies and how many 9's of availability your system theoretically achieves, there is always stuff that can go wrong for a variety of reasons. If things go wrong, I am faster at fixing a not-so-complex system than a more complex one that should, in theory, be more robust.

Also I have yet to experience that an outage of any kind had any negative consequences for me personally. As long as you stand by the decisions you made in the past and show a path forward, people (even the higher-ups) are going to respect that.

Anticipating every possible issue that might or might not occur during the lifetime of an application just leads to over-engineering.

I think rationalizing it a little bit may also help with the paranoia.


At my last job we had a Kubernetes upgrade go so wrong we ended up having to blow away the cluster and redeploy everything. Even a restore of the etcd backup didn't work. I couldn't tell you exactly what went wrong, as I wasn't the one that did the upgrade, and I wasn't around for the RCA on this one. As the fallout was the straw that broke the camel's back, I ended up quitting to take a sabbatical.


Why would those brick everything? You upgrade nodes one by one and take it slow, so issues will become apparent after each upgrade and you have time to solve them; that's the whole point of having clusters comprised of many redundant nodes.


I think it depends on the definition of "bricking the cluster". When you start to upgrade your control plane, your control plane pods restart one after another, and not only those on the specific control plane node. So at this point your control plane might not respond anymore if you happen to run into a bug or some other issue. You might call that "bricking the cluster", since it is not possible to interact with the control plane for some time. Personally I would not call it "bricked", since your production workloads on worker nodes continue to function.

Edit: And even when you "brick" it and cannot roll back, there is still a way to bring your control plane back by using an etcd backup, right?


Not sure if this has changed, but there have been companies admitting to simply nuking Kubernetes clusters when they fail, because it does happen. The argument, which I completely believe, is that it's faster to build a brand new cluster than to debug a failed one.


I had this happen on a small scale and it scared me a lot. It felt like your executable suddenly falling apart and you now need to fix it in assembly. My takeaway was that the k8s abstraction is way leakier than it is made out to be


I have use cases for both approaches (letting a reverse proxy handle TLS, letting the application listen on an external socket and handling TLS in the application).

I find it is easier to configure an application with a reverse proxy in front when different paths require e.g. different cache-control response headers. At the end of the day I do not want to replicate all the logic that nginx (and others) already provide when it integrates well with the application behind it.

Other commenters suggest that both ways (with or without an additional reverse proxy) add "tons of complexity". I don't see why. Using a reverse proxy is what we have done for a while now. Installation and configuration (with a reasonable amount of hardening) is not complex, and there are plenty of resources to make it easier. And leaving the reverse proxy out and handling TLS in the application itself should not be "complex" either. Just parse a certificate and private key and supply them to whatever web framework you happen to use.
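
To illustrate the "just parse a certificate and private key" path, a minimal sketch with Python's standard library (the file paths are hypothetical):

    # Terminate TLS in the application itself, without a reverse proxy.
    import http.server
    import ssl

    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile="server.crt", keyfile="server.key")

    server = http.server.HTTPServer(("0.0.0.0", 8443),
                                    http.server.SimpleHTTPRequestHandler)
    server.socket = ctx.wrap_socket(server.socket, server_side=True)
    server.serve_forever()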


And implement cert reloading if your application reaches any kind of respectable uptime.
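
A sketch of one way to handle that, building on the SSLContext above (assumption: calling load_cert_chain again on the same context makes new handshakes use the renewed certificate):

    # Watch the certificate file and reload it in place, so a long-running
    # process picks up renewals without a restart. Run this in a daemon
    # thread next to serve_forever().
    import os, time

    def watch_and_reload(ctx, certfile="server.crt", keyfile="server.key", interval=60):
        last_mtime = os.stat(certfile).st_mtime
        while True:
            time.sleep(interval)
            mtime = os.stat(certfile).st_mtime
            if mtime != last_mtime:
                ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
                last_mtime = mtime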

