Hacker News | grogers's comments

LLMs only fairly recently underwent a step change from "maybe someday" to actually useful now. That opened many new doors that people didn't even think were possible. Getting incrementally better at something they are already pretty good at isn't that impressive. But getting drastically better at something they are currently bad at will drive new models and new research.

Yeah, -O3 generally performs well in small benchmarks because of aggressive loop unrolling and inlining. But in large programs that face icache pressure, it can end up being slower. Sometimes -Os is even better for the same reason, but -O2 is usually a better default.


It'd be very hard for the compiler to enforce constant-time execution for generic code. As an example, if you wrote the naive password check that returns false at the first byte that doesn't match, is that a compiler error if it can't transform it into a constant-time version?
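
For what it's worth, here is a minimal sketch of the two versions in Go (the language choice is mine, purely for illustration; crypto/subtle is its real standard-library helper):

    package main

    import (
        "crypto/subtle"
        "fmt"
    )

    // naiveEqual returns at the first mismatching byte, so its running time
    // leaks how long the matching prefix is. This is the version a compiler
    // can't easily rewrite for you without changing observable behavior.
    func naiveEqual(a, b []byte) bool {
        if len(a) != len(b) {
            return false
        }
        for i := range a {
            if a[i] != b[i] {
                return false // early exit: timing depends on the input
            }
        }
        return true
    }

    // constantTimeEqual inspects every byte regardless of where the first
    // mismatch is, using the standard library helper.
    func constantTimeEqual(a, b []byte) bool {
        return subtle.ConstantTimeCompare(a, b) == 1
    }

    func main() {
        fmt.Println(naiveEqual([]byte("hunter2"), []byte("hunter3")))
        fmt.Println(constantTimeEqual([]byte("hunter2"), []byte("hunter3")))
    }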


Of course it depends on the situation. But I don't see how you could think that in this case, crashing is better than stale config.

Crashing on a config update is usually only done if it could cause data corruption when the configs aren't in sync. That's obviously not the case here, since the updates (although distributed in real time) are not coupled between hosts. Such systems are usually replicated state machines where config is totally ordered relative to other commands. Example: database schema changes and write operations (and even there, the way many databases are operated, the two aren't strongly coupled).


Because stale config could easily go unnoticed for a long time.

Crashing is generally better than behaving incorrectly due to stale configs, because the problem gets fixed faster.


Instead of crashing when applying the new config, it's more common to simply ignore the new config if it cannot be applied. You keep running in the last known good state. Operators then get alerts about the failures and can diagnose and resolve the underlying issue.

That's not always foolproof, e.g. a freshly (re)started process doesn't have any prior state it can fall back to, so it just hard crashes. But restarts are going to be rate limited anyway, so even then there is time to mitigate the issue before it becomes a large-scale outage.
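
A rough Go sketch of that pattern (the file name, config shape, and poll interval are all made up for the example): refuse to start without a valid config, but once running, never replace a good config with a bad one.

    package main

    import (
        "encoding/json"
        "log"
        "os"
        "time"
    )

    // Config is a hypothetical config shape, just to illustrate the
    // "keep the last known good config" pattern described above.
    type Config struct {
        MaxConns int `json:"max_conns"`
    }

    func loadConfig(path string) (*Config, error) {
        raw, err := os.ReadFile(path)
        if err != nil {
            return nil, err
        }
        var c Config
        if err := json.Unmarshal(raw, &c); err != nil {
            return nil, err
        }
        return &c, nil
    }

    func main() {
        // At startup there is no prior state to fall back to, so a bad
        // config is fatal.
        current, err := loadConfig("service.json")
        if err != nil {
            log.Fatalf("no last known good config, refusing to start: %v", err)
        }

        for range time.Tick(30 * time.Second) {
            next, err := loadConfig("service.json")
            if err != nil {
                // Keep running on the last known good config and alert
                // operators instead of crashing.
                log.Printf("ignoring bad config update, still on last known good (max_conns=%d): %v",
                    current.MaxConns, err)
                continue
            }
            current = next
        }
    }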


It sounds like part of the problem was how the application reacted to the reverted failover. They had to restart their service to get writes to be accepted, implying some sort of broken caching behavior where it kept trying to send queries to the wrong primary.

It's at least possible that this sort of aborted failover happens a fair amount, but if there's no downtime then users just try again and it succeeds, so they never bother complaining to AWS. Unless AWS is specifically monitoring for it, they might be blind to it happening.


Autocommit mode is pretty handy for ad-hoc queries at least. You wouldn't want to have to remember to close the transaction, since keeping a transaction open is often really bad for the DB.
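
A hedged sketch using Go's database/sql (the SQLite driver and table are just illustrative choices): plain Exec calls autocommit on their own, while the explicit transaction is the thing you have to remember to close.

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/mattn/go-sqlite3" // illustrative driver choice for the sketch
    )

    func main() {
        db, err := sql.Open("sqlite3", ":memory:")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Autocommit: each statement outside an explicit transaction commits
        // on its own; nothing is left open if you forget a COMMIT.
        if _, err := db.Exec(`CREATE TABLE t (id INTEGER)`); err != nil {
            log.Fatal(err)
        }
        if _, err := db.Exec(`INSERT INTO t VALUES (1)`); err != nil {
            log.Fatal(err)
        }

        // Explicit transaction: if you forget Commit/Rollback (or the ad-hoc
        // session just sits there), locks and snapshots can be held open,
        // which is the "really bad for the DB" case above.
        tx, err := db.Begin()
        if err != nil {
            log.Fatal(err)
        }
        if _, err := tx.Exec(`INSERT INTO t VALUES (2)`); err != nil {
            tx.Rollback()
            log.Fatal(err)
        }
        if err := tx.Commit(); err != nil {
            log.Fatal(err)
        }
    }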


Use after free is number 8 and 9 on the lists respectively... Not like it's way down the list. And most of the things above it (other than out-of-bounds writes/reads, which you addressed in another comment) are not things I would say a programming language can directly affect (e.g. SQL injection, CSRF)


> Not like it's way down the list

I'm not saying it's not important, but we do have to consider whether it's worth the cost.

> And most of the things above it (other than out-of-bounds writes/reads, which you addressed in another comment) are not things I would say a programming language can directly affect (e.g. SQL injection, CSRF)

It actually can.
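
One concrete example of what that can look like: whether an ecosystem's standard database API nudges you toward parameterized queries has a direct effect on how easy SQL injection is to introduce. A hedged Go sketch (the table and column names are made up):

    package main

    import (
        "database/sql"
        "fmt"
    )

    // vulnerable builds the SQL by string concatenation; a name like
    // "x'; DROP TABLE users; --" changes the meaning of the query.
    func vulnerable(db *sql.DB, name string) (*sql.Rows, error) {
        return db.Query("SELECT id FROM users WHERE name = '" + name + "'")
    }

    // safer uses a placeholder, so the driver sends the value separately
    // and it is never parsed as SQL.
    func safer(db *sql.DB, name string) (*sql.Rows, error) {
        return db.Query("SELECT id FROM users WHERE name = ?", name)
    }

    func main() {
        // No real database here; this just shows how attacker-controlled
        // input ends up inside the query text in the first version.
        name := "x'; DROP TABLE users; --"
        fmt.Println("SELECT id FROM users WHERE name = '" + name + "'")
    }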


If a node thinks it's better suited as leader, it can always force an election immediately for the next term. Things could go badly if you're wrong, though.
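
A hedged sketch of that idea with made-up types (not any real Raft library): rather than waiting out the election timeout, the node bumps its term and campaigns right away.

    package main

    import "fmt"

    // node is a toy stand-in for a Raft peer, just to illustrate the shape
    // of "force an election for the next term".
    type node struct {
        term  uint64
        state string // "follower", "candidate", or "leader"
    }

    // forceElection moves to the next term and becomes a candidate immediately.
    // If this node was wrong about being better suited, it has still disrupted
    // the current leader, which is the downside mentioned above.
    func (n *node) forceElection() {
        n.term++
        n.state = "candidate"
        // ...it would now send RequestVote RPCs for n.term to all peers...
    }

    func main() {
        n := &node{term: 7, state: "follower"}
        n.forceElection()
        fmt.Printf("term=%d state=%s\n", n.term, n.state)
    }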


There are variants that have members that can vote but can't lead, and I believe also ones where there are silent members that are aware of part of the state of the system but don't vote. Those would be particularly useful for autoscaling groups, where you're not affecting the quorum count.

I think Consul's sidecar works this way, but I've never set it up, only used it.


Real world systems often have to deviate from the "pure" version used to run formal methods on. This could be how long you keep transaction logs for, or how long rows are tombstoned for, etc. The longer the time period, the costlier it usually is, in total storage cost and sometimes performance too. So you have to compromise on where you set the time period.

Let's imagine that the process usually takes 1 minute and the tombstones are kept for 1 day. It would take something ridiculous to make the thing that usually takes 1 minute take longer than a day - not worth even considering. But sometimes there is a confluence of events that makes such a thing possible... For example, maybe the top-of-rack switch died. The server stays running, it just can't complete any upstream calls. Maybe it is continuously retrying while the network is down (or just slowly timing out on individual requests and skipping to the next one). When the network comes back up, those calls start succeeding, but now the process is far staler than you ever thought was possible or planned for. That's just one scenario, probably not exactly what happened to AWS.


In my mind, anything that has an actual time period is bound to fail eventually. Then again, I hang around QA engineers a lot, and when you hear their Selenium "wait until an element is on the page" stories, you realise it applies to software in general.

QA people deal with problems and edge cases most devs will never deal with. They're your subject-matter experts on 'what can go wrong'.

Anyway, the point is: you can't trust anything "will resolve in time period X" or "if it takes longer than X, time out". There are so many cases where this is simply not true, and it should be added to a "myths programmers believe" article if it isn't already there.


>You can't trust anything "will resolve in time period X"

As is, this statement just means you can't trust anything. You still need to choose a time period at some point.

My (pedantic) argument is that timestamps/dates/counters have a range based on the number of bits of storage they consume and the tick resolution. These can be exceeded, and it's not reasonable for every piece of software in the chain to invent a new way to store time, or counters, etc.

I've seen a fair share of issues resulting from processes with uptime of over 1 year and some with uptime of 5 years. Of course the wisdom there is just "don't do that, you should restart for maintenance at some point anyway" which is true, but it still means we are living with a system that theoretically will break after a certain period of time, and we are sidestepping that by restarting the process for other purposes.
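
One classic instance of that bits-times-tick-resolution bound, sketched in Go: a 32-bit counter ticking in milliseconds (roughly the shape of the old Windows GetTickCount) wraps after about 49.7 days of uptime.

    package main

    import (
        "fmt"
        "math"
        "time"
    )

    func main() {
        // A 32-bit counter ticking once per millisecond wraps after 2^32 ms.
        wrap := time.Duration(math.MaxUint32) * time.Millisecond
        fmt.Printf("32-bit ms counter wraps after %.1f days\n", wrap.Hours()/24)
        // Prints roughly 49.7 days.
    }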


You can have liveness without a timeout. Think about it. Say you set a timeout of 1 minute in your application to transfer 500 MB over a 100 Mbps link. This normally takes 40s, and this is that machine's sole job, so it fails fast.

One day, an operator is updating some cabling and changes you over to a 10 Mbps link for a few hours. During this time, every single one of your transfers is going to fail, even though if you were to inspect the socket, it is still making progress on the transfer.

This is why we put timeouts on the socket, not the application. The socket knows whether or not it is still alive but your application may not.
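
A hedged Go sketch of that (the endpoint and durations are made up): reset the read deadline after every successful read, so the timeout bounds a stall rather than the total transfer time.

    package main

    import (
        "fmt"
        "io"
        "net"
        "os"
        "time"
    )

    // copyWithStallTimeout fails only if no bytes arrive for `stall`,
    // not if the whole transfer takes longer than some fixed budget.
    // A slow link that is still making progress never trips it.
    func copyWithStallTimeout(dst io.Writer, conn net.Conn, stall time.Duration) (int64, error) {
        var total int64
        buf := make([]byte, 32*1024)
        for {
            // Reset the deadline before every read; as long as each read
            // returns some data within `stall`, keep going.
            if err := conn.SetReadDeadline(time.Now().Add(stall)); err != nil {
                return total, err
            }
            n, err := conn.Read(buf)
            if n > 0 {
                if _, werr := dst.Write(buf[:n]); werr != nil {
                    return total, werr
                }
                total += int64(n)
            }
            if err == io.EOF {
                return total, nil
            }
            if err != nil {
                return total, err // a timeout only if we truly stalled
            }
        }
    }

    func main() {
        // Hypothetical endpoint, just to show usage.
        conn, err := net.Dial("tcp", "example.com:80")
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            return
        }
        defer conn.Close()
        fmt.Fprint(conn, "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
        n, err := copyWithStallTimeout(io.Discard, conn, 30*time.Second)
        fmt.Println("bytes:", n, "err:", err)
    }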


Yeah... it has felt kind of ridiculous over the years how many times I have tracked some bug I was experiencing down to a timeout someone added in the code of a project I was working with. I have come to the conclusion that the fix is always to remove the timeout: the existence of a timeout is, inherently, a bug, not a feature, and if your design fundamentally relies on a timeout to function, then the design is also inherently flawed.


How would you handle the case where some web service is making calls to a 3rd party and that 3rd party is failing in unexpected ways (e.g. under high load, or IPs not answering due to routing issues), to avoid a snowball effect on your service, without using the timeout concept in any way?


You put the timeout on the socket, not your application. Your application shouldn't care how long it takes, as long as progress is being made, which the socket will know about but you won't. If you put a timeout on your application and then retry, you'll just make the problem worse. Your original packets are still in a buffer somewhere and will still be processed. Retrying won't help the situation.


The socket should also not have a timeout.


Sockets actually need a timeout because there is no signal that a client has disconnected. Eventually, maybe, a router along the path will be nice enough to send you a RST packet, but it isn’t guaranteed.
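
In Go, for example, one way to get that "is the peer still there" signal is the kernel's TCP keepalive, turned on per connection (the endpoint and period below are made up):

    package main

    import (
        "log"
        "net"
        "time"
    )

    func main() {
        conn, err := net.Dial("tcp", "example.com:80")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        tcp, ok := conn.(*net.TCPConn)
        if !ok {
            log.Fatal("not a TCP connection")
        }
        // Ask the kernel to probe an idle connection so a silently
        // disappeared peer (no FIN, no RST) is eventually detected,
        // instead of waiting on the OS defaults.
        if err := tcp.SetKeepAlive(true); err != nil {
            log.Fatal(err)
        }
        if err := tcp.SetKeepAlivePeriod(30 * time.Second); err != nil {
            log.Fatal(err)
        }
        // ... use conn as usual; reads will eventually error out if the
        // peer is gone and the keepalive probes go unanswered.
    }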


People put a lot of timeouts in code when there are humans in the loop that should handle the timeout. An outgoing socket (as is the case in this scenario) really should not have a timeout.

An incoming one might need a timeout if there is no other way to garbage collect the connection, but, if at all possible, that should usually be in the higher layers, not the lower ones.

(Maybe read my other response to the person you responded to? I purposefully gave you a really short and matter-of-fact statement that fit into the discussion from the thread more broadly.)


I'm explicitly saying not to put timeouts in code… but you must put a timeout on a socket due to the way they work. Period. Or deal with the default, which is usually many minutes. Sockets time out when packets haven't been acknowledged for a long time, but you can also set an idle timeout.

A timeout on sockets isn’t negotiable.


I continue to disagree: the socket does not need a timeout, it can simply go into an infinitely held state. Take a web browser (a very typical "outgoing socket" case): there is no value in either the browser or the socket having a timeout, as, if the user decides it takes too long, they will click Stop and/or Reload, which will close the socket. "I guess the remote side didn't send me a response packet within X seconds so I'll automatically stop the load and show the user an error" does not provide any benefit and can only lead to new failure edge cases.


I’m talking about the physical socket in the kernel here. Not a hypothetical one. You can send packets (literally pulses of electricity) down it, but you don’t know if anything happened until you get packets back. By default, this is around half an hour, basically far longer than any human would reasonably wait.

My point is, you have to set this or accept the default timeout. The default is more than reasonable, anything less than minutes — with an s — is unreasonable.


How does the timeout help? Expose the lack of progress to the user and give them a way to give up; if they choose to walk away, then you stop. The only timeout should be in the head of a human who can make real decisions about how long too long is. The real problem is knowing that if the software had waited a bit longer, it would have worked. Your timeouts just cause more busy work and are often the root cause of snowball effects.


My hypothetical pitch deck title slide: setTimeout() on a vector clock. I can hear Lamport's scream from here, and I live far away.

