* for client-side load balancing, it's entirely possible to move active healthchecking into a dedicated service and have its results be vended along with discovery. In fact, more managed server-side load balancers are also moving healthchecking out of band so they can scale the forwarding plane independently of probes.
* for server-side load balancing, it's entirely possible to shard forwarders to avoid SPOFs, typically by creating isolated increments and then using shuffle sharding by caller/callee to minimize overlap between workloads. I think Alibaba's canalmesh whitepaper covers such an approach.
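To make the shuffle sharding idea concrete, here's a minimal sketch. The forwarder names, the `caller->callee` key format, and the shard size are all made up for illustration, and this isn't taken from the canalmesh paper or any particular implementation. Each caller/callee pair deterministically gets a small pseudo-random subset of the forwarder fleet, so two workloads rarely share their whole shard.

```python
import hashlib
import random


def shuffle_shard(key: str, forwarders: list[str], shard_size: int) -> list[str]:
    """Deterministically pick `shard_size` forwarders for a caller/callee key."""
    # Seed a private RNG from a stable hash so every node computes the same shard
    # for the same key without coordinating.
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return sorted(rng.sample(forwarders, shard_size))


# Hypothetical fleet of 16 forwarders, 4 per shard.
forwarders = [f"fwd-{i}" for i in range(16)]
shard_a = shuffle_shard("checkout->payments", forwarders, 4)
shard_b = shuffle_shard("search->catalog", forwarders, 4)

# Forwarders that both workloads depend on; a bad request pattern from one
# workload can only take out this overlap for the other.
overlap = set(shard_a) & set(shard_b)
```

With 16 forwarders and shards of 4 there are 1820 possible shards, so a full overlap between two workloads is unlikely, though a partial overlap is common.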
As for scale, I think for almost everybody it's completely overblown to go with a p2p model. I think a reasonable estimate for a centralized proxy fleet is about 1% of infrastructure costs. If you want to save that, you need to have a team that can build/maintain your centralized proxy's capabilities in all the languages/frameworks your company uses, and you likely need to build the proxy anyway for the long tail. Whereas you can fund a much smaller team to focus on e2e ownership of your forwarding plane.
Add on top that you need a safe deployment strategy for updating the critical logic in all of these combinations, and continuous deployment to ensure your fixes roll out to the fleet in a timely fashion. This is itself a hard scaling problem.
For client-side LB, wouldn't moving active healthchecking out into a dedicated service create more reliability issues, with one more service to worry about? Are there any examples of this approach being used in the industry?
IME you end up with both; something like discrete client, LB, and controller. You can’t rely on any one component to “turn itself off.” E.g. a client or LB can easily get into a “wedged” state where it’s unable to take itself out of consideration for traffic. For example, I’ve had silly incidents based on BGP routes staying up, memory errors/pressure preventing new health check results from being parsed, file systems going read-only, SKB pressure interfering with pipes, and of course, the classic difference between a dedicated health check endpoint versus actual traffic. In all those examples, the failure prevents the client or LB from removing itself from the traffic path.
An external controller is able to safely remove traffic from one of the other failed components. In addition the client can still do local traffic analysis, or use in band signaling, to identify anomalous end points and remove itself or them from the traffic path.
Good active probes are actually a pretty meaningful traffic load. It was a HUGE problem for flat virtual network models like Heroku a decade ago. This is exacerbated when you have more clients and more endpoints.
As a reference, this distributed model is what AWS moved to 15 years ago. And if you look at any of the high-throughput cloud services or CDNs, they’ll have a similar model.
From a dataplane perspective, it does mean your healthchecks are running from a different location than your proxy. So there are risks where routability is impacted for proxy -> dest but not for healthchecker -> dest.
For general reliability, you can create partitions of checkers and use quorum across partitions to determine what the health state is for a given dest. This also enables centralized monitoring to detect systemic issues with bad healthcheck configuration changes (i.e. are healthchecks failing because the service is unhealthy or because of a bad healthchecker?)
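A rough sketch of the quorum idea, assuming each checker partition reports a simple pass/fail per dest. The partition names, the quorum size, and the 0.8 disagreement threshold are hypothetical knobs I picked for the example, not values from any real system. The second function is the "is it the service or the healthchecker?" question: a partition that disagrees with its peers on nearly every dest is probably the broken one.

```python
from collections import Counter


def quorum_health(votes: dict[str, bool], quorum: int) -> str:
    """votes maps checker partition -> whether its probe of the dest passed."""
    counts = Counter(votes.values())
    if counts[True] >= quorum:
        return "healthy"
    if counts[False] >= quorum:
        return "unhealthy"
    return "unknown"  # not enough agreement either way; keep the last known state


def suspect_partitions(results: dict[str, dict[str, bool]],
                       threshold: float = 0.8) -> list[str]:
    """results maps dest -> {partition: passed}. Flag partitions that disagree
    with the majority of their peers on at least `threshold` of dests."""
    disagreements: Counter = Counter()
    seen: Counter = Counter()
    for votes in results.values():
        for partition, vote in votes.items():
            others = [v for p, v in votes.items() if p != partition]
            passed = sum(others)
            if not others or passed * 2 == len(others):
                continue  # no peers, or peers are tied: no majority to compare to
            majority = passed * 2 > len(others)
            seen[partition] += 1
            if vote != majority:
                disagreements[partition] += 1
    return sorted(p for p in seen if disagreements[p] / seen[p] >= threshold)


# Partition "d" fails every dest while its peers pass them all.
results = {
    dest: {"a": True, "b": True, "c": True, "d": False}
    for dest in ("dest-1", "dest-2", "dest-3")
}
state = quorum_health(results["dest-1"], quorum=3)
suspects = suspect_partitions(results)
```

Here the dests stay in rotation because three of four partitions agree, and the lone dissenter gets surfaced for a human (or automation) to look at.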
In industry, I personally know AWS has one or two health-check-as-a-service systems that they are using internally for LBs and DNS. Uber runs its own health-check-as-a-service system which it integrates with its managed proxy fleet as well as p2p discovery. IIRC Meta also has a system like this for at least some things? But maybe I'm misremembering.
They invented a language to avoid you imperatively updating infrastructure, but that's not what CDKTF does; it just makes it easier to materialize that declarative output.
It also makes it easier to reason about that output as you can avoid awkward iteration in your declarative spec.
my expectation is that they would either sell crucial RAM at such a low volume and/or such a high price that it would do more damage to the brand than sunsetting it and returning to it when the slowdown occurs.
litestream makes very few consistency guarantees compared to other datastores, and so I would expect most any issues found would be "working as intended".
at the end of the day with litestream, when you respond back to a client with a successful write you are only guaranteeing a replication factor of 1.
By "replication factor of 1" you mean your data is stored on local disk only, right? That matches my understanding: Litestream replication is asynchronous, so there's usually a gap of a second or two between your write being accepted and the resulting updated page being pushed off to S3 or similar.
Yes. the acknowledgement you're getting in your application code is that the data was persisted in sqlite on that host. There's no mechanism to delay acknowledgement until the write has been asynchronously persisted elsewhere.
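Here's a toy model of that gap. To be clear, this is not how Litestream is implemented and it uses none of its APIs: the `remote` dict is a stand-in for S3, and `replicate()` stands in for whatever background process ships changes later. It only shows that the ack comes from the local SQLite commit, before any second copy exists.

```python
import sqlite3


class AsyncReplicatedDB:
    """Toy model: the local commit is acknowledged before any remote copy exists."""

    def __init__(self) -> None:
        self.local = sqlite3.connect(":memory:")
        self.local.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
        self.remote: dict[str, str] = {}         # stand-in for S3 or similar
        self.pending: list[tuple[str, str]] = []  # changes not yet shipped

    def write(self, k: str, v: str) -> str:
        with self.local:  # commits the transaction on exit
            self.local.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (k, v))
        self.pending.append((k, v))
        return "ack"  # the caller sees success here: replication factor of 1

    def replicate(self) -> None:
        """Runs some time later, outside the write path."""
        while self.pending:
            k, v = self.pending.pop(0)
            self.remote[k] = v


db = AsyncReplicatedDB()
ack = db.write("order-1", "paid")
replicated_at_ack = "order-1" in db.remote   # the window where losing the host loses the write
db.replicate()
replicated_later = "order-1" in db.remote
```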
I wonder if it would be possible to achieve this using a SQLite VFS extension - maybe that could block acknowledgment of a write until the underlying page has been written to S3?
> the last place couldn’t because datadog apparently bills sidecar containers as additional hosts so using sidecar proxy would have doubled our datadog bill.
the problem is that they want to apply a number of stateful/lookaside load balancing strategies, which become more difficult to do in a fully decentralized system. it’s generally easier to asynchronously aggregate information and either decide routing updates centrally or redistribute that aggregate to inform local decisions.
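As a sketch of the "aggregate asynchronously, then redistribute" option: assume each client periodically reports its in-flight request count per backend (the report shape and the inverse-load weighting are my own simplifications, not any specific product's algorithm). A central aggregator sums the reports and hands back weights that clients can use for local picks.

```python
from collections import defaultdict


def aggregate(reports: list[dict[str, int]]) -> dict[str, int]:
    """Sum per-backend in-flight counts across all client reports."""
    totals: dict[str, int] = defaultdict(int)
    for report in reports:
        for backend, inflight in report.items():
            totals[backend] += inflight
    return dict(totals)


def routing_weights(totals: dict[str, int]) -> dict[str, float]:
    """Weight each backend inversely to its aggregate load, normalized to sum to 1."""
    inverse = {backend: 1.0 / (1 + load) for backend, load in totals.items()}
    norm = sum(inverse.values())
    return {backend: w / norm for backend, w in inverse.items()}


# Two hypothetical clients reporting on backends "a" and "b".
reports = [{"a": 3, "b": 0}, {"a": 1, "b": 1}]
totals = aggregate(reports)
weights = routing_weights(totals)
```

No single client can see that "a" is carrying four in-flight requests fleet-wide; the aggregate can, which is why this kind of strategy gets harder in a fully decentralized setup.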
My cheeky answer to "how should this be regulated?" is that sports betting isn't materially different from other high-risk private investments, so it should only be available to accredited investors. Imagine if FanDuel/DraftKings had to verify assets and income before taking a single bet?!
"Estimates of the scope of illegal sports betting in the United States range anywhere from $80 billion to $380 billion annually, making sports betting the most widespread and popular form of gambling in America."
"AGA’s report estimates that Americans wager $63.8 billion with illegal bookies and offshore sites at a cost of $3.8 billion in gaming revenue and $700 million in state taxes. With Americans projected to place $100 billion in legal sports bets this year, these findings imply that illegal sportsbook operators are capturing nearly 40 percent of the U.S. sports betting market."
I think what would be more interesting to me is estimates on the unique number of citizens betting. Is it up? If so, how appreciably?
the article is a bit breathless, which seems par for the course for security blogs these days. And while "containers are not a security boundary" is evergreen and something AWS has been trumpeting since the beginning, they IMO should also try and make it a bit harder for you to get access to the host credentials.
I do know the ECS team highly indexes on maintaining backwards compatibility and minimizing migrations wherever possible, but this seems like a case where it's warranted.