We're running Aurora PostgreSQL 15.12, which includes the fix mentioned in the release notes. Looking at this comment and the AWS documentation, I think there's an important distinction to make about what was actually fixed in Aurora PostgreSQL 15.12.4. Based on our experience and analysis, we believe AWS's fix primarily focused on data protection rather than eliminating the race condition itself.
Here's what we think is happening:
Before the fix (pre-15.12.4):
1. Failover starts
2. Both instances accept and process writes simultaneously
3. Failover eventually completes after the writer steps down
4. Result: Potential data consistency issues ???
After the fix (15.12.4+):
1. Failover starts
2. If the old writer doesn't demote before the new writer is promoted, the storage layer now detects this and rejects write requests
3. Both instances restart/crash
4. Failover fails or requires manual intervention
The underlying race condition between writer demotion and reader promotion still exists - AWS just added a safety mechanism at the storage layer to prevent the dangerous scenario of two writers operating simultaneously. They essentially converted a data inconsistency risk into an availability issue.
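A minimal sketch of what that kind of storage-layer fencing could look like, using a generation counter - this is just our guess at the concept, not Aurora's actual implementation:

```java
// Sketch of generation-based write fencing at the storage layer.
// Hypothetical names and types - a guess at the concept, not Aurora's code.
import java.util.concurrent.atomic.AtomicLong;

class StaleWriterException extends RuntimeException {
    StaleWriterException(String message) { super(message); }
}

class FencedStorage {
    // The storage layer tracks the newest writer "generation" it has seen.
    private final AtomicLong currentGeneration = new AtomicLong(0);

    // Promoting a new writer bumps the generation.
    long promoteNewWriter() {
        return currentGeneration.incrementAndGet();
    }

    // Every write carries the generation of the instance that issued it.
    void write(long writerGeneration, byte[] page) {
        long current = currentGeneration.get();
        if (writerGeneration < current) {
            // The old writer never stepped down but a newer writer exists:
            // reject the write instead of letting two writers mutate the volume.
            throw new StaleWriterException("write from generation " + writerGeneration
                    + " rejected; current generation is " + current);
        }
        // ... apply the write to storage ...
    }
}
```

In this model the race can still happen - the old writer can still attempt a write after the new writer is promoted - but the blast radius is a rejected write (and presumably a crash/restart) rather than two writers diverging.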
This would explain why we're still seeing failover failures on 15.12 - the race condition wasn't eliminated, just made safer.
The comment in the release notes about "fixed a race condition where an old writer instance may not step down" is somewhat misleading - it's more accurate to say they "mitigated the consequences of the race condition" by having the storage layer reject writes when it detects the problematic state. That is probably also why AWS Support did not point us to this release when we raised the issue.
AWS Support initially pushed back and suggested it was because of high replication lag, but they were looking at metrics that were more than 24 hours old. What kind of failure did you encounter? I really want to understand what edge case we triggered in their failover process - especially since we could not reproduce it in other regions.
I'm surprised this hasn't come up more often too. When we worked with AWS on this, they confirmed there was nothing unique about our traffic pattern that would trigger this issue. We also didn't run into this race condition in any of our other regions running similar workloads. What's particularly concerning is that this seems to be a fundamental flaw in Aurora's failover mechanism that could theoretically affect anyone doing manual failover.
While it's tough if you want new drives, I've found I can frequently get used drives on eBay from models with a significant track record in Backblaze's report. Despite the increased risk that comes with used drives, I've found they still end up more reliable than buying random new drives.
There is a 3-part hash going on: an Origin ID hash, a URL hash, and then an MD5 on the actual payload. When a new asset is registered on the mesh, the Edgemesh backplane downloads the asset directly to confirm the MD5. If it doesn't match, it won't allow the asset to register. On a replication, the destination node receives the asset and recalculates the MD5. If the MD5 doesn't match, it signals Edgemesh, which then takes that (source) node out of the mesh. E.g. if you modify an asset and attempt to replicate it, the receiving party will invalidate the object and signal back to Edgemesh. Replication directions come from the Edgemesh backplane. PM me if you'd like to go into this in more detail.
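For clarity, a rough sketch of the two MD5 checks described above (registration against origin, recalculation on replication) - the class and method names are mine, not Edgemesh's:

```java
// Rough sketch of the two checks - names are mine, not Edgemesh's.
import java.net.URI;
import java.security.MessageDigest;
import java.util.HexFormat;

class AssetVerifier {
    static String md5Hex(byte[] payload) throws Exception {
        return HexFormat.of().formatHex(MessageDigest.getInstance("MD5").digest(payload));
    }

    // Registration: the backplane downloads the asset from origin itself and only
    // registers it if the submitted MD5 matches what origin actually serves.
    static boolean register(String submittedMd5, URI originUrl) throws Exception {
        try (var in = originUrl.toURL().openStream()) {
            return md5Hex(in.readAllBytes()).equalsIgnoreCase(submittedMd5);
        }
    }

    // Replication: the destination node recalculates the MD5 on the payload it
    // received; a mismatch gets signalled back so the source node can be ejected.
    static boolean acceptReplica(String registeredMd5, byte[] receivedPayload) throws Exception {
        return md5Hex(receivedPayload).equalsIgnoreCase(registeredMd5);
    }
}
```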
> In 1996, Dobbertin announced a collision of the compression function of MD5 (Dobbertin, 1996). While this was not an attack on the full MD5 hash function, it was close enough for cryptographers to recommend switching to a replacement, such as SHA-1 or RIPEMD-160.
:) You're dead right and it's why we use it inside two other top level hashes (e.g. you'd need to collide inside the OriginID space as well). It's certainly possible though (for extremely large sites) and we're experimenting with an xxHash64 implementation for a later release.
>if you modify an asset and attempt to replicate it - the receiving party will invalidate the object and signal back to Edgemesh
If I understand your explanation correctly, the receiving party will invalidate the object if the MD5 of the object doesn't match the advertised MD5? That would leave you open to people serving other objects with the same MD5 hash as the original.
You can, but our backplane won't know about your local modifications. When your client informs the backplane (on a sync), it will see that those IDs and hashes weren't registered and it will instruct your client to delete them.
E.g. modifications that happen in your local instances are checked against our backplane. If an asset hasn't been registered (and verified independently via our backplane), it won't be available for replication.
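A small sketch of that sync-time check, under the assumption that the backplane keeps a map of independently verified asset IDs to registered hashes (my naming, not Edgemesh's):

```java
// Sketch of the sync-time check - assumes the backplane keeps a map of
// independently verified asset IDs to their registered hashes (my naming).
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SyncCheck {
    private final Map<String, String> registered; // assetId -> verified hash

    SyncCheck(Map<String, String> registered) { this.registered = registered; }

    // The client reports what it holds locally; anything whose hash was never
    // registered (e.g. a local modification) comes back as a delete instruction.
    List<String> assetsToDelete(Map<String, String> reportedByClient) {
        List<String> toDelete = new ArrayList<>();
        reportedByClient.forEach((assetId, hash) -> {
            String expected = registered.get(assetId);
            if (expected == null || !expected.equalsIgnoreCase(hash)) {
                toDelete.add(assetId);
            }
        });
        return toDelete;
    }
}
```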
I'm working on a platform (Peerweb) similar to the product being discussed, and I think I've put more thought into the security and autonomous self-policing aspects of P2P CDNs. I don't rely on MD5, and I put careful thought into the PKI I designed.
Also, my platform can offload all assets including the page itself and enables sites to get free failover during content server downtime. Due to my DNS-seeded PKI, your users stay secure and content continues to be correctly authenticated in your P2P CDN cache even when your site would normally be down.
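To make "DNS-seeded PKI" a bit more concrete, here is a simplified illustration of the idea - a public key the site publishes out of band (e.g. via a DNS TXT record) is used to verify a signature over cached content, so peers can keep authenticating it while the origin is down. This is a sketch of the concept, not Peerweb's actual implementation:

```java
// Simplified illustration: verify cached content against a signature, using a
// public key seeded out of band (e.g. from a DNS TXT record the site controls).
// Hypothetical sketch of the concept, not Peerweb's actual code.
import java.security.KeyFactory;
import java.security.PublicKey;
import java.security.Signature;
import java.security.spec.X509EncodedKeySpec;
import java.util.Base64;

class SignedAssetVerifier {
    static boolean verify(String dnsSeededKeyB64, byte[] content, byte[] signature) throws Exception {
        PublicKey key = KeyFactory.getInstance("RSA")
                .generatePublic(new X509EncodedKeySpec(Base64.getDecoder().decode(dnsSeededKeyB64)));
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(key);
        verifier.update(content);
        // Peers can keep serving and authenticating cached content with this
        // check even while the origin site is unreachable.
        return verifier.verify(signature);
    }
}
```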
Ah I see, I forgot that in the SSL attack the attacker had to choose both certificate prefixes as opposed to just one. Thanks!
It does seem to me, though, that if I could coerce/direct the site into accepting one image that I created, I could manage to replicate a second, different file throughout the network. Obviously assuming I computed both images ahead of time and both image formats were unperturbed by the nonsense appended to the file by the attack.
When you register a new asset, the Edgemesh backend downloads it from origin itself to validate the hash you've calculated. And on replication the destination recalculates it on the payload (to make sure the asset replicated correctly).
Right. So let's say we have file A, which is an innocuous image file, and file A', which is a malicious image file, where MD5(A) == MD5(A'). Based on the MD5 prefix collision attack, I should be able to construct two such files A and A'.
I get an edgemesh site to accept file A (perhaps the site allows me to upload a user avatar, upload an image on a forum, etc). I then behave as a node in the mesh, and receive file A. When I get a request to replicate file A to someone else, I send them file A', they check the MD5 hash, and the hash matches. Not seeing how that doesn't work?
It is admittedly a narrow attack, but I think it works.
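Spelled out as code, assuming the collision pair was generated ahead of time with an identical-prefix tool such as fastcoll (hypothetical file names; not Edgemesh's code):

```java
// Hypothetical walk-through of the attack. Assumes an MD5 collision pair
// (innocuous.bin, malicious.bin) generated ahead of time, e.g. with fastcoll.
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Arrays;

class CollisionDemo {
    public static void main(String[] args) throws Exception {
        byte[] fileA = Files.readAllBytes(Path.of("innocuous.bin"));     // A: uploaded via avatar/forum
        byte[] filePrime = Files.readAllBytes(Path.of("malicious.bin")); // A': MD5(A') == MD5(A)

        // What the mesh registered after verifying A against origin:
        byte[] registered = MessageDigest.getInstance("MD5").digest(fileA);

        // What the receiving node checks when I replicate A' to it instead:
        byte[] received = MessageDigest.getInstance("MD5").digest(filePrime);

        System.out.println("Replica accepted: " + Arrays.equals(registered, received)); // true, by construction
    }
}
```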
Thanks for the analysis -- it is good that people have this context in their heads when designing systems. The missing conversation from this article is that some people conflate scalability with performance. They are different, and you absolutely trade one for the other. At large scale you end up getting performance simply from being able to throw more hardware at it, but it takes you quite a while to catch up to where you would have been on a single machine.
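As a toy illustration of that catch-up cost (the numbers are made up): if each added node only contributes, say, 70% of a single machine's throughput because of coordination and serialization overhead, you need a few machines just to break even.

```java
// Toy model, made-up numbers: one machine processes at rate 1.0; each clustered
// node contributes at 70% of that due to coordination and serialization overhead.
class CatchUpPoint {
    public static void main(String[] args) {
        double singleMachineRate = 1.0;
        double perNodeEfficiency = 0.7;
        for (int nodes = 1; nodes <= 4; nodes++) {
            double clusterRate = nodes * perNodeEfficiency;
            System.out.printf("%d node(s): %.1fx a single machine%n", nodes, clusterRate / singleMachineRate);
        }
        // 1 node: 0.7x, 2 nodes: 1.4x, 3 nodes: 2.1x, 4 nodes: 2.8x - you only
        // pull ahead once the cluster is big enough to amortize the overhead.
    }
}
```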
That trade-off holds not just for computing algorithms, but for developer time/brain space as well. Single-threaded applications are far simpler to understand.
The takeaway shouldn't be "test it on a single laptop first", but rather "will the volume/velocity of data, now or in the future, absolutely preclude doing this on a single laptop". At my work, we process probably a hundred TB in a few-hour batch processing window at night, terabytes of which remain in memory for fast access. There is no choice there but to pay the overhead.
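Back-of-envelope on why that workload is out of reach for one machine (assuming roughly a 3-hour window; the exact figures don't matter much):

```java
// Back-of-envelope with assumed numbers: 100 TB in a 3-hour nightly window.
class WhyNotOneLaptop {
    public static void main(String[] args) {
        double dataTB = 100.0;
        double windowHours = 3.0;
        double requiredGBps = dataTB * 1000.0 / (windowHours * 3600.0);
        System.out.printf("Sustained throughput needed: %.1f GB/s%n", requiredGBps);
        // ~9.3 GB/s sustained before any processing happens - and 100 TB doesn't
        // fit on a laptop's storage at all, let alone the terabytes that need to
        // stay resident in memory.
    }
}
```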
It’s simple – here’s how it works: Say a community is built in Year 1. The community’s streets need to be rebuilt every 30 years. In Year 30 a new, identical community is built. Now twice the amount of taxes is coming in, so for the time being the property owners only need to pay half the amount. And 30 years later, in Year 60, two new communities are built; as long as the number of properties and property taxes keeps doubling every 30 years, they can continue to pay half the amount.
Year 1, one community (community A), over the next 30 years is going to pay for 1 community's worth of roads (call it a 30 year loan given on day 1, let's say 1M dollars). So, the cost for community A is 1M dollars for 30 community-years of roads.
Year 30, we add a new community B to the mix. This community needs its own roads, so it needs a loan for 1M dollars it will pay off over the next 30 years. However, community A's roads have worn out. Community A just finished paying off their first loan, so they'll need a new one.
At year 60, we have paid 3M dollars, and gotten 90 "community-years" of roads out of it. This is no different than the equivalent end of the first year with one community, one loan, and 30 "community-years" of roads.
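Spelling out that arithmetic (hypothetical $1M, 30-year loan per community, as above):

```java
// Totals at Year 60: A's first loan (years 0-30), plus A's replacement loan and
// B's first loan (years 30-60). Cost per community-year stays the same.
class RoadLoanMath {
    public static void main(String[] args) {
        int loanCostM = 1;            // $1M per 30-year road loan
        int loansByYear60 = 3;        // A (0-30), A again (30-60), B (30-60)
        int communityYears = 60 + 30; // A exists for 60 years, B for 30
        System.out.printf("Paid $%dM for %d community-years => $%.3fM per community-year%n",
                loanCostM * loansByYear60, communityYears,
                (double) (loanCostM * loansByYear60) / communityYears);
        // 3M / 90 = the same ~0.033M per community-year as the first 30 years (1M / 30).
    }
}
```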
Community A doesn't take out a loan. The roads were built with exogenous funds. Starting in 30 years, Community A has to start paying $X a year in maintenance on its roads. Community A taxes itself $X/2 per year and pays for a fire department with it.
In Year 30, Community B is built with exogenous funds. Starting in 30 years, Community B has to start paying $X a year in maintenance on its roads. Community A has to raise taxes to $X, half of which goes to maintenance on its roads and half of which goes to A's fire department, and Community B starts paying $X in taxes every year, half of which goes to maintenance on A's roads and half of which goes to B's fire department.
In Year 60, Community A and Community B each have to come up with an additional $X/2 per year in taxes or cut their fire departments.
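Tabulating that version, with X as the annual road-maintenance cost per community (exogenous construction costs left out):

```java
// Per-stage yearly budget; X = annual road-maintenance cost per community,
// normalized to 1. Exogenous construction costs are left out.
class MaintenanceMath {
    public static void main(String[] args) {
        double x = 1.0;
        // Years 0-30: only A exists and its roads need no maintenance yet.
        printStage("Years 0-30 ", x / 2, 0, x / 2);
        // Years 30-60: A's roads need X/yr; A and B each pay X in taxes.
        printStage("Years 30-60", 2 * x, x, x);
        // Year 60+: both communities' roads need X/yr, but revenue is still 2X.
        printStage("Year 60+   ", 2 * x, 2 * x, x);
    }

    static void printStage(String label, double revenue, double roads, double fire) {
        double shortfall = Math.max(0, roads + fire - revenue);
        System.out.printf("%s revenue=%.1fX roads=%.1fX fire=%.1fX shortfall=%.1fX%n",
                label, revenue, roads, fire, shortfall);
    }
}
```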
Infrastructure is expensive to replace. These towns and cities are only able to afford to do so by making new subdivisions pay for it. This works so long as you keep growing exponentially, forever, which you won't. Someday, there won't be new people to pay for it.
Yes, but you have to know to exclude their backend. And, that library may be several dependencies deep. Now you're expecting potentially junior developers to have the insight to grep their entire transitive dependency tree, find the nop dep, and exclude it. This kind of silent failure is worse than the alternative.
Whoa. The author recommends including slf4j-nop as an explicit dependency of libraries?! This means that when a developer writes an application and forgets to include an slf4j backend, instead of getting a warning that no backend is configured, that developer will get absolutely no output. This is really bad advice.
The JVM works this way because C programs popularized the precedent: refuse to run at all, with a dynamic-link error naming the shared library that was expected but saying nothing about which version of the library should be present on the LD_LIBRARY_PATH - or, in some corner cases, the expected shared library version doesn't actually match the ABI and things fail silently at runtime. That's part of why customizable classloaders exist in the JVM.
That is not what is happening with SLF4J. The nop logger is present in the API artifact. The API artifact chooses to warn when using it. The nop dependency, despite the name, contains no logger at all. It just explicitly forces the usage of the nop logger.
It's pretty clever actually, and works in a straight-forward way.
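For anyone who hasn't hit this, here is roughly what the two situations look like from application code (the exact warning text varies across SLF4J versions, so the comments paraphrase it):

```java
// Requires only slf4j-api on the classpath to compile.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class App {
    private static final Logger log = LoggerFactory.getLogger(App.class);

    public static void main(String[] args) {
        // With only slf4j-api present: SLF4J prints a one-time warning to stderr
        // that no provider/binding was found, then substitutes the no-operation
        // logger - the line below goes nowhere, but you were told why.
        //
        // With slf4j-nop also present: the NOP provider is selected explicitly,
        // the warning disappears, and the line below goes nowhere silently -
        // which is exactly the trap for an application author who simply forgot
        // to add a real backend such as logback-classic.
        log.info("did my logging configuration work?");
    }
}
```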