2. Information that can be actively exploited, but can also be fixed so the previous disclosure is harmless. This means passwords, authentication tokens, etc.
I wouldn't call the disclosure harmless. It's unknown if anyone made use of the leaked information before Cloudflare knew, so accounts should be treated as compromised unless it's shown otherwise.
Also, leaking user credentials for any system that handles payments or health info would breach PCI/HIPAA. That broadens the scope of systems that are effectively breaking the law.
Another thing to keep in mind is that many (most?) token-based authentication systems don't invalidate tokens. So any tokens captured will be valid until they expire, and they can't be "changed" without invalidating every outstanding token (by changing the server key).
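As a rough sketch of what I mean, assuming an HMAC-signed bearer-token scheme (the keys and function names here are made up for illustration):

    import hashlib
    import hmac

    OLD_SERVER_KEY = b"old-server-key"   # hypothetical signing key

    def issue_token(user_id: str, key: bytes) -> str:
        # The token is just the user id plus an HMAC over it using the server key.
        sig = hmac.new(key, user_id.encode(), hashlib.sha256).hexdigest()
        return f"{user_id}.{sig}"

    def verify_token(token: str, key: bytes) -> bool:
        user_id, _, sig = token.partition(".")
        expected = hmac.new(key, user_id.encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(sig, expected)

    token = issue_token("alice", OLD_SERVER_KEY)
    print(verify_token(token, OLD_SERVER_KEY))      # True: stays valid until it expires
    print(verify_token(token, b"new-server-key"))   # False: rotating the key kills every token

There's no per-token revocation in a scheme like this; the only lever is the key, and pulling it logs everyone out.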
No, I mean that after it's fixed, the previously disclosed information becomes harmless. Obviously anyone who exploited it before you reset your password/tokens may have caused you harm.
> Another thing to keep in mind is that many(most?) token based authentication systems don't invalidate tokens.
In my experience, changing your password generally invalidates all outstanding tokens. And yes, this does mean invalidating all of them instead of just the leaked one, but that's not usually a big deal.
My biggest issue is that they didn't release their data-set. With something this major, it's standard to either have a third party investigate or publicly release your data so it can be validated.
For all we know they cherry picked the responses they tested from a single site that doesn't handle anything sensitive.
> For all we know they cherry picked the responses they tested from a single site that doesn't handle anything sensitive.
I don't think you understand how Cloudbleed works. It doesn't matter what site they picked; every single vulnerable site can leak the exact same info. It's literally impossible to cherry-pick that data.
I don't think you understand how the internet works. Some websites only serve static content and don't deal with any sensitive information. Without seeing Cloudflare's data set there is no way to verify that the responses they picked are a representative sample.
Ok you really don't know how Cloudbleed works. Go read up on it. Every single vulnerable site can and did leak the same information. The only way to "cherry-pick" it would be to literally throw away the responses that you saw and didn't like, or in other words, by lying.
I think he was trying to explain to you that, for this particular leak, what was leaked was the private memory of the CloudFlare servers. That memory doesn't hold just a single site's data. It doesn't matter which site triggered the data to be output; the data that came out can belong to any CloudFlare customer, even one with no pages exhibiting the condition that triggered the issue.
Cloudflare mentioned they only went through about 2000 leaked responses out of over a million. That combined with random people on hackernews STILL finding private data after the purge points towards this leak being a lot worse than they're letting on.
Thankfully CloudFlare spent a week cleaning up the leak in search engines and caches before publicly announcing the issue, so a lot of the evidence is gone.
Tagging along on the top reply here, but does anyone else notice some serious HN gaming going on in this thread?
Accounts commenting that have made 10 comments in 5 years, people defending CF who, after going through their profiles, are clearly linked to CloudFlare or may even be employees. Top replies suddenly being near the bottom and replaced by posts supporting CloudFlare with statements that don't match what was actually in the blog post.
First, no, please don't do that. It's one of the worst patterns in active discussions.
Second, if anyone thinks they see gaming or abuse, they should let us know right away at hn@ycombinator.com so we can investigate. Please don't post about it in comments, for a couple reasons: (1) if there really is abuse going on, we need to know, and we don't see most comments; (2) most of us internet readers are orders of magnitude too quick to interpret our own cognitive biases as abuse by others (e.g. X seems obvious to me so anyone arguing ~X must be a shill). This places the threshold for useless, nasty arguments ('you're a shill. no, you're the shill') dangerously low. Combine that with the evil catnip power of all things meta and you get the most malignant strain of offtopicness there is, so we all have to be careful with it. And yes, genuine abuse does also exist. It's complicated that way.
It's not gaming for a CF employee to offer their own insights, but it would be if they're doing it as part of a coordinated effort. Also, they should be upfront about their affiliation.
I'm defending them because they're an incredible value for the $200/month we pay them, they've handled it responsibly, and the real impact looks vanishingly small.
We take this problem seriously and have years of experience with it. In our experience, "very clearly gamed" nearly always reduces to "seems to me, based on personal biases and incomplete information". When abuse really is going on, that's extremely important, but it is certainly fewer than 10%, probably fewer than 5%, and maybe fewer than 1% of the times we hear this. So we actually have two problems, unfortunately intertwined: (1) abuse and (2) this tendency to imagine it.
For this reason, please don't post unsubstantive allegations to HN. If you have evidence, or significant suspicions, email them to hn@ycombinator.com instead.
They only sampled a few thousand leaked responses out of over a million. The margin of error on their conclusions is 2.5% because they didn't use nearly enough data.
Antitrust is all bark and no bite these days. Most industries having 3-4 "competitors" that don't actually compete is the new norm. I wouldn't be surprised if this situation turns out to be linked to increasing wealth inequality.
"To get a flavor of how thoroughly the federal government managed competition throughout the economy in the 1960s, consider the case of Brown Shoe Co., Inc. v. United States, in which the Supreme Court blocked a merger that would have given a single distributor a mere 2 percent share of the national shoe market."
Beyond simple anti-trust, in many cases, we need to reform legal structures so that big players can't use them to block out startups.
It's practically impossible to defend yourself well in a lawsuit against any respectably-sized company without a few million lying around, first of all. That affects everything, and big companies use that fact to bully upstarts and other small innovators into shutting down all the time.
My own business was effectively shuttered (had to stop selling our primary product) by a C&D from a Fortune 100. It would've been 5-10 years and ~$5 million to see that case all the way through, and under current precedent, it's very likely I would have lost.
Aside from that, industries frequently get laws and rules put in place with ostensibly-reasonable rationale, while the actual intent is to make it virtually impossible for disruptive competitors to enter the marketplace.
The CFAA is the piece of legislation that primarily enshrines entrenched players in the online space. We also need to reform copyright law and clarify some matters regarding the applicability of EULAs, especially with regard to clickwrap and browsewrap.
Once that's done, the flood of innovators that have been held back by big companies dispatching their law firms will finally be able to contribute, and the internet's competitive landscape will truly be back in the hands of the users. It will shift from "Who has my data? I have to go with them" to "I can use any interface I want to access that data", effectively resolving the chicken-and-egg effect that imperils any potential competing social network (not even Google could compete with Facebook on this!).
That's because you're viewing this situation entirely from within your prior beliefs about large corporations, which somehow have convinced you that only 3 or 4 hosting companies exist and everyone must use them.
Exactly how do you figure this? There are lots of alternatives to S3:
1. All the cloud stuff, which is the new fangled hotness (3 to 4 companies)
2. Non-cloud hosting providers (hundreds of them, from shared hosting to VPSs and above, and usually cheaper if you don't have high traffic)
3. Hosting your own stuff on your own hardware, which is the cheapest if you really need it, and can be done in a data center just about anywhere in the world with good internet.
So, what planet is this that you live on where 3 to 4 companies handle all hosting and we have to use them?
In total, between 22 September 2016 and 18 February 2017 we now estimate based on our logs the bug was triggered 1,242,071 times.
Wow, so just as bad as we thought.
We did not find any passwords, credit cards, health records, social security numbers, or customer encryption keys in the sample set.
BUT WAIT, THERE'S MORE
The sample included thousands of pages and was statistically significant to a confidence level of 99% with a margin of error of 2.5%.
Oh, so it could actually be as high as 2.5% leaking encryption credentials. And if none of the data was found to leak anything sensitive, where the fuck is the dataset? I've been around way too long to take a "study" like this at face value without third party verification.
I also enjoy the straight up lie at the end:
We are continuing to work with third party caches to expunge leaked data and will not let up until every bit has been removed.
That sounds great, right? Well, it's too bad that a lot of 'third parties' are boxes sitting on corporate network edges that haven't been touched in 5 years. Deleting all of this data from third party caches is not physically possible. In fact it might actually make things worse, because it's destroying evidence of which credentials were leaked.
> We are continuing to work with third party caches
One of the caches they worked with was Baidu, which has direct ties to Chinese intelligence. Just because it isn't publicly available doesn't mean people aren't still poring over it looking for useful data.
Also, a lot of web spiders are not benign. All sorts of bots trawl the internet looking specifically for data leaks like publicly visible email addresses and SSNs. I'm sure they're having a field day.
It seems to me that the absolute number is what's relevant, not how it compares to the total amount of traffic. That's 1.2 million potential data leaks. That it's "out of a bazillion" doesn't change it.
And 1.2 million people dying is nothing compared to the number of people on the earth.
IMO a leak this bad should be enough to sink Cloudflare. A provider of SSL was randomly spitting out private data onto public websites. OVER A MILLION TIMES. Entire CAs have been shut down for leaking a couple hundred certificates. This has leaked private data over a million times; Cloudflare is a joke.
Changing your password would not help if a session cookie is leaked. In many instances you cannot do anything about that, as many services do not have any "logout all" feature.
It's good practice to destroy all sessions (besides the current one) when a password is changed, since a password change suggests that the old password may have been compromised. Not sure how many websites do that in practice, though.
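With a server-side session store it's just something like this (a toy in-memory sketch, not any particular framework):

    # Toy session store: session_id -> user_id
    sessions = {"s1": "alice", "s2": "alice", "s3": "bob"}

    def destroy_other_sessions(user_id: str, current_session_id: str) -> None:
        # On password change, drop every session for this user except the current one.
        for sid in [s for s, uid in sessions.items() if uid == user_id]:
            if sid != current_session_id:
                del sessions[sid]

    destroy_other_sessions("alice", current_session_id="s1")
    print(sessions)   # {'s1': 'alice', 's3': 'bob'}

The hard part isn't the code, it's that the site has to be able to enumerate a user's sessions at all.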
And this is why I changed the key I use for cookies on the application I had that was behind Cloudflare. This triggered all users to be logged out and invalidated any session cookie out there.
So, yes, responsible websites can mitigate session cookies being leaked.
That said, I am not impressed by Cloudflare's transparency which in this case consists of downplaying things, blaming Google and Taviso and not really taking responsibility.
On sites I write, I hash the hash of the current password into the session key. That way if you change your password all sessions are invalid, even if you change your password to itself.
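Roughly like this (a minimal sketch, not my production code; it assumes the stored password hash gets a fresh salt on every change, which is what makes the change-to-itself case work):

    import hashlib
    import hmac

    def session_key(user_id: str, stored_password_hash: str, server_secret: bytes) -> str:
        # Bind the session key to the *current* password hash. When the password
        # (and therefore its salted hash) changes, every previously issued
        # session key stops verifying.
        msg = f"{user_id}:{stored_password_hash}".encode()
        return hmac.new(server_secret, msg, hashlib.sha256).hexdigest()

    def session_is_valid(presented: str, user_id: str, stored_password_hash: str,
                         server_secret: bytes) -> bool:
        expected = session_key(user_id, stored_password_hash, server_secret)
        return hmac.compare_digest(presented, expected)

    secret = b"server-secret"   # illustrative value
    old = session_key("alice", "$2b$12$oldsalt.oldhash", secret)
    print(session_is_valid(old, "alice", "$2b$12$newsalt.newhash", secret))   # False after a change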
So every website that uses cloudflare should ask their users to change all of their passwords, credit card numbers, and SSN's?
This leak is being downplayed by webmasters because it's so incredibly bad that there's no way of handling it. The credentials of practically any internet user could have been leaked. The only "safe" way to handle this is to give everyone in the US new credit cards and SSN's and to reset accounts and security questions for every user on a site with cloudflare
No, but judging by how much you are freaking out in the comments, I was recommending that you change yours. I'm not really sure what you want anyone else to do. It seems like you are just screaming at your monitor over something that, while a significant bug, isn't a huge deal. This is just basic risk management.
Credit cards and SSNs are regularly compromised. The real problem is that they are used as an authentication mechanism. That's what we should be concerned about.
This issue is a drop in the bucket when it comes to the amount of sensitive data leaked.
1.3m pages served with "extra" data is minuscule considering the number of actual pages served. Ideally you'd have no such pages served, but when you've got a bug-proof technology the world will be your oyster.
Realistically, any hash table bigger than the 32-bit boundary will be well outside any CPU cache and possibly even RAM. At those sizes, memory latency will dwarf CPU speed.
The author seems mistaken about a couple things relating to hash tables. He doesn't seem aware that, by far, the main reason for using power-of-two hash table sizes is so you can use a simple bitshift to eliminate modulo entirely.
He is also going to run into modulo bias, since using the remainder to calculate a slot position is biased toward certain positions for every table size that isn't a power of two. (see https://ericlippert.com/2013/12/16/how-much-bias-is-introduc... for cool graphs) Prime number table sizes do nothing to fix this issue. The power-of-two size with a bitshift is not just faster, it gets rid of the modulo bias.
Fastrange is much faster than modulo but if his goal is to build the fastest hashtable it's stupid to use a table size of anything except a power of two.
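For anyone following along, the three reductions being argued about look roughly like this (a sketch, not the article's actual code; it assumes a 64-bit hash value):

    def slot_modulo(h: int, size: int) -> int:
        # Works for any size (prime or not), but the division is the slow part.
        return h % size

    def slot_mask(h: int, size: int) -> int:
        # Power-of-two sizes only: keep the low bits with a single AND.
        assert size & (size - 1) == 0, "size must be a power of two"
        return h & (size - 1)

    def slot_fastrange(h: int, size: int) -> int:
        # Lemire's fastrange: map a 64-bit hash onto [0, size) with a multiply
        # and a shift, no division. Note it keys off the *high* bits of h.
        return (h * size) >> 64

    h = 0x9E3779B97F4A7C15   # some 64-bit hash value
    print(slot_modulo(h, 1021), slot_mask(h, 1024), slot_fastrange(h, 1021))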
> He doesn't seem aware that, by far, the main reason for using power-of-two hash table sizes is so you can use a simple bitshift to eliminate modulo entirely.
[...] using a power of two to size the table so that a hash can be mapped to a slot just by looking at the lower bits.
There is a lot of material in the article on power-of-two vs non-power-of-two sizes and it definitely includes a discussion of how POT is convenient due to modulo becoming a bitmask (not a bitshift, which would be division).
That bias graph is made assuming you are picking numbers between 0 and 32768. With a 64-bit hash the bias is negligible. With a 32-bit hash I suppose you could argue significance around 100 million items.
I think the text is clear that the reason the hash table supports power of two size tables is to avoid the modulo operation.
>The author seems mistaken about a couple things relating to hash tables. He doesn't seem aware that, by far, the main reason for using power-of-two hash table sizes is so you can use a simple bitshift to eliminate modulo entirely.
From the article:
>This is the main reason why really fast hash tables usually use powers of two for the size of the array. Then all you have to do is mask off the upper bits, which you can do in one cycle.
Yeah I'm gonna need an explanation too. Even with hashes hardened against DOS attacks, the lower bits tend to vary with the input far faster than the upper bits.
For instance, several popular hash algorithms (of the form 31 * hash + value) will have the same upper bits for any two strings of the same length, up to about 12 characters in length when the seed value finally gets pushed out. Unless you're still using 32 bit hashes for some reason and then it's most strings under 6 characters long.
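Quick way to see it (assuming the usual h = 31*h + c construction, truncated to 64 bits):

    def hash31(s: str, bits: int = 64) -> int:
        h = 0
        for c in s:
            h = (31 * h + ord(c)) & ((1 << bits) - 1)
        return h

    for s in ("password", "hunter42", "qwertyui"):   # all 8 characters
        h = hash31(s)
        print(f"{s}: top 16 bits = {h >> 48:#x}, low 16 bits = {h & 0xFFFF:#x}")
    # The top bits come out identical (zero) for every 8-character ASCII string,
    # while the low bits differ from string to string -- which is why fast tables
    # index with the low bits rather than the high ones.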
Multiplying by a prime and then adding or xoring the next value guarantees that the bottom bits are less biased, so it works with any hash table that uses modular reduction, even if the modulus is taken by bit masking. Getting the upper bits to mix would take a different operator or larger operands.
I know I've encountered at least one hash function that uses a much larger prime (16 or more bits) as its multiplier, so it's not unheard of, but it's certainly not universal.
But he doesn't know about ctz (counting trailing zeros) for fast load/fill-rate detection. With 50% it's trivial (just a shift), but for 70% or 90% you'd need to use __builtin_ctz.
Not that I disagree with you, but a number of these so-called 'prime' hashtable implementations don't actually use prime numbers. They use 'relatively prime' numbers.
If you happened to use 2^n ± 1 you'd have very little bias according to that map, but you wouldn't strictly be using a power of 2.
Unfortunately Java's original hashtable started at 11 and increased using f(m) = 2 * f(m - 1) + 1, giving you 23, 47, 95, 191, etc. That lands pretty close to the bad parts of the bias curve shown in that link.
Makes me wonder, if Java hashtables had been given a default size of 15 or 7, if this ever would have been fixed...
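For reference, that growth rule gives:

    sizes = [11]                            # java.util.Hashtable's old default capacity
    for _ in range(6):
        sizes.append(2 * sizes[-1] + 1)     # rehash rule: new = 2 * old + 1
    print(sizes)   # [11, 23, 47, 95, 191, 383, 767]

None of those are powers of two, and (as noted above) several land near the ugly parts of that bias curve.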