The Go team has been making progress toward a complete fix for this problem.
Go 1.19 added "go mod download -reuse", which lets the command be told about a previous download result, including the Git commit refs involved and their hashes. If the relevant parts of the server's advertised ref list are unchanged since the previous download, then the refresh fetches nothing more than the ref list, which is very cheap.
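A rough sketch of the mechanism (git.example.org/somemodule is a made-up module path, and in practice the proxy drives this rather than end users):

```
# Record the result of an initial download, including the origin
# metadata (VCS, URL, refs, hashes) that later invocations can reuse.
go mod download -json git.example.org/somemodule@latest > prev.json

# On refresh, pass the previous result back; if the server's advertised
# refs are unchanged, only the cheap ref-list exchange happens and the
# repository is not re-cloned.
go mod download -reuse=prev.json -json git.example.org/somemodule@latest
```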
The proxy.golang.org service has not yet been updated to use -reuse, but it is on our list of planned work for this year.
On the one hand, Sourcehut claims this is a big problem for them; on the other hand, Sourcehut has also told us they don't want us to put in a special case to disable background refreshes (see the comment thread elsewhere on this page [1]).
The offer to disable background refreshes until a more complete fix can be deployed still stands, both to Sourcehut and to anyone else who is bothered by the current load. Feel free to post an issue at https://go.dev/issue/new or email me at rsc@golang.org if you would like to opt your server out of background refreshes.
I realize that in the real world most modules are probably hosted by large providers that can absorb the bandwidth, like GitHub, but it seems incredibly discourteous not to prioritize fixing the hammering of small providers, especially two years on, when the response is still "maybe later this year".
I think Drew is right in that he shouldn't take a personalized Sourcehut-only exception because this doesn't address the core issue for any new small providers that pop up.
Between this and the response in the original thread that said, "For boring technical reasons, it would be a fair bit of extra work for us to read robots.txt," it gives the impression that the Go team doesn't care. Sometimes what we _need_ to do to be good netizens is a fair bit of boring technical work but it's essential.
It's super weird that the Google-side Go folks' responses to this have basically been "we don't have the resources to responsibly run this service that we decided to run and that's now misbehaving". Like... don't run it, then? Why take on that kind of thing in the first place if an urgent fix for its generating abusive traffic for no good reason takes three years?
The "exclusion" solution is definitely not personalized or Sourcehut-only; it's available to anyone who requests it, and in the issue tracker you can see several people who are already using this exclusion.
True, an opt-out (which is what this solution boils down to) is not ideal, but it's far better than using your users' quality of service to try to strong-arm your side of the argument. In any case, the Go team has made it clear that they are working on improving the refresh situation, and this opt-out is just a temporary measure until they fix the real issue.
Hi Russ! Thank you for sharing. I am pleased to hear that there is finally some progress towards a solution for this problem. If you or someone working on the issue can reach out via email (sir@cmpwn.com), I would be happy to discuss the issue further. What you described seems like an incomplete solution, and I would like to discuss some additional details with your team, but it is a good start. I'm also happy to postpone or cancel the planned ban on the Go proxy if there's active motion towards a fix from Google's end. I am, however, a bit uneasy that you mentioned that it's only prioritized for "this year" -- another year of enduring a DoS from Google does not sound great.
I cannot file an issue; as the article explains I was banned from the Go community without explanation or recourse; and the workaround is not satisfying for reasons I outlined in other HN comments and on GitHub. However, I would appreciate receiving a follow-up via email from someone knowledgeable on the matter, and so long as there is an open line of communication I can be much more patient. These things are easily solved when they're treated with mutual respect and collaboration between engineering teams, which has not been my experience so far. That said, I am looking forward to finally putting this issue behind us.
Why does the Go team and/or Google think that it's acceptable to not respect robots.txt and instead DDoS git repositories by default, unless they get put on a list of "special case[s] to disable background refreshes"?
Why was the author of the post banned without notice from the Go issue tracker, removing what is apparently the only way to get on this list aside from emailing you directly?
Do you, personally, find any of this remotely acceptable?
FWIW I don't think this really fits into robots.txt. That file is mostly aimed at crawlers, not at services loading specific URLs due to (sometimes indirect) user requests.
...but as a place that could hold a rate-limit recommendation it would be nice, since the Git protocol doesn't appear to have an equivalent of a Cache-Control header.
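If robots.txt were pressed into that role, the non-standard Crawl-delay directive is roughly the shape it would take; note it is not part of the RFC 9309 standard, and the user-agent token below is an assumption about how the proxy identifies itself:

```
# Hypothetical robots.txt hinting a minimum delay (in seconds)
# between requests from the module proxy's fetcher.
User-agent: GoModuleMirror
Crawl-delay: 60
```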
> Not for services loading specific URLs due to (sometimes indirect) user requests.
A crawler has a list of resources that it periodically checks for changes, indexing whatever changed for user requests.
Contrast that with this totally-not-a-crawler, with its own database of existing resources, which periodically checks whether anything changed and, if it did, caches the content and builds checksums.
I'm taking the OP at his word here, but he specifically claims that the proxy service making these requests will also make requests independent of a `go get` or other user-initiated action, sometimes to the tune of a dozen repos at once and 2500 requests per hour. That sounds like a crawler to me, and even if you want to argue the semantic meaning of the word "crawler," I strongly feel that robots.txt is the best available solution to inform the system what its rate limit should be.
After reading this and your response to a sibling comment I wholeheartedly disagree with you on both the specific definition of the word crawler and what the "main purpose" of robots.txt is, but glad we can agree that Google should be doing more to respect rate limits :)
As annoying as it is, there is precedent for this opinion with RSS aggregator websites like Feedly. They discover new feed URLs when their users add them, and then keep auto-refreshing them without further explicit user interaction. They don't respect robots.txt either.
I wouldn't expect or want an RSS aggregator to respect robots.txt for explicitly added feeds. That is effectively a human action asking for that feed to be monitored so robots.txt doesn't apply.
What would be good is respecting `Cache-Control`, which unfortunately many RSS clients don't do; they just pick a schedule and poll on it.
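For illustration, a minimal Go sketch of what honoring max-age could look like in a feed poller (deliberately crude; a real client would also send conditional requests using ETag / Last-Modified, and the fallback interval is up to the client):

```go
package feedpoller

import (
	"net/http"
	"strconv"
	"strings"
	"time"
)

// nextPoll returns when a feed should next be fetched, honoring the
// server's Cache-Control max-age when present and falling back to a
// fixed schedule otherwise.
func nextPoll(resp *http.Response, fallback time.Duration) time.Time {
	for _, d := range strings.Split(resp.Header.Get("Cache-Control"), ",") {
		d = strings.TrimSpace(d)
		if v, ok := strings.CutPrefix(d, "max-age="); ok {
			if secs, err := strconv.Atoi(v); err == nil && secs > 0 {
				return time.Now().Add(time.Duration(secs) * time.Second)
			}
		}
	}
	return time.Now().Add(fallback)
}
```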
I want my software to obey me, not someone else. If the software is discovering resources on its own, then obeying robots.txt is fair. But if the software is polling a resource I explicitly told it to, I would not expect it to make additional requests to fetch unrelated files such as a robots.txt
I can almost see both sides here... But ultimately when you are using someone else's resources, then not respecting their wishes (within reason) just makes you an asshole.
Google began pushing for it to become an Internet standard in 2019 (explicitly to be applicable to any URI-driven Internet system, not just the Web), and it was adopted as an Internet standard, RFC 9309, in 2022.
This is true but irrelevant to the parent's question -- in the article, it's made clear that Google's requests are happening over HTTP, which is the most obvious reason why robots.txt should be respected.
Read the OP; it's obvious from the references to robots.txt, the User-Agent header, returning a 429 response, etc., that most (all?) of Google's requests are git clones over HTTP(S).
I suspect they have a problem with this "DDoS by default unless you ask to opt out" behavior. Why is anyone getting hit with these expensive background refreshes before you have a chance to do it right? Why is it still not done right two years after this was first reported?
Maybe it should be an opt-in list where the big providers (such as GitHub) can be hit by an army of bots and everyone else is safe by default.
This smells wildly overdramatic. They've been working on solutions big and small since it was reported; it's just that the big solutions take time, and this was communicated to Drew.
This reminds me a bit of a dysfunctional relationship: clearly Sourcehut wants Google to stop DDoSing their servers, and clearly Google doesn't actually want to DDoS Sourcehut, but Sourcehut doesn't want to ask Google to stop, and Google wants to be asked to stop. And so nothing gets done.
The question is who will swallow their pride first: Sourcehut or Google.
This isn't true. Sourcehut reported a bug, and since the bug is somewhat involved to fix entirely, we asked what the impact of the bug is to them and offered to make a custom change for the site in the interim. The impact matters: the appropriate response is different for "I saw this in my logs and it looks weird but it's not bothering me" versus "this is causing serious problems for my site". We have been getting mixed signals about which it is, as I noted, but since Sourcehut told us explicitly not to put in a special case, we haven't.
Your comment in this thread is the first time I've seen anyone mention that it was being worked on since... June 2021? This despite repeatedly raising the issue up until I was banned without explanation. I was never told, and still don't know, what disabling the refresh entails, the ban prevents me from discussing the matter further, and I was under the impression that no one was working on it. We have suffered a serious communication failure in this incident. That said, I am looking forward to your follow-up email and seeing this issue resolved in a timely and amicable manner.
> the ban prevents me from discussing the matter further
Hi ddevault, FWIW, in May 2022 on that #44577 issue [0] you had opened, it looks like someone on the core Go team commented there [1] recommending that you email the golang-dev mailing list or email them directly.
Separately, it looks like in July 2022, in one of the issues tracking the new friendlier -reuse flag, there was a mention [2] of the #44577 issue you had opened. In the normal course, that would have triggered an automatic update on your #44577 issue... but I suspect because that #44577 issue had been locked by one of the community gardeners as "too heated", that automatic update didn't happen. (Edit: It looks like it was locked due to a series of rapid comments from people unrelated to Sourcehut, including about “scummy behavior”).
Of course, communication on large / sprawling open source projects is never quite perfect, but that's a little extra color...
The offer in [1] was to email the ML to ask for an exclusion, not to continue discussing the general issue which was still being discussed in the GH issue.
And given that they banned him for no reason, he is perfectly in the right to tell them that they should email him instead.
> the appropriate response is different for "I saw this in my logs and it looks weird but it's not bothering me" versus "this is causing serious problems for my site". We have been getting mixed signals about which it is
We have not been reading the same tickets and articles, it seems.
No problems were ever mentioned, serious or otherwise. Elevated traffic isn't automatically a problem. Drew's played it up quite a lot elsewhere, but the Go team can only be reasonably expected to follow the one issue filed, not Drew's entire online presence.
Yes. He had plenty of opportunity to state problems if they existed. Relaying harm caused would have likely accelerated things, and if harm was being done he would have taken up the still-open offer to solve this problem in the interim while the real solution is pushed out instead of writing misrepresentative and openly salty blog posts for years. Even with him being banned, the Go team is still tracking this issue, still brings it up internally, and has pushed a feature that would fix this ahead by an entire release.
So yes. The issue they banned him from. Because reality's more complicated than flippant one liners.
Thanks for the insight, Russ. Would you comment on what the potential consequences of opting out of background refreshes would be? Could there be any adverse effects for users?
Opting out of background refreshes would mean that a module version that (1) no one else had fetched in the past few days and (2) does not use a recognized open-source license might not be in the cache, which would make 'go get' take a little extra time while the proxy fetched it on demand. The amount of time would depend on the size of the repo, of course.
The background refresh is meant to prefetch for that situation, to avoid putting that time on an actual user request. It's not perfect but it's far less disruptive than having to set GOPRIVATE.
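For comparison, the GOPRIVATE route bypasses the proxy and the checksum database entirely for matching paths; a sketch of what that involves on the user's side (the module path below is a hypothetical example):

```
# Fetch modules under git.sr.ht directly from the origin, skipping
# proxy.golang.org and sum.golang.org for those paths.
go env -w GOPRIVATE=git.sr.ht

# Every fetch now clones from the origin rather than hitting the cache.
go get git.sr.ht/~someuser/somemodule
```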
Some people today raised concerns that disabling background refreshes (the temporary workaround originally suggested by the Go team) might result in unacceptably poor performance for end users...
...but it sounds like disabling background refreshes would have strictly better end-user performance than what the Sourcehut team had been planning as described in their blog post today (GOPRIVATE and whatnot)?
Hey Russ, I got your messages that my emails aren't coming through but I'm not sure why. As an alternative, you can reach me on IRC at ddevault on Libera Chat. I'm in CEST, but my bouncer is always online. Cheers!
[1] https://news.ycombinator.com/item?id=34311621