The Go team has been making progress toward a complete fix for this problem.
Go 1.19 added "go mod download -reuse", which lets the command be told about a previous download result, including the Git commit refs involved and their hashes. If the relevant parts of the server's advertised ref list are unchanged since the previous download, then the refresh fetches nothing more than the ref list, which is very cheap.
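A rough sketch of the mechanism (git.example.org/somemodule is a made-up module path, and in practice the proxy drives this rather than end users):

```
# Record the result of an initial download, including the origin
# metadata (VCS, URL, refs, hashes) that later invocations can reuse.
go mod download -json git.example.org/somemodule@latest > prev.json

# On refresh, pass the previous result back; if the server's advertised
# refs are unchanged, only the cheap ref-list exchange happens and the
# repository is not re-cloned.
go mod download -reuse=prev.json -json git.example.org/somemodule@latest
```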
The proxy.golang.org service has not yet been updated to use -reuse, but it is on our list of planned work for this year.
On the one hand, Sourcehut claims this is a big problem for them; on the other hand, Sourcehut has also told us they don't want us to put in a special case to disable background refreshes (see the comment thread elsewhere on this page [1]).
The offer to disable background refreshes until a more complete fix can be deployed still stands, both to Sourcehut and to anyone else who is bothered by the current load. Feel free to post an issue at https://go.dev/issue/new or email me at rsc@golang.org if you would like to opt your server out of background refreshes.
I realize that in the real world most modules are probably hosted by large providers that can absorb the bandwidth, like GitHub, but it seems incredibly discourteous not to prioritize fixing the hammering of small providers, especially two years on, when the response is still "maybe later this year".
I think Drew is right in that he shouldn't take a personalized Sourcehut-only exception because this doesn't address the core issue for any new small providers that pop up.
Between this and the response in the original thread that said, "For boring technical reasons, it would be a fair bit of extra work for us to read robots.txt," it gives the impression that the Go team doesn't care. Sometimes what we _need_ to do to be good netizens is a fair bit of boring technical work but it's essential.
It's super weird that the Google-side Go folks' responses to this have basically been "we don't have the resources to responsibly run this service that we decided to run and that's now misbehaving". Like... don't run it, then? Why take on that kind of thing in the first place if an urgent fix for its generating abusive traffic for no good reason takes three years?
The "exclusion" solution is definitely not personalized or Sourcehut-only; it's available to anyone who requests it, and in the issue tracker you can see several people who are already using this exclusion.
True, an opt-out (which is what this solution boils down to) is not ideal, but it's far better than using your users' quality of service to try to strong-arm your side of the argument. In any case, the Go team has made it clear that they are working on improving the refresh situation, and this opt-out is just a temporary measure until they fix the real issue.
Hi Russ! Thank you for sharing. I am pleased to hear that there is finally some progress towards a solution for this problem. If you or someone working on the issue can reach out via email (sir@cmpwn.com), I would be happy to discuss the issue further. What you described seems like an incomplete solution, and I would like to discuss some additional details with your team, but it is a good start. I'm also happy to postpone or cancel the planned ban on the Go proxy if there's active motion towards a fix from Google's end. I am, however, a bit uneasy that you mentioned that it's only prioritized for "this year" -- another year of enduring a DoS from Google does not sound great.
I cannot file an issue; as the article explains I was banned from the Go community without explanation or recourse; and the workaround is not satisfying for reasons I outlined in other HN comments and on GitHub. However, I would appreciate receiving a follow-up via email from someone knowledgeable on the matter, and so long as there is an open line of communication I can be much more patient. These things are easily solved when they're treated with mutual respect and collaboration between engineering teams, which has not been my experience so far. That said, I am looking forward to finally putting this issue behind us.
Why does the Go team and/or Google think that it's acceptable to not respect robots.txt and instead DDoS git repositories by default, unless they get put on a list of "special case[s] to disable background refreshes"?
Why was the author of the post banned without notice from the Go issue tracker, removing what is apparently the only way to get on this list aside from emailing you directly?
Do you, personally, find any of this remotely acceptable?
FWIW I don't think this really fits into robots.txt. That file is mostly aimed at crawlers, not at services loading specific URLs due to (sometimes indirect) user requests.
...but as a place that could hold a rate-limit recommendation it would be nice, since the Git protocol doesn't appear to have an equivalent of a Cache-Control header.
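If robots.txt were pressed into that role, the non-standard Crawl-delay directive is roughly the shape it would take; note it is not part of the RFC 9309 standard, and the user-agent token below is an assumption about how the proxy identifies itself:

```
# Hypothetical robots.txt hinting a minimum delay (in seconds)
# between requests from the module proxy's fetcher.
User-agent: GoModuleMirror
Crawl-delay: 60
```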
> Not for services loading specific URLs due to (sometimes indirect) user requests.
A crawler has a list of resources that it periodically checks for changes, indexing whatever changed for user requests.
Contrast that with this totally-not-a-crawler, with its own database of existing resources, which periodically checks whether anything changed and, if it did, caches the content and builds checksums.
I'm taking the OP at his word here, but he specifically claims that the proxy service making these requests will also make requests independent of a `go get` or other user-initiated action, sometimes to the tune of a dozen repos at once and 2500 requests per hour. That sounds like a crawler to me, and even if you want to argue the semantic meaning of the word "crawler," I strongly feel that robots.txt is the best available solution to inform the system what its rate limit should be.
After reading this and your response to a sibling comment I wholeheartedly disagree with you on both the specific definition of the word crawler and what the "main purpose" of robots.txt is, but glad we can agree that Google should be doing more to respect rate limits :)
As annoying as it is, there is precedent for this opinion with RSS aggregator websites like Feedly. They discover new feed URLs when their users add them, and then keep auto-refreshing them without further explicit user interaction. They don't respect robots.txt either.
I wouldn't expect or want an RSS aggregator to respect robots.txt for explicitly added feeds. That is effectively a human action asking for that feed to be monitored so robots.txt doesn't apply.
What would be good is respecting `Cache-Control`, which unfortunately many RSS clients don't do; they just pick a schedule and poll on it.
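For illustration, a minimal Go sketch of what honoring max-age could look like in a feed poller (deliberately crude; a real client would also send conditional requests using ETag / Last-Modified, and the fallback interval is up to the client):

```go
package feedpoller

import (
	"net/http"
	"strconv"
	"strings"
	"time"
)

// nextPoll returns when a feed should next be fetched, honoring the
// server's Cache-Control max-age when present and falling back to a
// fixed schedule otherwise.
func nextPoll(resp *http.Response, fallback time.Duration) time.Time {
	for _, d := range strings.Split(resp.Header.Get("Cache-Control"), ",") {
		d = strings.TrimSpace(d)
		if v, ok := strings.CutPrefix(d, "max-age="); ok {
			if secs, err := strconv.Atoi(v); err == nil && secs > 0 {
				return time.Now().Add(time.Duration(secs) * time.Second)
			}
		}
	}
	return time.Now().Add(fallback)
}
```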
I want my software to obey me, not someone else. If the software is discovering resources on its own, then obeying robots.txt is fair. But if the software is polling a resource I explicitly told it to, I would not expect it to make additional requests to fetch unrelated files such as a robots.txt
I can almost see both sides here... But ultimately when you are using someone else's resources, then not respecting their wishes (within reason) just makes you an asshole.
Google began pushing for it to become an Internet standard in 2019 (explicitly to be applicable to any URI-driven Internet system, not just the Web), and it was adopted as an Internet standard, RFC 9309, in 2022.
This is true but irrelevant to the parent's question -- in the article, it's made clear that Google's requests are happening over HTTP, which is the most obvious reason why robots.txt should be respected.
Read the OP; it's obvious from the references to robots.txt, the User-Agent header, returning a 429 response, etc., that most (all?) of Google's requests are git clones over HTTP(S).
I suspect they have a problem with this "DDoS by default unless you ask to opt out" behavior. Why is anyone getting hit with these expensive background refreshes before you have a chance to do it right? Why is it still not done right two years after this was first reported?
Maybe it should be an opt-in list where the big providers (such as GitHub) can be hit by an army of bots and everyone else is safe by default.
This smells wildly overdramatic. They've been working on solutions big and small since it was reported; it's just that the big solutions take time, and this was communicated to Drew.
This reminds me a bit of a dysfunctional relationship: clearly Sourcehut wants Google to stop DDoSing their servers, and clearly Google doesn't actually want to DDoS Sourcehut, but Sourcehut doesn't want to ask Google to stop, and Google wants to be asked to stop. And so nothing gets done.
The question is who will swallow their pride first: Sourcehut or Google.
This isn't true. Sourcehut reported a bug, and since the bug is somewhat involved to fix entirely, we asked what the impact of the bug is to them and offered to make a custom change for the site in the interim. The impact matters: the appropriate response is different for "I saw this in my logs and it looks weird but it's not bothering me" versus "this is causing serious problems for my site". We have been getting mixed signals about which it is, as I noted, but since Sourcehut told us explicitly not to put in a special case, we haven't.
Your comment in this thread is the first time I've seen anyone mention that it was being worked on since... June 2021? This despite repeatedly raising the issue up until I was banned without explanation. I was never told, and still don't know, what disabling the refresh entails, the ban prevents me from discussing the matter further, and I was under the impression that no one was working on it. We have suffered a serious communication failure in this incident. That said, I am looking forward to your follow-up email and seeing this issue resolved in a timely and amicable manner.
> the ban prevents me from discussing the matter further
Hi ddevault, FWIW, in May 2022 on that #44577 issue [0] you had opened, it looks like someone on the core Go team commented there [1] recommending that you email the golang-dev mailing list or email them directly.
Separately, it looks like in July 2022, in one of the issues tracking the new friendlier -reuse flag, there was a mention [2] of the #44577 issue you had opened. In the normal course, that would have triggered an automatic update on your #44577 issue... but I suspect because that #44577 issue had been locked by one of the community gardeners as "too heated", that automatic update didn't happen. (Edit: It looks like it was locked due to a series of rapid comments from people unrelated to Sourcehut, including about “scummy behavior”).
Of course, communication on large / sprawling open source projects is never quite perfect, but that's a little extra color...
The offer in [1] was to email the ML to ask for an exclusion, not to continue discussing the general issue which was still being discussed in the GH issue.
And given that they banned him for no reason, he is perfectly in the right to tell them that they should email him instead.
> the appropriate response is different for "I saw this in my logs and it looks weird but it's not bothering me" versus "this is causing serious problems for my site". We have been getting mixed signals about which it is
We have not been reading the same tickets and articles, it seems.
No problems were ever mentioned, serious or otherwise. Elevated traffic isn't automatically a problem. Drew's played it up quite a lot elsewhere, but the Go team can only be reasonably expected to follow the one issue filed, not Drew's entire online presence.
Yes. He had plenty of opportunity to state problems if they existed. Relaying harm caused would have likely accelerated things, and if harm was being done he would have taken up the still-open offer to solve this problem in the interim while the real solution is pushed out instead of writing misrepresentative and openly salty blog posts for years. Even with him being banned, the Go team is still tracking this issue, still brings it up internally, and has pushed a feature that would fix this ahead by an entire release.
So yes. The issue they banned him from. Because reality's more complicated than flippant one liners.
Thanks for the insight, Russ. Would you comment on what the potential consequences of opting out of background refreshes would be? Could there be any adverse effects for users?
Opting out of background refreshes would mean that a module version that (1) no one else had fetched in the past few days and (2) does not use a recognized open-source license might not be in the cache, which would make 'go get' take a little extra time while the proxy fetched it on demand. The amount of time would depend on the size of the repo, of course.
The background refresh is meant to prefetch for that situation, to avoid putting that time on an actual user request. It's not perfect but it's far less disruptive than having to set GOPRIVATE.
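For comparison, the GOPRIVATE route bypasses the proxy and the checksum database entirely for matching paths; a sketch of what that involves on the user's side (the module path below is a hypothetical example):

```
# Fetch modules under git.sr.ht directly from the origin, skipping
# proxy.golang.org and sum.golang.org for those paths.
go env -w GOPRIVATE=git.sr.ht

# Every fetch now clones from the origin rather than hitting the cache.
go get git.sr.ht/~someuser/somemodule
```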
Some people today raised concerns that disabling background refreshes (the temporary workaround originally suggested by the Go team) might result in unacceptably poor performance for end users...
...but it sounds like disabling background refreshes would have strictly better end-user performance than what the Sourcehut team had been planning as described in their blog post today (GOPRIVATE and whatnot)?
Hey Russ, I got your messages that my emails aren't coming through but I'm not sure why. As an alternative, you can reach me on IRC at ddevault on Libera Chat. I'm in CEST, but my bouncer is always online. Cheers!
[1] https://news.ycombinator.com/item?id=34311621