
Many issues with this analysis, some already mentioned by others, including:

• The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP

• In many cases any responding web server will be on the `www.` subdomain, rather than the bare domain that was listed/probed – & not everyone sets up `www.` to respond/redirect. (The author misinterprets appearances of both `www.domain` and `domain` in his source list as errant duplicates, when in fact the `www.domain` entries may indicate those sites also have significant `subdomain.www.domain` hostnames – depending on what Majestic means by 'subnets'.)

• Many sites may block `curl` requests because they only want attended human browser traffic, and such blocking (while usually accompanied by some error response) can take the form of a more aggressive connection drop.

• `curl` given a naked hostname attempts a plain-HTTP connection, and given that even browsers now auto-prefix `https://` for a naked hostname, some active sites likely have nothing listening on the plain-HTTP port anymore. (A more forgiving probe is sketched after this list.)

• The author's burst of activity could've triggered other rate-limits/failures – either at shared hosts/inbound proxies servicing many of the target domains, or at local ISP egresses or DNS services. He'd need to drill down into individual failures to get a better idea of the extent to which this might be happening.
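
For concreteness, a minimal sketch (mine, not the author's method) of a more forgiving liveness probe that addresses the last three points: it tries `https://` before `http://`, tries the `www.` host as well as the bare domain, sends a browser-like User-Agent, and treats any HTTP status – even an error – as proof that something is serving. The User-Agent string and timeout are illustrative choices:

    # More forgiving "is anything serving HTTP here?" probe than a bare `curl domain`.
    import urllib.request
    import urllib.error

    UA = "Mozilla/5.0 (compatible; liveness-probe)"  # hypothetical probe User-Agent

    def looks_alive(domain: str, timeout: float = 10.0) -> bool:
        for host in (domain, f"www.{domain}"):
            for scheme in ("https", "http"):
                req = urllib.request.Request(
                    f"{scheme}://{host}/",
                    headers={"User-Agent": UA},
                    method="HEAD",
                )
                try:
                    with urllib.request.urlopen(req, timeout=timeout):
                        return True
                except urllib.error.HTTPError:
                    # Any 4xx/5xx still proves a web server answered.
                    return True
                except (urllib.error.URLError, OSError):
                    continue  # DNS failure, refused/dropped connection, timeout...
        return False

    if __name__ == "__main__":
        print(looks_alive("example.com"))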

If you want to probe whether domains are still active:

• confirm they're still registered via a `whois`-like lookup

• examine their DNS records for evidence of current services

• ping them, or any DNS-evident subdomains

• if there are any MX records, check whether the related SMTP server will confirm a likely email address (like postmaster@) as deliverable – see the sketch after this list. (But: don't send an actual email message.)

• (more at risk of being perceived as aggressive) scan any extant domains (from DNS) for open ports running any popular (not just HTTP) services
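
A minimal sketch of the DNS and MX/SMTP checks above, assuming the third-party `dnspython` package is installed; the record types checked and the `postmaster@` probe address are illustrative, and the SMTP session stops before DATA, so no actual message is ever sent:

    import smtplib
    import dns.resolver  # pip install dnspython (2.x)

    def dns_evidence(domain: str) -> dict:
        """Collect DNS records that suggest current services on the domain."""
        evidence = {}
        for rtype in ("A", "AAAA", "MX", "NS", "TXT"):
            try:
                answers = dns.resolver.resolve(domain, rtype)
                evidence[rtype] = [str(r) for r in answers]
            except Exception:
                evidence[rtype] = []
        return evidence

    def postmaster_deliverable(domain: str) -> bool:
        """Ask the lowest-preference MX host whether postmaster@domain is
        accepted at RCPT time, without ever sending message data."""
        try:
            mx_records = sorted(
                dns.resolver.resolve(domain, "MX"),
                key=lambda r: r.preference,
            )
            mx_host = str(mx_records[0].exchange).rstrip(".")
            with smtplib.SMTP(mx_host, 25, timeout=15) as smtp:
                smtp.helo("probe.example")        # hypothetical HELO name
                smtp.mail("probe@probe.example")  # hypothetical envelope sender
                code, _ = smtp.rcpt(f"postmaster@{domain}")
                return 200 <= code < 300
        except Exception:
            return False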

If you want to probe whether web sites are still active, start with an actual list of web site URLs that were known to have been active at some point.



> The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP

Majestic promotes their list as the "top 1 million websites of the world", not domains. You would thus expect that every entry in their list is (was?) a website that responds to HTTP.

> `subdomain.www.domain`

Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.

> Many sites may block `curl` requests because they only want attended human browser traffic,

Citation needed, because if you do this, you'll also cut yourself off from every search engine in existence.

And for kicks, I'll add one reason why the 900k valid sites is almost certainly an overestimate: the check can't tell an actual website apart from a blank domain-parking page.


> Majestic promotes their list as the "top 1 million websites of the world"

Well, the source URL provided by the article's author initially claims, “The million domains we find with the most referring subnets”, and then makes a contradictory comment mentioning ‘websites’. At best we can say Majestic is vague and/or confused about what they’re providing – but given the author’s results, I suspect this is a list of domains, with no guarantee Majestic ever saw a live HTTP service on them.

> Citation needed, because if you do this, you'll also cut yourself off from every search engine in existence.

How about I cite HN user ~gojomo, who for nearly a decade wrote & managed web-crawling software for the Internet Archive? He says: “Sites that don’t want to be crawled use every tactic you can imagine to repel unwanted crawlers, including unceremoniously instant-dropping open connections from disfavored IPs and User-Agents. Sadly, given Google’s dominance, many give a free pass to only Google IPs & User-Agents, and maybe a few other search-engines.”


> Citation needed, because if you do this, you'll also cut yourself off from every search engine in existence.

Most major search engines have dedicated blocks of addresses and use unique user-agents. If you just send literal wget or curl requests, you will be identified as a "bad" crawler almost immediately.
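
To illustrate why the dedicated address blocks matter (my sketch, not anything from the article): a site can check that a visitor claiming to be, say, Googlebot really comes from Google's address space with a reverse-DNS lookup plus forward confirmation – the verification method Google itself documents – so a spoofed User-Agent alone doesn't earn crawler privileges. The example IP below is illustrative:

    import socket

    def is_verified_googlebot(ip: str) -> bool:
        try:
            host, _, _ = socket.gethostbyaddr(ip)           # reverse DNS
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            forward_ips = socket.gethostbyname_ex(host)[2]  # forward-confirm
            return ip in forward_ips
        except OSError:
            return False

    if __name__ == "__main__":
        print(is_verified_googlebot("66.249.66.1"))  # illustrative Googlebot-range IP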


> Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.

We use stage.www.domain.tld for the staging/testing site, but that's about it ;)


looks like you’re not alone!

https://crt.sh/?q=stage.www.%25

(warning: will take a while to load)



It dawned on me what might be going on when I hit the Majestic query page [1] and saw the link to "Commission a bespoke Majestic Analytics report." They run a bot that scans the web, and (my opinion, no real evidence) they probably don't include sites that block the MJ12bot. This could explain why my site isn't in the list: I had some issues with their bot [2] and they blocked themselves from crawling my site.

So, is this a list of the actual top 1,000,000 sites? Or just the top 1,000,000 sites they crawl?

[1] https://majestic.com/reports/majestic-million

[2] http://boston.conman.org/2019/07/09-12


As near as I can tell, these are the top 1,000,000 domains referred to by other websites they crawled.

The report is described as "The million domains we find with the most referring subnets"[1], and a referring subnet is (roughly) a subnet containing at least one host with a webpage that links to the domain.

So, to the grandparent: presumably, if something is linking to these domains, they probably were meant to be websites.

[1] https://majestic.com/reports/majestic-million

[2] https://majestic.com/help/glossary#RefSubnets, https://majestic.com/help/glossary#RefIPs and also https://majestic.com/help/glossary#Csubnet



