
Many issues with this analysis, some already mentioned by others, including:

• The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP

• In many cases any responding web server will be on the `www.` subdomain, rather than the bare domain that was listed/probed – & not everyone sets up `www.` to respond/redirect. (The author misinterprets appearances of both `www.domain` and `domain` in his source list as errant duplicates, when in fact the `www.domain` entries may indicate those sites also have significant `subdomain.www.domain` hostnames – depending on what Majestic means by 'subnets'.)

• Many sites may block `curl` requests because they only want attended human browser traffic, and such blocking (while usually accompanied by some error response) can take the form of a more aggressive connection drop.

• `curl` given a naked hostname attempts a plain-HTTP connection, and given that even browsers now auto-prefix `https://` for a naked hostname, some active sites likely have nothing listening on the plain-HTTP port anymore. (A more forgiving probe is sketched after this list.)

• The author's burst of activity could've triggered other rate-limits/failures – either at shared hosts/inbound proxies servicing many of the target domains, or at local ISP egresses or DNS services. He'd need to drill down into individual failures to get a better idea of the extent to which this might be happening.
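
For concreteness, a minimal sketch (mine, not the author's method) of a more forgiving liveness probe that addresses the last three points: it tries `https://` before `http://`, tries the `www.` host as well as the bare domain, sends a browser-like User-Agent, and treats any HTTP status – even an error – as proof that something is serving. The User-Agent string and timeout are illustrative choices:

    # More forgiving "is anything serving HTTP here?" probe than a bare `curl domain`.
    import urllib.request
    import urllib.error

    UA = "Mozilla/5.0 (compatible; liveness-probe)"  # hypothetical probe User-Agent

    def looks_alive(domain: str, timeout: float = 10.0) -> bool:
        for host in (domain, f"www.{domain}"):
            for scheme in ("https", "http"):
                req = urllib.request.Request(
                    f"{scheme}://{host}/",
                    headers={"User-Agent": UA},
                    method="HEAD",
                )
                try:
                    with urllib.request.urlopen(req, timeout=timeout):
                        return True
                except urllib.error.HTTPError:
                    # Any 4xx/5xx still proves a web server answered.
                    return True
                except (urllib.error.URLError, OSError):
                    continue  # DNS failure, refused/dropped connection, timeout...
        return False

    if __name__ == "__main__":
        print(looks_alive("example.com"))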

If you want to probe whether domains are still active:

• confirm they're still registered via a `whois`-like lookup

• examine their DNS records for evidence of current services

• ping them, or any DNS-evident subdomains

• if there are any MX records, check whether the related SMTP server will confirm a likely email address (like postmaster@) as deliverable – see the sketch after this list. (But: don't send an actual email message.)

• (more at risk of being perceived as aggressive) scan any extant domains (from DNS) for open ports running any popular (not just HTTP) services
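
A minimal sketch of the DNS and MX/SMTP checks above, assuming the third-party `dnspython` package is installed; the record types checked and the `postmaster@` probe address are illustrative, and the SMTP session stops before DATA, so no actual message is ever sent:

    import smtplib
    import dns.resolver  # pip install dnspython (2.x)

    def dns_evidence(domain: str) -> dict:
        """Collect DNS records that suggest current services on the domain."""
        evidence = {}
        for rtype in ("A", "AAAA", "MX", "NS", "TXT"):
            try:
                answers = dns.resolver.resolve(domain, rtype)
                evidence[rtype] = [str(r) for r in answers]
            except Exception:
                evidence[rtype] = []
        return evidence

    def postmaster_deliverable(domain: str) -> bool:
        """Ask the lowest-preference MX host whether postmaster@domain is
        accepted at RCPT time, without ever sending message data."""
        try:
            mx_records = sorted(
                dns.resolver.resolve(domain, "MX"),
                key=lambda r: r.preference,
            )
            mx_host = str(mx_records[0].exchange).rstrip(".")
            with smtplib.SMTP(mx_host, 25, timeout=15) as smtp:
                smtp.helo("probe.example")        # hypothetical HELO name
                smtp.mail("probe@probe.example")  # hypothetical envelope sender
                code, _ = smtp.rcpt(f"postmaster@{domain}")
                return 200 <= code < 300
        except Exception:
            return False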

If you want to probe whether web sites are still active, start with an actual list of web site URLs that were known to have been active at some point.



> The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP

Majestic promotes their list as the "top 1 million websites of the world", not domains. You would thus expect that every entry in their list is (was?) a website that responds to HTTP.

> `subdomain.www.domain`

Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.

> Many sites may block `curl` requests because they only want attended human browser traffic,

Citation needed, because if you do this, you'll also cut yourself off from every search engine in existence.

And for kicks, I'll add one reason why the 900k valid sites is almost certainly an overestimate: the check can't tell an actual website apart from a blank domain-parking page.


> Majestic promotes their list as the "top 1 million websites of the world"

Well, the source URL provided by the article's author initially claims, “The million domains we find with the most referring subnets”, and then makes a contradictory comment mentioning ‘websites’. At best we can say Majestic is vague and/or confused about what they’re providing – but given the author’s results, I suspect this is a list of domains, with no guarantee Majestic ever saw a live HTTP service on them.

> Citation needed, because if you do this, you'll also cut yourself off from every search engine in existence.

How about I cite HN user ~gojomo, who for nearly a decade wrote & managed web-crawling software for the Internet Archive? He says: “Sites that don’t want to be crawled use every tactic you can imagine to repel unwanted crawlers, including unceremoniously instant-dropping open connections from disfavored IPs and User-Agents. Sadly, given Google’s dominance, many give a free pass to only Google IPs & User-Agents, and maybe a few other search-engines.”


> Citation needed, because if you do this, you'll also cut yourself off from every search engine in existence.

Most major search engines have dedicated blocks of addresses and use unique user-agents. If you just send literal wget or curl requests, you will be identified as a "bad" crawler almost immediately.
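
To illustrate why the dedicated address blocks matter (my sketch, not anything from the article): a site can check that a visitor claiming to be, say, Googlebot really comes from Google's address space with a reverse-DNS lookup plus forward confirmation – the verification method Google itself documents – so a spoofed User-Agent alone doesn't earn crawler privileges. The example IP below is illustrative:

    import socket

    def is_verified_googlebot(ip: str) -> bool:
        try:
            host, _, _ = socket.gethostbyaddr(ip)           # reverse DNS
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            forward_ips = socket.gethostbyname_ex(host)[2]  # forward-confirm
            return ip in forward_ips
        except OSError:
            return False

    if __name__ == "__main__":
        print(is_verified_googlebot("66.249.66.1"))  # illustrative Googlebot-range IP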


> Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.

We use stage.www.domain.tld for the staging/testing site, but that's about it ;)


looks like you’re not alone!

https://crt.sh/?q=stage.www.%25

(warning: will take a while to load)



It dawned on me what might be going on when I hit the Majestic query page [1] and saw the link to "Commission a bespoke Majestic Analytics report." They run a bot that scans the web, and (my opinion, no real evidence) they probably don't include sites that block the MJ12bot. This could explain why my site isn't in the list: I had some issues with their bot [2] and they blocked themselves from crawling my site.

So, is this a list of the actual top 1,000,000 sites? Or just the top 1,000,000 sites they crawl?

[1] https://majestic.com/reports/majestic-million

[2] http://boston.conman.org/2019/07/09-12


As near as I can tell, these are the top 1,000,000 domains referred to by other websites they crawled.

The report is described as "The million domains we find with the most referring subnets"[1], and a referring subnet is (roughly) a subnet containing at least one host with a webpage that links to the domain.

So, to the grandparent: presumably, if something is linking to these domains, they probably were meant to be websites.

[1] https://majestic.com/reports/majestic-million

[2] https://majestic.com/help/glossary#RefSubnets, https://majestic.com/help/glossary#RefIPs and also https://majestic.com/help/glossary#Csubnet



