How do you know if they're malicious if you don't make HTTP requests to them?
One of the things that phishers and others do is use link wrapping and similar services to hide malicious links. So I get something.wordpress.com/something-clean, then put an HTML or JS redirect on that page to something malicious. Since browsers don't warn about HTTP, HTML, or JS redirects, it's an easy way for scammers to get around a list of malicious pages.
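A minimal sketch of that redirect trick, with made-up hostnames: the page's visible content looks clean, but both a meta refresh and a one-line script bounce any real visitor straight to the payload, and neither triggers a browser warning.

```python
# Sketch of a "clean" page that immediately redirects visitors.
# "evil.example" and the wordpress path are hypothetical placeholders.

def build_cloaked_page(target: str) -> str:
    """Return HTML whose visible body is harmless but which forwards
    the browser to `target` via meta refresh and JavaScript."""
    return f"""<!doctype html>
<html>
  <head>
    <meta http-equiv="refresh" content="0; url={target}">
    <script>window.location.replace("{target}");</script>
  </head>
  <body>Just a harmless blog post.</body>
</html>"""

page = build_cloaked_page("https://evil.example/login")
```

A scanner that only fetches and inspects the wordpress.com URL's content sees an ordinary page; only a client that actually follows the redirect reaches the malicious destination.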
These kinds of attacks are very common in the email space.
But in this case, that doesn't help at all, because Facebook's crawler uses a predictable user-agent string. You give a clean result to the Facebook crawler and a malicious result to everyone else.
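A hypothetical sketch of that cloaking in Python: `facebookexternalhit` is the token Facebook's link crawler actually sends, but the page contents and matching logic here are invented.

```python
# User-agent cloaking: clean content for Facebook's crawler,
# a malicious redirect for everyone else. Both pages are made up.

CLEAN_PAGE = "<html><body>Nothing to see here.</body></html>"
MALICIOUS_PAGE = (
    '<meta http-equiv="refresh" content="0; url=https://evil.example/">'
)

def respond(user_agent: str) -> str:
    """Pick a response body based on the visitor's User-Agent header."""
    # Facebook's crawler identifies itself predictably, e.g.
    # "facebookexternalhit/1.1", so the scammer just matches on it.
    if "facebookexternalhit" in user_agent.lower():
        return CLEAN_PAGE
    return MALICIOUS_PAGE
```

Any reputation check that crawls from a known UA (or a known IP range) sees only `CLEAN_PAGE`, which is the whole point of the trick.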
Not always. Facebook masks its UA and IPs when checking ad content to uncover cloakers, so it's within their codebase to do this. Not sure why they're not using it here.
Could be collecting the links so that if a user blocks the sender after opening the PDF, and this is done at scale, they can infer it was one of the links and start blocking them?
Or link support requests to people who received a certain link via message.
So basically data mining to feed a model that takes future actions into consideration.
You (Facebook in this case) run a hypothetical method SafeSearch('accounts.example.com') and also SafeSearch('example.com') and SafeSearch('accounts.example.com/tmp') and SafeSearch('accounts.example.com/tmp/badmojo.exe')
SafeSearch(string) is defined as: you do SHA-256(string) and that's your hash, and you compare the start of this hash to a huge list of prefixes that Google provides, which you fetch updates for every few minutes. If there's no match, fine, done. If there's a match, you ask Google: OK, I saw this prefix you sent me, what hashes should I be scared of? Google gives you a list of full hashes with that prefix. If your hash is in this new list, the original URL was scary, so warn users not to visit; otherwise continue what you were doing.
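A toy version of that two-step lookup (real Safe Browsing clients canonicalize the URL first and talk to Google's v4 API; the prefix list and the "server-side" hash set here are stand-ins):

```python
# Toy Safe Browsing-style lookup: local prefix match first,
# then a full-hash check that would normally go to Google's servers.
import hashlib

PREFIX_LEN = 4  # real prefixes are 4-32 bytes, most commonly 4

def full_hash(expr: str) -> bytes:
    return hashlib.sha256(expr.encode()).digest()

# Stand-in for the local database of prefixes fetched from Google.
bad_full = full_hash("accounts.example.com/tmp/badmojo.exe")
local_prefixes = {bad_full[:PREFIX_LEN]}

# Stand-in for the full-hash list the server returns for a prefix hit.
server_full_hashes = {bad_full}

def is_scary(expr: str) -> bool:
    h = full_hash(expr)
    if h[:PREFIX_LEN] not in local_prefixes:
        return False  # no prefix match: done, URL is fine
    # Prefix hit: "ask the server" for full hashes with that prefix
    # and only warn if our full hash is among them.
    return h in server_full_hashes
```

The point of the prefix step is privacy and bandwidth: the client never sends Google the URL itself, only a short hash prefix, and most lookups never leave the device.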
Sure, but this will only work for previously-known threats – for which someone else, presumably Google, has already done the request, analysis, and determination.
I doubt Facebook only wants to detect old threats, reliant on a competitor's standards & practices.
I can totally understand scanning a PDF for links to look for malicious links to protect users.
But that wouldn't involve actual HTTP requests to them.
I'm struggling to imagine what purpose this could have.