How do you know if they're malicious if you don't make HTTP requests to them?
One of the things that phishers and others do is use link wrapping and similar services to hide malicious links. So I get something.wordpress.com/something-clean, then put an HTML or JS redirect on that page to something malicious. Since browsers don't warn about HTTP, HTML, or JS redirects, it's an easy way for scammers to get around a list of malicious pages.
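A minimal sketch of that redirect trick, with made-up hostnames: the page's visible content looks clean, but both a meta refresh and a one-line script bounce any real visitor straight to the payload, and neither triggers a browser warning.

```python
# Sketch of a "clean" page that immediately redirects visitors.
# "evil.example" and the wordpress path are hypothetical placeholders.

def build_cloaked_page(target: str) -> str:
    """Return HTML whose visible body is harmless but which forwards
    the browser to `target` via meta refresh and JavaScript."""
    return f"""<!doctype html>
<html>
  <head>
    <meta http-equiv="refresh" content="0; url={target}">
    <script>window.location.replace("{target}");</script>
  </head>
  <body>Just a harmless blog post.</body>
</html>"""

page = build_cloaked_page("https://evil.example/login")
```

A scanner that only fetches and inspects the wordpress.com URL's content sees an ordinary page; only a client that actually follows the redirect reaches the malicious destination.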
These kinds of attacks are very common in the email space.
But in this case, that doesn't help at all, because Facebook's crawler uses a predictable user-agent string. You give a clean result to the Facebook crawler and a malicious result to everyone else.
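A hypothetical sketch of that cloaking in Python: `facebookexternalhit` is the token Facebook's link crawler actually sends, but the page contents and matching logic here are invented.

```python
# User-agent cloaking: clean content for Facebook's crawler,
# a malicious redirect for everyone else. Both pages are made up.

CLEAN_PAGE = "<html><body>Nothing to see here.</body></html>"
MALICIOUS_PAGE = (
    '<meta http-equiv="refresh" content="0; url=https://evil.example/">'
)

def respond(user_agent: str) -> str:
    """Pick a response body based on the visitor's User-Agent header."""
    # Facebook's crawler identifies itself predictably, e.g.
    # "facebookexternalhit/1.1", so the scammer just matches on it.
    if "facebookexternalhit" in user_agent.lower():
        return CLEAN_PAGE
    return MALICIOUS_PAGE
```

Any reputation check that crawls from a known UA (or a known IP range) sees only `CLEAN_PAGE`, which is the whole point of the trick.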
Not always. Facebook masks its UA and IPs when checking ad content to uncover cloakers, so it's within their codebase to do this. Not sure why they're not using it here.
Could be collecting the links so that if a user blocks the sender after opening the PDF, and this is done at scale, they can infer it was one of the links and start blocking them?
Or link support requests to people who received a certain link via message.
So basically data mining to feed a model that takes future actions into consideration.
You (Facebook in this case) run a hypothetical method SafeSearch('accounts.example.com') and also SafeSearch('example.com') and SafeSearch('accounts.example.com/tmp') and SafeSearch('accounts.example.com/tmp/badmojo.exe')
SafeSearch(string) is defined as: you do SHA-256(string) and that's your hash, and you compare the start of this hash to a huge list of prefixes that Google provides, which you fetch updates for every few minutes. If there's no match, fine, done. If there's a match, you ask Google: OK, I saw this prefix you sent me, what hashes should I be scared of? Google gives you a list of full hashes with that prefix. If your hash is in this new list, the original URL was scary, so warn users not to visit; otherwise continue what you were doing.
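A toy version of that two-step lookup (real Safe Browsing clients canonicalize the URL first and talk to Google's v4 API; the prefix list and the "server-side" hash set here are stand-ins):

```python
# Toy Safe Browsing-style lookup: local prefix match first,
# then a full-hash check that would normally go to Google's servers.
import hashlib

PREFIX_LEN = 4  # real prefixes are 4-32 bytes, most commonly 4

def full_hash(expr: str) -> bytes:
    return hashlib.sha256(expr.encode()).digest()

# Stand-in for the local database of prefixes fetched from Google.
bad_full = full_hash("accounts.example.com/tmp/badmojo.exe")
local_prefixes = {bad_full[:PREFIX_LEN]}

# Stand-in for the full-hash list the server returns for a prefix hit.
server_full_hashes = {bad_full}

def is_scary(expr: str) -> bool:
    h = full_hash(expr)
    if h[:PREFIX_LEN] not in local_prefixes:
        return False  # no prefix match: done, URL is fine
    # Prefix hit: "ask the server" for full hashes with that prefix
    # and only warn if our full hash is among them.
    return h in server_full_hashes
```

The point of the prefix step is privacy and bandwidth: the client never sends Google the URL itself, only a short hash prefix, and most lookups never leave the device.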
Sure, but this will only work for previously-known threats – for which someone else, presumably Google, has already done the request, analysis, and determination.
I doubt Facebook only wants to detect old threats, reliant on a competitor's standards & practices.
I can totally understand scanning a PDF for links to look for malicious links to protect users.
But that wouldn't involve actual HTTP requests to them.
I'm struggling to imagine what purpose this could have.