Hacker News

Huh, but why?

I can totally understand scanning a PDF for links to look for malicious links to protect users.

But that wouldn't involve actual HTTP requests to them.

I'm struggling to imagine what purpose this could have.



How do you know if they're malicious if you don't make HTTP requests to them?

One of the things that phishers and others do is use link wrapping and other services to hide malicious links. So, I get something.wordpress.com/something-clean. I then put in an HTML or JS redirect on that page to something malicious. Given that browsers don't warn about HTTP, HTML, or JS redirects, it's an easy way for scammers to get around a list of malicious pages.
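The meta-refresh half of that trick can be caught with a plain parse, no browser needed. A minimal sketch in Python, assuming the scanner only has to handle `<meta http-equiv="refresh">` redirects — real pipelines also render the page to catch JS redirects, and the regex and function name here are mine, not any vendor's actual logic:

```python
import re

# Matches <meta http-equiv="refresh" content="0;url=..."> variants.
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*?url=([^"\'>\s]+)',
    re.IGNORECASE)

def find_meta_refresh(html):
    """Return the target of an HTML meta-refresh redirect, or None.
    Sketch only: a real scanner would also execute JS to catch
    location.href-style redirects."""
    m = META_REFRESH.search(html)
    return m.group(1) if m else None
```

A scanner would fetch the "clean" wordpress.com page, run this, and keep following targets until it reaches the real destination.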

These kinds of attacks are very common in the email space.


But in this case, that doesn't help at all, because Facebook's crawler uses a predictable user-agent string. You serve a clean result to the Facebook crawler and a malicious one to everyone else.


There are services that crawl for you from multiple IPs and user agents, just for situations like this.


That is a very good point. Security crawlers should probably use a masked user-agent.


I'm fairly sure Google's search crawler already uses a masked UA, to detect when pages serve it different content than they do to users.


Not always. It masks UAs and IPs when checking ad content to uncover cloakers, so it's within their codebase to do this. Not sure why they're not using it here.
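The comparison step of that kind of cloaking check is easy to sketch: fetch the same URL once with the declared crawler UA and once with a masked browser UA, then flag the page if the two responses differ substantially. The similarity metric and threshold below are illustrative, not any vendor's actual logic:

```python
from difflib import SequenceMatcher

def looks_cloaked(crawler_body, browser_body, threshold=0.5):
    """Flag a page whose response to a declared crawler user-agent
    differs substantially from its response to a masked browser
    user-agent. Toy heuristic: ratio() of the two bodies."""
    similarity = SequenceMatcher(None, crawler_body, browser_body).ratio()
    return similarity < threshold
```

In practice you'd fetch twice, ideally from different IPs, with the two user-agent strings and feed both response bodies in.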


>How do you know if they're malicious if you don't make HTTP requests to them?

Look-alike domains are a phishing vector that doesn't require you to make an HTTP request.
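For example, a purely string-based check can flag look-alike domains without touching the network. A sketch, where the brand list, threshold, and function name are all made up for illustration:

```python
from difflib import SequenceMatcher

# Hypothetical brand list -- a real system would use a much larger one
# and also normalize homoglyphs (e.g. Cyrillic lookalikes) and punycode.
KNOWN_BRANDS = ["facebook.com", "paypal.com", "google.com"]

def lookalike(domain, threshold=0.8):
    """Return the brand this domain most resembles if the string
    similarity crosses `threshold`, else None. No HTTP request needed."""
    best = max(KNOWN_BRANDS, key=lambda b: SequenceMatcher(None, domain, b).ratio())
    score = SequenceMatcher(None, domain, best).ratio()
    return best if score >= threshold else None
```

So `faceb00k.com` gets flagged against `facebook.com` from the string alone.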


The malicious links could be camouflaged behind a redirect.


Could they be collecting the links so that, if a user blocks the sender after opening the PDF, and this is done at scale, they can infer which link was responsible and start blocking it?

Or to link support requests to people who received a certain link via message.

So basically data mining to feed a model that takes future actions into account.


Probably anti-spam, particularly to catch groups of fake accounts sending the same or similar PDF.


How do you check whether a link is serving up something terrible without making HTTP requests to it?


You _could_ ask a service like Google Safe Browsing.

Just in case you didn't follow any of the previous HN discussion of how that's done:

consider the URL https://accounts.example.com/tmp/badmojo.exe

You (Facebook in this case) run a hypothetical method SafeSearch('accounts.example.com') and also SafeSearch('example.com') and SafeSearch('accounts.example.com/tmp') and SafeSearch('accounts.example.com/tmp/badmojo.exe')

SafeSearch(string) is defined as: you compute SHA(string) and that's your hash. You compare the start of this hash to a huge list of prefixes that Google provides, which you fetch updates for every few minutes. If there's no match, fine, done. If there's a match, you ask Google: OK, I saw this prefix you sent me, what hashes should I be scared of? Google gives you a list of hashes with that prefix. If your hash is in this new list, the original URL was scary, so warn users not to visit; otherwise continue what you were doing.
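That flow can be sketched in a few lines. Everything here is simplified relative to the real Safe Browsing spec (which canonicalizes URLs and can generate up to 30 host/path expressions), and `fetch_full_hashes` is a hypothetical stand-in for the network call back to the server:

```python
import hashlib

def expressions(host, path):
    """Host/path combinations to hash and look up, per the scheme above.
    Simplified relative to the real spec."""
    hosts = [host]
    parts = host.split(".")
    if len(parts) > 2:
        hosts.append(".".join(parts[-2:]))  # accounts.example.com -> example.com
    paths, prefix = [""], ""
    for seg in path.strip("/").split("/"):
        if seg:
            prefix += "/" + seg
            paths.append(prefix)  # /tmp, then /tmp/badmojo.exe, ...
    return [h + p for h in hosts for p in paths]

def is_scary(host, path, local_prefixes, fetch_full_hashes, prefix_len=4):
    """Hash each expression; on a local prefix hit, ask the server (the
    hypothetical fetch_full_hashes callback) for the full hashes and
    only then declare the URL scary."""
    for expr in expressions(host, path):
        digest = hashlib.sha256(expr.encode()).digest()
        if digest[:prefix_len] in local_prefixes:
            if digest in fetch_full_hashes(digest[:prefix_len]):
                return True
    return False
```

The nice property is privacy plus bandwidth: the full URL never leaves your machine unless a hash prefix already matched, and the local list is just short prefixes.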


Sure, but this will only work for previously-known threats – for which someone else, presumably Google, has already done the request, analysis, and determination.

I doubt Facebook only wants to detect old threats, reliant on a competitor's standards & practices.


The obvious argument is that they need to scan linked pages for malware and can't rely on a whitelist/blacklist alone.

I'm sure if they're pulling data to do this analysis, it's not the only analysis they're doing.




