Finding malware & spam in Twitter

Some months ago I had the feeling that social networks like Facebook or Twitter would probably be a good source of both malware & spam, so I decided to write a quick honeypot to try to catch as much malware & spam as possible as soon as it becomes available on the internet. You can find the daily updated files of URLs gathered by this tool here, under the section “Malicious URLs”.

Finding the URLs

My idea was as simple as this: find messages (comments, tweets, etc.) with links, analyse the URLs found in them to try to determine whether they appear to be malicious and, also, download whatever files the page links to and analyse them too. Unfortunately, Facebook’s API doesn’t let me search for links in all comments made by the general public (or I wasn’t able to find such a method), but Twitter’s API does. It’s as simple as searching for “filter:links” or “http://”. However, the number of messages and URLs to process is so big, and the domains found in the messages are so weird, that you cannot even guess until you start… And if you use not only Twitter but also other compatible systems like identi.ca, it becomes even worse. Too much data to analyse…
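As a rough illustration of the "find the links in each message" step, here is a minimal Python sketch. The regular expression and the `extract_urls` name are my own assumptions for the example, not the code the honeypot actually uses, and the pattern is deliberately loose.

```python
import re

# Deliberately loose pattern (an assumption for this sketch): anything that
# starts with http:// or https:// and runs up to the next whitespace.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def extract_urls(message):
    """Return the distinct URLs found in a message, in order of appearance."""
    urls = []
    for match in URL_RE.findall(message):
        # Drop punctuation that usually belongs to the sentence, not the URL.
        url = match.rstrip(".,;:!?)\"'")
        if url not in urls:
            urls.append(url)
    return urls
```

Each URL returned this way would then be resolved (in case of URL shorteners) and analysed.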

Minimizing the number of URLs to analyse

The next step in my experiment was to find a list of interesting search terms to find malware and reduce the number of URLs to analyse, as searching all the messages written on Twitter and identi.ca and analysing all the URLs found in them was too much work for my home machines. Finding good search terms for this purpose is not that easy, really. However, some typical searches are still useful 🙂

download exe filter:links
download free filter:links
exe filter:links
free porn download filter:links
download apk filter:links
download fast filter:links

So, I added those search terms to my engine and left it running for 2 days. The results were somewhat better than collecting all the URLs but still not very good. The real change to this system came as soon as I started using the trending topics. That is: find all messages written with some hashtag, adding the ‘filter:links’ magic search word. Typically, around 80% of the links found in such messages are spam or malware. And, not surprisingly, those messages are always written by accounts with female names, as in the following picture:

Those are all spam messages, and all of them are written by users with female names. And, something curious: the hashtag is in one language (Spanish) but the message is always in English, no matter what language the trending topic is in.
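Putting the fixed search terms and the trending topics together could look something like the sketch below. It only builds the query strings; fetching the trending hashtags and sending the queries to the search API is left out, and `build_queries` is a hypothetical helper name.

```python
def build_queries(terms, hashtags):
    """Build search strings restricted to messages that contain links."""
    queries = []
    for term in terms:
        queries.append(f"{term.strip()} filter:links")
    for tag in hashtags:
        tag = tag.strip()
        # Accept hashtags given with or without the leading '#'.
        if not tag.startswith("#"):
            tag = "#" + tag
        queries.append(f"{tag} filter:links")
    return queries
```

For example, `build_queries(["download exe"], ["gol"])` gives one query for the fixed term and one for the hashtag, both ending in `filter:links`.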

URL Scoring Rules

To try to minimize the number of false positives I had to do many checks, as not all the URLs found in social networks (like Twitter) are bad (or good). My way of determining whether a URL seems to be good or bad was to create a scoring system where I consider the following variables:

Using Alexa’s Top 1 Million database: if the domain pointed to (the resolved one, in case a URL shortener is used) is in that list, consider it not bad by default and subtract one point.
If the domain is not in Alexa’s Top 1 Million, add one point, as the domain is unknown.
If the file it points to starts with “MZ”, “\x7fELF” or “\xCA\xFE\xBA\xBE” (the magic bytes of Windows executables, ELF binaries, and Mach-O universal binaries or Java class files), add one point (direct linking to executables is very rare, in any case).
If the domain seems to be obfuscated, add one point. If there are more than 4 consecutive vowels or more than 6 consecutive consonants, I consider it obfuscated. Also, if the domain has 8 or more characters and contains either no vowels or no consonants, I consider it obfuscated. It seems to work well enough in general.
If the domain sounds too similar to a domain in Alexa’s Top 1 Million, add one point, as it can be a trick using typos (e.g., writing gugle instead of google). Implementing it is as easy as using Soundex.
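The individual checks above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the honeypot's actual code: all function names are made up for the example, “y” is counted as a consonant, only the label right before the TLD is inspected, and `soundex` is a plain implementation of the classic American Soundex.

```python
import re
import string

# Magic bytes named in the rules above.
EXECUTABLE_MAGICS = (b"MZ", b"\x7fELF", b"\xca\xfe\xba\xbe")
VOWELS = "aeiou"
# Assumption: every other ASCII letter, including "y", is a consonant.
CONSONANTS = "".join(c for c in string.ascii_lowercase if c not in VOWELS)

SOUNDEX_CODES = {}
for digit, group in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in group:
        SOUNDEX_CODES[letter] = str(digit)


def looks_like_executable(data):
    """True if the downloaded bytes start with one of the magics."""
    return data.startswith(EXECUTABLE_MAGICS)


def main_label(domain):
    """Assumption: only look at the label before the TLD ('google' in 'www.google.com')."""
    labels = domain.lower().strip(".").split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]


def looks_obfuscated(domain):
    name = main_label(domain)
    if re.search(r"[aeiou]{5,}", name):             # more than 4 consecutive vowels
        return True
    if re.search(r"[b-df-hj-np-tv-z]{7,}", name):   # more than 6 consecutive consonants
        return True
    has_vowel = any(c in VOWELS for c in name)
    has_consonant = any(c in CONSONANTS for c in name)
    return len(name) >= 8 and not (has_vowel and has_consonant)


def soundex(word):
    """Classic 4-character American Soundex code, or '' if there are no letters."""
    letters = [c for c in word.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        code = SOUNDEX_CODES.get(c, "")
        if code and code != previous:
            result += code
        if c not in "hw":  # 'h' and 'w' do not separate equal codes
            previous = code
    return (result + "000")[:4]


def sounds_like_known(domain, known_domains):
    """True if the domain is not in the list but shares a Soundex code with an entry."""
    name = main_label(domain)
    code = soundex(name)
    return bool(code) and any(
        main_label(known) != name and soundex(main_label(known)) == code
        for known in known_domains
    )
```

With these definitions, `soundex("gugle")` and `soundex("google")` produce the same code, so `sounds_like_known("gugle.com", ["google.com"])` is true. Against a list of a million domains one would probably precompute the codes once, and since Soundex codes are very short, collisions with legitimate names should be expected.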

Some other heuristics I was thinking of adding were the following:

If the user is a woman, add one point. I’m not a male chauvinist but, well, the ‘bad guys’ always send spam, malware and other kinds of bullshit using accounts with female names.
If the picture of the user (the avatar) is a naked person (well, actually a naked girl), it shows tits, or shows the ass, or something sexually related, add another point. Too hard for an experiment at home: it’s easy to identify this with a brain when looking at the pictures but extremely hard to code…
If the hashtag is in a language other than English, but the message is in English, add one point. Easy to do.

With those scoring rules, if a URL gets a score of 3 points or more, it can be considered ‘suspicious enough’. So, I wrote my code, tested it and left it running for some days, recording all the URLs, messages and users sending spam and linking to malware on Twitter. My results are the following…
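Before the results, here is a small Python sketch of how the points and the 3-point threshold fit together. It takes the outcome of each check as a plain boolean; the `UrlChecks` class and the function names are assumptions made for this example, and only the implemented rules (not the "other heuristics") are counted.

```python
from dataclasses import dataclass

SUSPICIOUS_THRESHOLD = 3  # "3 points or more"


@dataclass
class UrlChecks:
    """Outcome of each check for one resolved URL (hypothetical container)."""
    in_alexa_top: bool
    links_to_executable: bool
    domain_obfuscated: bool
    sounds_like_top_domain: bool


def score(checks):
    # Known domain: minus one point. Unknown domain: plus one point.
    points = -1 if checks.in_alexa_top else 1
    for flag in (
        checks.links_to_executable,
        checks.domain_obfuscated,
        checks.sounds_like_top_domain,
    ):
        if flag:
            points += 1
    return points


def is_suspicious(checks):
    return score(checks) >= SUSPICIOUS_THRESHOLD
```

So an unknown, obfuscated-looking domain that links directly to an executable reaches exactly 3 points, while the same file hosted on a domain from the Alexa list only gets 1.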

Results of the experiment

Unfortunately for me, it seems Twitter is not widely used to spread malware, as the number of actual malware samples I find using this honeypot is very, very low (about 3 samples per day, at most). However, the number of spam messages is really big. For example, spam accounts for around 80% of all the messages with links in trending topics (those are the results of my experiment at home). Almost all the spam messages, as I mentioned before, are ‘written’ by users with female names and sex-related pictures so, I guess, the spammers only want to catch ‘guys’. Many times, the very same spam message, both the text and the URL, is posted by many different users. For Twitter, it would be very easy to detect and remove those accounts. They could automate it very easily.
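To give an idea of how simple that last check could be, here is a hedged Python sketch that groups messages by text and URL and reports the ones posted by several different accounts. The `(user, text, url)` tuple layout, the function name and the `min_users` threshold are all assumptions for the example; I have no idea what Twitter does internally.

```python
from collections import defaultdict


def find_repeated_spam(messages, min_users=3):
    """Map each (text, url) pair posted by at least min_users accounts to those accounts.

    messages is an iterable of (user, text, url) tuples.
    """
    users_by_message = defaultdict(set)
    for user, text, url in messages:
        # Normalise case and whitespace so trivial variations still match.
        key = (" ".join(text.lower().split()), url)
        users_by_message[key].add(user)
    return {
        key: sorted(users)
        for key, users in users_by_message.items()
        if len(users) >= min_users
    }
```

Every account listed in the result posted the same text with the same URL as at least `min_users - 1` others, which is exactly the pattern I kept seeing.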

And, that’s all! It’s just an “I was tired one afternoon” experiment so, don’t expect too much 😉

PS: I will probably release the source code on Google Code or some other place like that in the near future. But I need to clean it up first.