Finding malware & spam in Twitter

Some months ago I had the feeling that social networks like Facebook or Twitter would probably be a good source of both malware & spam, so I decided to write a quick honeypot to try to catch as much malware & spam as possible as soon as it becomes available on the internet. You can find the daily updated files of URLs gathered by this tool here, under the section “Malicious URLs”.

Finding the URLs

My idea was simple: find messages (comments, tweets, etc.) with links, analyse the URLs found in them to try to determine whether they appear to be malicious and, also, download whatever files the page links to and analyse them too. Unfortunately, the Facebook API doesn’t let me search for links in all comments made by the general public (or I wasn’t able to find such a method), but Twitter’s API does. It’s as simple as searching for “filter:links” or “http://”. However, the number of messages and URLs to process is so big, and the domains found in the messages are so weird, that you cannot even guess until you start… And if you use not only Twitter but also other compatible systems, it becomes even worse. Too much data to analyse…
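As a rough illustration of the first step, here is a minimal sketch that pulls the links out of a batch of message texts. It assumes the messages have already been fetched (for example, with a “filter:links” search) and are plain strings. The function name and the regular expression are hypothetical and are not taken from the honeypot itself.

```python
import re

# Assumed pattern: an http:// or https:// link runs up to the next whitespace.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def extract_urls(messages):
    """Return the unique URLs found in an iterable of message texts, in order."""
    seen = set()
    urls = []
    for text in messages:
        for match in URL_RE.findall(text):
            # Drop punctuation that usually belongs to the sentence, not the link.
            url = match.rstrip(".,;:!?)\"'")
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

Deduplicating at this stage already helps a little with the volume problem, since the same link tends to show up in many messages.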

Minimizing the number of URLs to analyse

The next step in my experiment was to find a list of interesting search terms to find malware and reduce the number of URLs to analyse, as searching all the messages written on Twitter and analysing all the URLs found in them was too much work for my home machines. Finding good search terms for this purpose is not that easy, really. However, some typical searches are still useful 🙂

  1. download exe filter:links
  2. download free filter:links
  3. exe filter:links
  4. free porn download filter:links
  5. download apk filter:links
  6. download fast filter:links

So, I added those search terms to my engine and left it running for 2 days. The results were somewhat better than collecting all the URLs but still not very good. The real change to this system came as soon as I started using the trending topics. That is: find all messages written with some hashtag, adding the ‘filter:links’ magic search word. Typically, around 80% of the links found in such messages are spam or malware. And, not surprisingly, those messages are always written by accounts with female names, as in the following picture:

Those are all spam messages, and all of them are written by users with female names. And, something curious: the hashtag is in one language (Spanish) but the message is always in English, no matter what language the trending topic is in.
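Building the search strings for both approaches (fixed terms and trending hashtags) could look roughly like the sketch below. It assumes the list of trending hashtags has been obtained elsewhere; `build_queries` and its parameters are hypothetical names used only for illustration.

```python
def build_queries(terms, trending_hashtags, operator="filter:links"):
    """Combine fixed search terms and trending hashtags with the link filter."""
    queries = [f"{term} {operator}" for term in terms]
    for tag in trending_hashtags:
        # Accept hashtags with or without the leading '#'.
        if not tag.startswith("#"):
            tag = "#" + tag
        queries.append(f"{tag} {operator}")
    return queries
```

For example, `build_queries(["download exe"], ["SomeTrend"])` yields one query for the fixed term and one for `#SomeTrend`, each ending in `filter:links`.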

URL Scoring Rules

To try to minimize the number of false positives I had to do many checks, as not all the URLs found in social networks (like Twitter) are bad (or good). My way to determine whether a URL seems to be good or bad was to create a scoring system where I consider the following variables:

  1. Using Alexa’s Top 1 Million database: if the domain pointed to (the resolved one, in case a URL shortener is used) is in that list, consider it not bad by default and subtract one point.
  2. If the domain is not in Alexa’s Top 1 Million, add one point, as the domain is unknown.
  3. If the file the URL points to starts with “MZ”, “\x7fELF” or “\xCA\xFE\xBA\xBE” (the magic bytes of Windows executables, ELF binaries, and Java class or Mach-O universal files, respectively), add one point (direct linking to executables is very rare, in any case).
  4. If the domain seems to be obfuscated, add one point. If there are more than 4 consecutive vowels or more than 6 consecutive consonants, I consider it obfuscated. Also, if the domain has 8 or more characters and contains either no vowels or no consonants, I consider it obfuscated. It seems to work well enough in general.
  5. If the domain sounds too similar to a domain in Alexa’s Top 1 Million, add one point, as it can be a trick using typos (e.g., writing gugle instead of google). Implementing it is as easy as using soundex.
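Rules 3 to 5 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the honeypot's code: the function names are hypothetical, the caller is assumed to pass the domain name without its TLD (e.g. "gugle" for gugle.com), and `soundex` is a plain implementation of the standard American Soundex coding.

```python
import re

# Magic bytes listed in rule 3.
EXECUTABLE_MAGICS = (b"MZ", b"\x7fELF", b"\xca\xfe\xba\xbe")

SOUNDEX_CODES = {}
for digit, group in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                     ("4", "l"), ("5", "mn"), ("6", "r")):
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def is_executable(head: bytes) -> bool:
    """Rule 3: the first bytes of the downloaded file match a known magic."""
    return head.startswith(EXECUTABLE_MAGICS)


def looks_obfuscated(name: str) -> bool:
    """Rule 4: long vowel/consonant runs, or a long name lacking one of them."""
    name = name.lower()
    # More than 4 consecutive vowels or more than 6 consecutive consonants.
    if re.search(r"[aeiou]{5,}", name) or re.search(r"[b-df-hj-np-tv-z]{7,}", name):
        return True
    if len(name) >= 8:
        has_vowel = re.search(r"[aeiou]", name) is not None
        has_consonant = re.search(r"[b-df-hj-np-tv-z]", name) is not None
        return not (has_vowel and has_consonant)
    return False


def soundex(word: str) -> str:
    """Standard 4-character Soundex code; empty string if there are no letters."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        code = SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            result += code
        if c not in "hw":  # 'h' and 'w' do not separate equal codes
            prev = code
    return (result + "000")[:4]


def sounds_like_known(name, known_names):
    """Rule 5: not a known name itself, but shares a Soundex code with one."""
    if name in known_names:
        return False
    code = soundex(name)
    return any(soundex(known) == code for known in known_names)
```

With a real top-sites list, comparing against every entry on each call would be slow; precomputing a set of Soundex codes for the list once would be the obvious refinement.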

Some other heuristics I was thinking of adding were the following:

  1. If the account has a female name, add one point. I’m not a male chauvinist but, well, the ‘bad guys’ always send spam, malware and other kinds of bullshit using accounts with female names.
  2. If the picture of the user (the avatar) is a naked person (well, actually a naked girl), it shows tits, or shows the ass, or something sexually related, add another point. Too hard for an experiment at home: it’s easy to identify this with a brain when looking at the pictures but extremely hard to code…
  3. If the hashtag is in a language other than English but the message is in English, add one point. Easy to do.

With those scoring rules, if a URL gets a score of 3 points or more, it can be considered ‘suspicious enough’. So, I wrote my code, tested it and left it running for some days, recording all the URLs, messages and users sending spam and linking to malware on Twitter. My results are the following…
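Put together, the five implemented rules and the 3-point threshold amount to something like the following sketch. The `UrlFacts` container and the function names are hypothetical; each field is assumed to have been computed beforehand by the corresponding check.

```python
from dataclasses import dataclass

SUSPICIOUS_THRESHOLD = 3  # "3 points or more" in the text


@dataclass
class UrlFacts:
    in_top_list: bool       # resolved domain is in the top-sites list
    is_executable: bool     # linked file starts with an executable magic
    looks_obfuscated: bool  # domain matches the obfuscation heuristic
    sounds_like_top: bool   # domain sounds like a top-sites domain


def score(facts: UrlFacts) -> int:
    """Apply the point rules: -1 for a known domain, +1 for every other hit."""
    points = -1 if facts.in_top_list else 1
    points += int(facts.is_executable)
    points += int(facts.looks_obfuscated)
    points += int(facts.sounds_like_top)
    return points


def is_suspicious(facts: UrlFacts) -> bool:
    return score(facts) >= SUSPICIOUS_THRESHOLD
```

Note that under these rules a URL on an unknown domain needs two further hits to reach the threshold, while a URL on a listed domain can reach at most 2 points and so is never flagged.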

Results of the experiment

Unfortunately for me, it seems Twitter is not widely used to spread malware, as the number of actual malware samples I find using this honeypot is very, very low (about 3 samples per day, at most). However, the number of spam messages is really big. For example, spam accounts for around 80% of all the messages with links in trending topics (those are the results of my experiment at home). Almost all the spam messages, as I mentioned before, are ‘written’ by users with female names and sex-related pictures so, I guess, the spammers only want to catch ‘guys’. Many times, the very same spam message is posted by many different users: both the text and the URL. For Twitter, it would be very easy to detect and remove those accounts. They could automate it very easily.
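To show how simple that kind of detection could be, here is a minimal sketch that groups messages by their text and URL and reports the pairs posted by several different accounts. The input format (`(user, text, url)` tuples), the function name and the `min_users` cut-off are all assumptions made for the example.

```python
from collections import defaultdict


def find_repeated_spam(messages, min_users=3):
    """Return {(text, url): users} for messages posted by >= min_users accounts.

    `messages` is an iterable of (user, text, url) tuples.
    """
    users_by_key = defaultdict(set)
    for user, text, url in messages:
        # Normalise case and whitespace so trivial variations still match.
        key = (" ".join(text.lower().split()), url)
        users_by_key[key].add(user)
    return {key: users for key, users in users_by_key.items()
            if len(users) >= min_users}
```

Every account in a returned group posted the same text with the same link, which is exactly the pattern described above.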

And, that’s all! It’s just an “I was tired one afternoon” experiment so, don’t expect too much 😉

PS: I will probably release the source code in Google Code or some other place like this in the near future. But I need to clean it up.


6 thoughts on “Finding malware & spam in Twitter”

  1. Martin

    Seriously? You’re searching for “free porn”, then discover naked girl avatars, then decide there are increased chances it’s malware or spam for both “free porn” and naked avatars? Same remark for searching “download exe”, then again “exe”, and looking for executable files at the other end of the link. Don’t you think you’re counting the same parameters twice?

    What about analyzing a bunch of links first, links known as being spammy or pointing to malware, because they are blocked by known up to date anti-virus or anti-malware software, THEN trying to find the right parameters and weights to create a kind of heuristics?

  2. joxean Post author

    No, no. At first I started analysing all URLs found without finding anything special: just messages with links. It was only when I started searching for messages with URLs using the trending topics that I started discovering a lot of accounts with naked girl avatars. It’s obvious that you’re going to find sexually related content if you search for ‘porn’.

    >What about analyzing a bunch of links first, links known as being spammy or pointing to malware, because they are blocked by known up to date anti-virus or anti-malware software,
    >THEN trying to find the right parameters and weights to create a kind of heuristics?

    Can you explain this, please? I do not understand.

  3. Pingback: Enlaces de la SECmana – 137 | Desgobierno de Chile

  4. Martin

    As suggested by @kutyacic, any method to find malware on Twitter is useless. Since Twitter uses Google’s Safe Browsing API, you can only find malware that has not yet been identified that way.

    Twitter spam is another issue. Twitter is very bad at identifying spam. Or maybe I am the only one followed by new spam accounts every day? Having said that, the definition of spam depends on one’s point of view. In my opinion, on Twitter, spam includes any automated activity lacking intelligence, like automatically searching for hashtags to follow accounts in the hope they will follow back (sometimes called “aggressive following spam” or “follow back spam”). Other spammers try to promote affiliate links hidden with various link shorteners and using hundreds of fake accounts.

    Anyway, if I had to identify some heuristics about a group of data, first of all, I would grab a large random sample of data and flag it manually (the best would be to make two passes, in order to reduce the risk of false positives and false negatives, as well as to check human detection performance). Then, I would try to guess a list of parameters that might be useful to classify that data. Finally, I would try to find the weight of each parameter. Using an automatic, self-learning classifier might be interesting here (I’ve been told SVM is pretty interesting).

    Your parameters (keywords, avatars, etc.) are probably good. I am less convinced about their weights/scoring.

  5. joxean Post author

    Twitter is, indeed, very bad at identifying spam, if they bother at all. They could identify spammers and spam messages pretty easily right now. However, it seems they don’t bother.

    In any case, remember that this is a ‘project’ created one afternoon when I was bored. Nothing ‘too’ professional, to be honest. Just the implementation of an idea I had that wasn’t that good.

  6. Pingback: Malware URLs « Unintended Results
