Extracting binary patterns in malware sets and generating Yara rules
Some time ago a friend and I were talking about how to create a tool that compares a set of malware samples and extracts the binary patterns found in all or most of them. While searching for diffing algorithms I found some very interesting papers on the matter, such as “An O(ND) Difference Algorithm and Its Variations”, and many utility libraries for diffing, such as Google Diff Match Patch. Finally, I decided to write a test tool using this library and ended up with an automatic Yara signature generator.
The tool I wrote in Python is far from efficient (it’s slow) but “works”. The tool does the following:
- Read all the files of a directory given via the command line.
- Diff all the files and save the matching blocks for later analysis.
- Compare the blocks and keep those that are at least 5 bytes long and match in at least 70% of the samples.
- Print out the similar blocks.
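The steps above can be sketched roughly as follows. This is not the tool's actual code: it uses `difflib.SequenceMatcher` from the Python standard library instead of Diff Match Patch, it assumes the files are diffed pairwise, and the names `matching_blocks` and `common_blocks` are made up for illustration. Like the tool, it is slow.

```python
from difflib import SequenceMatcher

MIN_SIZE = 5      # minimum block length in bytes (from the list above)
MIN_PERCENT = 70  # a block must occur in at least 70% of the samples


def matching_blocks(a: bytes, b: bytes, min_size: int = MIN_SIZE) -> set:
    """Return the byte runs of at least min_size that difflib finds in both a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return {
        a[m.a:m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size >= min_size  # also drops the zero-length terminator block
    }


def common_blocks(samples: list, min_percent: int = MIN_PERCENT,
                  min_size: int = MIN_SIZE) -> dict:
    """Map each shared block to the number of samples that contain it."""
    # Assumption: candidates come from diffing every pair of samples.
    candidates = set()
    for i, first in enumerate(samples):
        for second in samples[i + 1:]:
            candidates |= matching_blocks(first, second, min_size)

    result = {}
    for block in candidates:
        hits = sum(block in sample for sample in samples)
        # Integer arithmetic avoids floating-point surprises at the threshold.
        if hits * 100 >= min_percent * len(samples):
            result[block] = hits
    return result
```

Given a list with the contents of each file (for example, read with `pathlib.Path.read_bytes()`), `common_blocks` returns a dictionary that maps each block to the number of samples it appears in.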
For the test I used a set of (old) malware samples packed with AutoIt. The following is the tool’s output for that set, including the Yara rule it generated:
$ ./tester.py malware/autoit/ CFileDiffer: Diffing a total of 10 file(s) CFileDiffer: Diffing file 1 out of 10 CFileDiffer: Diffing file 2 out of 10 CFileDiffer: Diffing file 3 out of 10 CFileDiffer: Diffing file 4 out of 10 CFileDiffer: Diffing file 5 out of 10 CFileDiffer: Diffing file 6 out of 10 CFileDiffer: Diffing file 7 out of 10 CFileDiffer: Diffing file 8 out of 10 CFileDiffer: Diffing file 9 out of 10 CFileDiffer: Diffing file 10 out of 10 rule test : test { strings: $a = { 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 } $b = { 4d 5a 90 00 03 00 00 00 04 00 00 00 ff ff 00 00 b8 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f8 00 00 00 0e 1f ba 0e 00 b4 09 cd 21 b8 01 4c cd 21 54 68 69 73 20 70 72 6f 67 72 61 6d 20 63 61 6e 6e 6f 74 20 62 65 20 72 75 6e 20 69 6e 20 44 4f 53 20 6d 6f 64 65 2e 0d 0d 0a 24 00 00 00 00 00 00 00 } $c = { 4d 5a 90 00 03 00 00 00 04 00 00 00 ff ff 00 00 b8 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 } $d = { 4d 5a 90 00 03 00 00 00 04 00 00 00 ff ff 00 00 b8 00 00 00 00 00 00 00 40 00 00 00 } $e = "AU3!EA06" $f = { 4d 5a 90 00 03 00 00 00 04 00 00 00 ff ff 00 00 b8 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f8 00 00 00 0e 1f ba 0e 00 b4 09 cd 21 b8 01 4c cd 21 54 68 69 73 20 70 72 6f 67 72 61 6d 20 63 61 6e 6e 6f 74 20 62 65 20 72 75 6e 20 69 6e 20 44 4f 53 20 6d 6f 64 65 2e 0d 0d 0a 24 00 00 00 00 00 00 00 6c cc 83 dc 28 ad ed 8f 28 ad ed 8f 28 ad ed 8f 95 e2 7b 8f 2a ad ed 8f 21 d5 69 8f 1c ad ed 8f 21 d5 6e 8f 9d ad ed 8f 0f 6b 80 8f 22 ad ed 8f 0f 6b 96 8f 09 ad ed 8f 28 ad ec 8f 2b af ed 8f 21 d5 62 8f 6f ad ed 8f 21 d5 78 8f 37 ad ed 8f 36 ff 78 8f 29 ad ed 8f 36 ff 79 8f 29 ad ed 8f 21 d5 7c 8f 29 ad ed 8f 52 69 63 68 28 ad ed 8f 00 00 00 00 00 00 00 00 50 45 00 00 4c 01 } condition: ($c) or // Matches a total of 8 file(s) out of 10 ($a and $d) or // Matches a total of 10 file(s) out of 10 ($f) or // Matches a total of 3 file(s) out of 10 ($b) or // Matches a total of 4 file(s) out of 10 ($e) // Matches a total of 7 file(s) out of 10 }
This tool doesn’t generate a single condition that matches the whole set; rather, it generates conditions that match subsets of the given set. For example, the string “AU3!EA06” ($e) is found in 7 of the 10 files I ran the tool against, so it doesn’t cover the whole set. Indeed, the only condition that matches the whole set is the second one ($a and $d). However, this condition is not very useful, to be honest: it just matches a run of ‘\0’ bytes and the first bytes of the MZ header that every PE file starts with.
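To illustrate how blocks and their per-file counts could be turned into a rule like the one above, here is a minimal sketch. It is not taken from the tool: the function names and the `$s0`, `$s1`, ... identifiers are hypothetical, each block gets its own `or` clause, and it makes no attempt to combine strings into clauses such as `($a and $d)`.

```python
import string

# Characters that are safe inside a quoted Yara text string without escaping.
SAFE_CHARS = set(string.ascii_letters + string.digits + string.punctuation + " ") - {'"', "\\"}


def yara_string(block: bytes) -> str:
    """Render a block as a quoted text string if printable, else as a hex string."""
    text = block.decode("latin-1")
    if all(ch in SAFE_CHARS for ch in text):
        return f'"{text}"'
    return "{ " + " ".join(f"{byte:02x}" for byte in block) + " }"


def yara_rule(name: str, blocks: dict, total: int) -> str:
    """Build a rule with one string per block and one 'or' clause per string."""
    if not blocks:
        raise ValueError("at least one block is needed to build a rule")

    lines = [f"rule {name}", "{", "    strings:"]
    clauses = []
    # Sort the blocks so the generated identifiers are stable between runs.
    for index, (block, hits) in enumerate(sorted(blocks.items())):
        ident = f"$s{index}"
        lines.append(f"        {ident} = {yara_string(block)}")
        clauses.append((ident, hits))

    lines.append("    condition:")
    for position, (ident, hits) in enumerate(clauses):
        joiner = " or" if position < len(clauses) - 1 else ""
        lines.append(
            f"        ({ident}){joiner} // Matches a total of {hits} file(s) out of {total}"
        )
    lines.append("}")
    return "\n".join(lines)
```

For example, `yara_rule("test", {b"AU3!EA06": 7, b"MZ\x90\x00": 10}, 10)` produces a rule with a text string for the AutoIt marker and a hex string for the header bytes, each annotated with the number of files it matched.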
You can download the tool here. I hope you find it useful!