Analyzing PDF exploits with Pyew

Something I really hate when analyzing PDF exploits is having to extract the streams and decode them by hand just to see the (typically hidden) JavaScript code, so I extended the PDF plugin for Pyew to do it automatically. With the new version of the plugin (download it from the Mercurial repository) we can see which filters the exploit uses and, most importantly, the decoded streams, no matter how many filters are chained.

Example

As an example, take an obfuscated PDF exploit (SHA256 6a8204ee7b703f96f811f32f903ac9df4045b05910d633fc34fed89e2e0a7576). To see what is inside, open it in Pyew by running "pyew" followed by the file name:

$ pyew sample.pdf
PDF File

PDFiD 0.0.9_PL 6a8204ee7b703f96f811f32f903ac9df4045b05910d633fc34fed89e2e0a7576
PDF Header: %PDF-1.1
obj 4
endobj 4
stream 1
endstream 1
xref 1
trailer 1
startxref 1
/Page 1
/Encrypt 0
/ObjStm 0
/JS 1
/JavaScript 1
/AA 0
/OpenAction 1
/AcroForm 0
/JBIG2Decode 0
/RichMedia 0
/Colors > 2^24 0
%%EOF 1
After last %%EOF 0
Total entropy: 4.293999 ( 5547 bytes)
Entropy inside streams: 3.669587 ( 4773 bytes)
Entropy outside streams: 5.132696 ( 774 bytes)

(…)

[0x00000000]> p
%PDF-1.1
%вгПУ
1 0 obj
<<
/Type /Catalog
/OpenAction <<
/JS 4 0 R
/S /JavaScript
>>
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Kids [ 3 0 R ]
/Count 1
>>
endobj
3 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources <<
/Font <<
/F1 <<
/Type /Font
/Name /F1
/Subtype /Type1
/BaseFont /Helvetica
>>
>>
>>
/MediaBox [ 0 0 795 842 ]
>>
endobj
4 0 obj
<<
/Length 4769
/Filter [/ASCIIHexDecode /ASCII85Decode /#4c

What do we see in Pyew? The output of PDFiD (a great tool by Didier Stevens) followed by the output of the first block (512 bytes). A brief look at that first block shows an "OpenAction" that executes JavaScript. Surprise. The entry "/JS 4 0 R" says that the JavaScript to execute is in object number 4. Seeking to the offset of object 4 and printing the buffer (in ASCII), we find the following:

[0x000001b7]> s 0x1b7
[0x000001b7]> p
4 0 obj
<<
        /Length 4769
        /Filter [/ASCIIHexDecode /ASCII85Decode /#4c#5a#57De#63#6fde /R#75nLen#67t#68#44ecod#65 /FlateDecode ]
>>stream
4A2E3539605651222D714E634326304C5A47725A236A63494B26682C323A4E532…
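The lookup just performed (follow "/JS 4 0 R" to the place where "4 0 obj" starts) can also be scripted. The following is only a rough sketch in Python 3, not how Pyew does it: the names OBJ_HEADER, JS_REF, object_offsets and javascript_object are made up for this illustration, the sample bytes are a toy document rather than the exploit, and a real parser would rely on the xref table instead of a regular expression scan.

```python
import re

# Start of an indirect object: "<number> <generation> obj".
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")
# Indirect reference following /JS: "/JS <number> <generation> R".
JS_REF = re.compile(rb"/JS\s+(\d+)\s+(\d+)\s+R\b")


def object_offsets(data):
    """Map each object number to the offset of its "N G obj" header."""
    offsets = {}
    for match in OBJ_HEADER.finditer(data):
        # Keep the first definition found; incremental updates are ignored here.
        offsets.setdefault(int(match.group(1)), match.start())
    return offsets


def javascript_object(data):
    """Return the object number referenced by the first /JS entry, or None."""
    match = JS_REF.search(data)
    return int(match.group(1)) if match else None


# Toy document for illustration only (not the sample analyzed in this post).
sample = (
    b"%PDF-1.1\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction << /JS 4 0 R /S /JavaScript >> >>\nendobj\n"
    b"4 0 obj\n<< /Length 2 >>stream\n41\nendstream\nendobj\n"
)

js_obj = javascript_object(sample)
offsets = object_offsets(sample)
print("JavaScript is in object %d at offset 0x%x" % (js_obj, offsets[js_obj]))
```

The printed offset is the value you would pass to Pyew's "s" command before printing the buffer with "p".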

The stream is encoded multiple times and, what is more, the names that specify which filters to apply are themselves encoded. This is perfectly legal according to the PDF specification, although pretty suspicious. Pyew does a good job decoding both the escaped names and the multiply encoded stream. To see the decoded streams in the console, just type "pdfvi":

eval(unescape("%76%61%72%20%56%68%4C%66%4E%20%3D..."))
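The decoded stream is itself wrapped in eval(unescape("...")), so there is one more layer to peel off by hand. As a hedged sketch (Python 3, standard library only; peel_unescape is a hypothetical helper, not a Pyew command), the percent-escaped argument can be decoded like this. Only the prefix visible in the output above is used, and "%uXXXX" escapes, which JavaScript's unescape also accepts, are left untouched by this sketch.

```python
import re
from urllib.parse import unquote

# Finds unescape("...") calls and captures the quoted argument.
UNESCAPE_CALL = re.compile(r'unescape\(\s*"([^"]*)"\s*\)')


def peel_unescape(script):
    """Return the decoded argument of every unescape("...") call in script."""
    # latin-1 maps each %XX escape to one character, as JavaScript's unescape does.
    return [unquote(m.group(1), encoding="latin-1")
            for m in UNESCAPE_CALL.finditer(script)]


# Only the part of the string that is visible in the console output.
visible_prefix = 'eval(unescape("%76%61%72%20%56%68%4C%66%4E%20%3D"))'
print(peel_unescape(visible_prefix))
```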

So the stream hides a small chunk of obfuscated JavaScript ;) Pyew automagically applied all the needed filters (ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode and FlateDecode) and printed the obfuscated code. We can also see it in a graphical user interface: instead of typing "pdfvi", run the command "pdfview". You will see the following screen:

Obfuscated Stream View
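To give an idea of what decoding such a stream involves, here is a minimal sketch in Python 3 that first expands the "#xx" escapes in the filter names and then applies the filters in the order they appear in the /Filter array. It is not Pyew's implementation: every function name is invented for the example, LZWDecode is left out because the standard library has no LZW codec, and the data being decoded is a harmless round-trip payload, not the exploit.

```python
import base64
import binascii
import re
import zlib


def decode_pdf_name(name):
    """Expand #xx hex escapes in a PDF name such as "/R#75nLen#67t#68#44ecod#65"."""
    return re.sub(r"#([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), name)


def ascii_hex_decode(data):
    # Keep everything before the ">" end-of-data marker and drop whitespace.
    digits = b"".join(data.split(b">", 1)[0].split())
    if len(digits) % 2:
        digits += b"0"  # an odd trailing digit is read as if followed by 0
    return binascii.unhexlify(digits)


def ascii85_decode(data):
    # PDF streams end with "~>" and normally lack the leading "<~".
    data = data.strip()
    if data.startswith(b"<~"):
        data = data[2:]
    if data.endswith(b"~>"):
        data = data[:-2]
    return base64.a85decode(data)


def run_length_decode(data):
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        if length == 128:      # end-of-data marker
            break
        if length < 128:       # copy the next length+1 bytes literally
            out += data[i + 1:i + 2 + length]
            i += 2 + length
        else:                  # repeat the next byte 257-length times
            out += data[i + 1:i + 2] * (257 - length)
            i += 2
    return bytes(out)


FILTERS = {
    "/ASCIIHexDecode": ascii_hex_decode,
    "/ASCII85Decode": ascii85_decode,
    "/RunLengthDecode": run_length_decode,
    "/FlateDecode": zlib.decompress,
    # "/LZWDecode" is intentionally missing from this sketch.
}


def decode_stream(data, filter_names):
    """Apply the filters in array order, normalizing escaped names first."""
    for raw_name in filter_names:
        name = decode_pdf_name(raw_name)
        if name not in FILTERS:
            raise NotImplementedError("no decoder for %s" % name)
        data = FILTERS[name](data)
    return data


# Round trip on a harmless payload: compress, Ascii85-encode, then hex-encode.
packed = zlib.compress(b"app.alert(1)")
packed = base64.a85encode(packed) + b"~>"
packed = binascii.hexlify(packed) + b">"
names = ["/ASCIIHexDecode", "/ASCII85Decode", "/FlateDecode"]
print(decode_stream(packed, names))
```

Keeping the decoders in a dictionary keyed by the normalized name means an escaped spelling such as "/#4c#5a#57De#63#6fde" is looked up as plain "/LZWDecode", which here raises NotImplementedError instead of being silently skipped.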


More Examples

OK, so now we can see the decoded stream. But what if there are many encoded streams and we must check them all, or we want to see just one of them? For this purpose, and also to show off Pyew's API, I wrote a usage example of the PDF API. It reads all the streams and shows a list of the encoded ones, as you can see in the following screenshot:

Usage example of the PDF API


From this simple screen we can view all the streams or just one specific (encoded) stream. This is the code of the example:

#!/usr/bin/env python
# Example usage of the Pyew PDF API: list the encoded streams and view them decoded.

import sys

from pyew_core import CPyew
from easygui import choicebox, fileopenbox, msgbox

def main(filename=None):
    # Ask for a file when none was given on the command line
    if filename is None:
        filename = fileopenbox(msg="Select PDF file", default="*.pdf", filetypes=["*.pdf"])
        if filename is None:
            return

    pyew = CPyew(batch=True)
    pyew.loadFile(filename)

    # The loop below treats the result as a mapping: stream number -> filters used
    streams = pyew.plugins["pdfilter"](pyew, doprint=True)
    if len(streams) == 0:
        # No early return here: the menu below is still shown
        msgbox(title="PDF Streams", msg="No encoded streams found")

    # Build the menu: two fixed entries, one entry per encoded stream, then "Quit"
    choices = []
    choices.append("About PDF Streams Viewer")
    choices.append("See all streams (both encoded and unencoded)")
    for x in streams:
        choices.append("Stream %d encoded with %s" % (x, streams[x]))
    choices.append("Quit")

    while True:
        c = choicebox(msg="Select one stream to view it decoded", title="Stream Viewer", choices=choices)
        if c is None:
            # The dialog was cancelled
            break
        elif c.lower() == "quit":
            break
        elif c.lower().startswith("about"):
            msgbox(title="About PDF Streams Viewer",
                   msg="Example usage of the Pyew APIs to see PDF streams. Written by Joxean Koret")
        elif c.lower().startswith("see all"):
            # stream_id=-1 asks the viewer for every stream
            pyew.plugins["pdfview"](pyew, doprint=False, stream_id=-1)
        else:
            # Entries look like "Stream <id> encoded with ...": the id is the second word
            stream_id = int(c.split(" ")[1])
            pyew.plugins["pdfview"](pyew, stream_id=stream_id)

if __name__ == "__main__":
    if len(sys.argv) == 1:
        main()
    else:
        main(sys.argv[1])

And that's all for the moment. I hope you like Pyew's new features ;)
