Analyzing PDF exploits with Pyew

2010, Feb 21    

Something I really hate doing when analyzing PDF malware exploits is manually extracting the streams and decoding them by hand to see the, typically hidden, JavaScript code, so I decided to extend the PDF plugin for Pyew to show them automatically. Now, with the new version of the plugin (download it from the Mercurial repository), we can see which filters are used in the exploit and, most importantly, we can see the decoded streams, regardless of how many filters are applied.


As an example, I will take an obfuscated PDF exploit (SHA256 6a8204ee7b703f96f811f32f903ac9df4045b05910d633fc34fed89e2e0a7576). To see what is inside, simply open it in Pyew by running the command “pyew pdf.file”:

$ pyew sample.pdf
PDF File

PDFiD 0.0.9_PL 6a8204ee7b703f96f811f32f903ac9df4045b05910d633fc34fed89e2e0a7576
PDF Header: %PDF-1.1
obj 4
endobj 4
stream 1
endstream 1
xref 1
trailer 1
startxref 1
/Page 1
/Encrypt 0
/ObjStm 0
/JS 1
/JavaScript 1
/AA 0
/OpenAction 1
/AcroForm 0
/JBIG2Decode 0
/RichMedia 0
/Colors > 2^24 0
%%EOF 1
After last %%EOF 0
Total entropy: 4.293999 ( 5547 bytes)
Entropy inside streams: 3.669587 ( 4773 bytes)
Entropy outside streams: 5.132696 ( 774 bytes)


[0x00000000]> p

1 0 obj
/Type /Catalog
/OpenAction <<
/JS 4 0 R
/S /JavaScript
/Pages 2 0 R

2 0 obj
/Type /Pages
/Kids [ 3 0 R ]
/Count 1

3 0 obj
/Type /Page
/Parent 2 0 R
/Resources <<
/Font <<
/F1 <<
/Type /Font
/Name /F1
/Subtype /Type1
/BaseFont /Helvetica
/MediaBox [ 0 0 795 842 ]

4 0 obj
/Length 4769
/Filter [/ASCIIHexDecode /ASCII85Decode /#4c

What do we see in Pyew? The output of PDFiD (a great tool by Didier Stevens) as well as the hexadecimal dump of the first block (512 bytes). Taking a brief look at the first block of data, we see one “OpenAction” that executes JavaScript. Surprise. The entry “/JS 4 0 R” specifies that the JavaScript code to be executed is in object number 4. Seeking to the offset where object #4 starts and printing the buffer (in ASCII), we find the following:

[0x000001b7]> s 0x1b7
[0x000001b7]> p
4 0 obj
        /Length 4769
        /Filter [/ASCIIHexDecode /ASCII85Decode /#4c#5a#57De#63#6fde /R#75nLen#67t#68#44ecod#65 /FlateDecode ]
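
Seeking to an object by hand means knowing its offset first. As a rough sketch of how that offset could be located outside Pyew, the following scans the raw bytes for the “4 0 obj” header. The function name find_object_offset is hypothetical and not part of Pyew, and the approach is an assumption on my side: a strict parser would read the xref table instead, but a plain scan still works when that table is damaged or deliberately wrong.

```python
import re


def find_object_offset(data, obj_num, gen=0):
    """Return the byte offset of the 'N G obj' header, or None if absent.

    Hypothetical helper for illustration; it does not use the xref table.
    """
    # The lookbehind keeps "14 0 obj" from matching a search for object 4.
    pattern = re.compile(rb"(?<![0-9])%d\s+%d\s+obj\b" % (obj_num, gen))
    match = pattern.search(data)
    return match.start() if match else None


if __name__ == "__main__":
    import sys

    # Usage: python find_obj.py sample.pdf 4
    with open(sys.argv[1], "rb") as handle:
        offset = find_object_offset(handle.read(), int(sys.argv[2]))
    print("not found" if offset is None else hex(offset))
```

The printed offset is the value you would pass to Pyew's “s” command.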

The object is encoded multiple times and, what is more, the names that specify which filters must be used to decode the stream are encoded too. This is perfectly legal according to the PDF specification, although pretty suspicious. Pyew does a good job decoding both the encoded names and the multiply encoded stream. To view the encoded streams in the console, just type “pdfvi”.
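
The escaped filter names follow the PDF name syntax, where “#” followed by two hexadecimal digits stands for the byte with that value. A minimal sketch of undoing that escaping is shown here; decode_pdf_name is a hypothetical helper written for this post, not the function Pyew uses internally.

```python
import re


def decode_pdf_name(name):
    """Replace every #xx escape in a PDF name with the character it encodes."""
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda match: chr(int(match.group(1), 16)),
        name,
    )


# The two obfuscated names from the /Filter array of object 4
print(decode_pdf_name("/#4c#5a#57De#63#6fde"))        # /LZWDecode
print(decode_pdf_name("/R#75nLen#67t#68#44ecod#65"))  # /RunLengthDecode
```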


Wow! The decoded stream is a small chunk of JavaScript 😉 Pyew automagically applied all the needed filters (ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode and FlateDecode) and printed out the obfuscated code. We can also see it in a graphical user interface: instead of typing “pdfvi”, execute the command “pdfview”. You will see the following screen:

Obfuscated Stream View
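
To give an idea of what applying the filters one after another involves, here is a small sketch that decodes a stream by running each filter of the /Filter array in order. It is not Pyew's code: the function names are hypothetical, it relies only on the Python 3 standard library, and LZWDecode is left out to keep it short, so it would not decode the sample above as is.

```python
import base64
import binascii
import zlib


def ascii_hex_decode(data):
    # Drop whitespace and stop at the '>' end-of-data marker.
    digits = b"".join(data.split()).split(b">", 1)[0]
    if len(digits) % 2:
        digits += b"0"  # an odd final digit is padded with 0
    return binascii.unhexlify(digits)


def ascii85_decode(data):
    # PDF streams end with '~>' but have no leading '<~'.
    body = data.strip()
    if body.endswith(b"~>"):
        body = body[:-2]
    return base64.a85decode(body)


def run_length_decode(data):
    out = bytearray()
    i = 0
    while i < len(data):
        n = data[i]
        i += 1
        if n == 128:      # end-of-data marker
            break
        if n < 128:       # copy the next n + 1 bytes literally
            out += data[i:i + n + 1]
            i += n + 1
        else:             # repeat the next byte 257 - n times
            out += data[i:i + 1] * (257 - n)
            i += 1
    return bytes(out)


# LZWDecode is intentionally missing from this sketch.
DECODERS = {
    "ASCIIHexDecode": ascii_hex_decode,
    "ASCII85Decode": ascii85_decode,
    "RunLengthDecode": run_length_decode,
    "FlateDecode": zlib.decompress,
}


def decode_stream(data, filters):
    """Apply the filters in the order they appear in the /Filter array."""
    for name in filters:
        data = DECODERS[name](data)
    return data
```

Calling decode_stream(raw, ["ASCIIHexDecode", "ASCII85Decode", "FlateDecode"]) undoes the hex layer first, then ASCII85, then zlib. An unknown filter name raises a KeyError, which seems preferable to silently returning half-decoded data.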

More Examples

OK, so now we can see the encoded stream but, what if there are a lot of encoded streams and we must check them all, or we want to see just one of them? For this purpose, and also to show Pyew’s APIs, I created an example usage of the PDF API. The example reads all the streams and shows a list of all the encoded ones, as you can see in the following screenshot:

Usage example of the PDF API

Using this simple screen we can view all the streams or just one specific (encoded) stream. This is the code of this example usage of Pyew’s API for the PDF format:

#!/usr/bin/env python

import os
import sys

from pyew_core import CPyew
from easygui import choicebox, fileopenbox, msgbox

def main(filename=None):
    # Ask for a file when none was given on the command line
    if filename is None:
        filename = fileopenbox(msg="Select PDF file", default="*.pdf", filetypes=["*.pdf"])
        if filename is None:
            return

    pyew = CPyew(batch=True)
    pyew.loadFile(filename)

    # streams is indexed by stream number; each value describes the filters used
    streams = pyew.plugins["pdfilter"](pyew, doprint=True)
    if len(streams) == 0:
        msgbox(title="PDF Streams",msg="No encoded streams found")

    # Build the list of choices shown in the dialog
    l = []
    l.append("About PDF Streams Viewer")
    l.append("See all streams (both encoded and unencoded)")
    for x in streams:
        l.append("Stream %d encoded with %s" % (x, streams[x]))
    l.append("Quit")

    while 1:
        c = choicebox(msg="Select one stream to view it decoded", title="Stream Viewer", choices=l)
        if c is None:
            break
        elif c.lower() == "quit":
            break
        elif c.lower().startswith("about"):
            msgbox(title="About PDF Streams Viewer",
                   msg="Example usage of the Pyew APIs to see PDF streams. Written by Joxean Koret")
        elif c.lower().startswith("see all"):
            pyew.plugins["pdfview"](pyew, doprint=False, stream_id=-1)
        else:
            # The label is "Stream <id> encoded with ...", so the id is the second word
            stream_id = int(c.split(" ")[1])
            pyew.plugins["pdfview"](pyew, stream_id=stream_id)

if __name__ == "__main__":
    if len(sys.argv) == 1:
        main()
    else:
        main(sys.argv[1])

And that’s all for the moment. I hope you like Pyew’s new features 😉