Analysing Suspicious PDF Files
Triage and initial analysis of suspicious PDF files
Problem
Executable files are routinely flagged by antivirus tools and are increasingly treated as suspicious and untrusted by default. PDF files, by contrast, are treated with less suspicion, and attackers often use them to trick targets into running malicious code in order to gain an initial foothold on their machines.
Malicious PDF files use code obfuscation and other techniques to evade antivirus detection. Therefore, when a file looks suspicious, it is useful to check it manually.
Most users do not know that PDF files can also contain JavaScript code. While these scripts can have legitimate uses such as filling a form, they can also be used for malicious purposes.
At-risk users such as journalists and human rights defenders often exchange information as PDF files, which makes them likely targets of malicious PDFs.
Solution
Important Notes
The following analysis should not be conducted using:
- The Helpline’s internal infrastructure, or
- a device that contains Helpline’s assets such as SSL certificates, PGP keys, VPN configuration files, or password databases, or documents sent by beneficiaries such as PGP keys and screenshots; or
- a network used by the organization’s staff or servers.
Incident handlers can instead use an old device that has been wiped and reinstalled, connected to a 4G hotspot.
Introduction
Once you receive the file, make sure to run it through an antivirus, either by doing it yourself or by asking the beneficiary to do it. If this is not possible, generate a hash of the file and check it on VirusTotal or on a MISP instance to see if it has already been identified.
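The hash can be generated with any standard tool. As a minimal sketch in Python, assuming the suspicious file is stored locally on the isolated analysis device (the function name sha256_of_file is illustrative and not part of any tool mentioned in this guide):

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file.

    The file is only read as raw bytes in chunks; it is never opened
    in a PDF viewer.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling sha256_of_file("pdf_file") returns a 64-character hexadecimal string that can be pasted into the search field of a lookup service. Submitting only the hash avoids uploading the document itself.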
If the file is not identified as malicious through the above techniques, we should proceed with a static analysis, as detailed below.
A common technique to compromise a user with a PDF file is to embed malicious JavaScript code in it. The user is then pushed through social engineering into opening the file. PDF viewers that generate thumbnails may even run the JavaScript code without any user intervention.
To determine whether the file meets such criteria for suspicion, we should carry out a static analysis of the file and look for tags that are usually related to the malicious use of PDF files.
PDF file formats
Understanding some basics on the PDF file format will help understand the next steps of the analysis.
A PDF file starts with a header in this format: %PDF-X.Y, where X.Y is the version of the PDF specification the file follows (for example, %PDF-1.1).
While some PDF viewers allow the execution of a file even if this header is corrupted, some antivirus tools will fail to analyze the PDF just because this header is missing or corrupted. This trick can be used by attackers to bypass the antivirus.
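A quick way to spot this trick is to check where the header actually sits in the file. The sketch below is only an illustration: the function name find_pdf_header is hypothetical, and the 1024-byte search window is an assumption about how tolerant a viewer might be, not a value taken from this guide.

```python
import re

# %PDF- followed by a single-digit major and minor version, e.g. %PDF-1.1
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def find_pdf_header(data, window=1024):
    """Return (offset, version) for the first header in the first `window` bytes.

    Returns None when no well-formed header is found in that window.
    An offset greater than 0 means there is junk before the header.
    """
    match = HEADER_RE.search(data[:window])
    if match is None:
        return None
    version = "{}.{}".format(match.group(1).decode(), match.group(2).decode())
    return match.start(), version
```

A result of None, or an offset other than 0, does not prove that the file is malicious, but it is a reason to continue with the manual analysis even if an antivirus scan came back clean.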
Most of the rest of a PDF file is made up of objects, which are the subject of our analysis.
Every object has an index (its object number), a version (its generation number), and content: a dictionary enclosed in << and >>, optionally followed by a data stream. Objects can be marked by tags or keywords, which can be found in the object's content. These tags and keywords show what an object is meant to do.
This is an example of the first object of a PDF, as displayed by pdf-parser (a tool introduced in step 2):
obj 1 0
Type: /Catalog
Referencing: 2 0 R, 3 0 R, 7 0 R
<<
/Type /Catalog
/Outlines 2 0 R
/Pages 3 0 R
/OpenAction 7 0 R
>>
The index here is 1 and the version is 0; both appear on the first line. The content of the object, here a dictionary, is between << and >>.
These indexes and keywords are what we will focus on to analyse a PDF file. The content of the objects is where we will be looking for malicious code.
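In the raw file, the same object is written as "1 0 obj ... endobj". To make the structure concrete, here is a rough sketch that lists the index, version, keywords and references of each object in raw PDF bytes. It assumes an unobfuscated, unencrypted file and uses simple regular expressions, so binary stream data can produce false matches. The function name list_objects is hypothetical; for real analysis, rely on the dedicated tools described below.

```python
import re

# "<index> <version> obj ... endobj" as written in the raw file
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
# Keywords (PDF names) such as /Type or /OpenAction
NAME_RE = re.compile(rb"/[A-Za-z0-9]+")
# Indirect references such as "7 0 R"
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def list_objects(data):
    """Yield (index, version, keywords, references) for each object found."""
    for match in OBJ_RE.finditer(data):
        body = match.group(3)
        keywords = [name.decode() for name in NAME_RE.findall(body)]
        references = [(int(i), int(g)) for i, g in REF_RE.findall(body)]
        yield int(match.group(1)), int(match.group(2)), keywords, references
```

Run on the catalog object shown above, this yields index 1, version 0, the keywords /Type, /Catalog, /Outlines, /Pages and /OpenAction, and references to objects 2, 3 and 7.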
1. Identification of a potential malicious file
To do this analysis, we can use a tool called PDFiD, maintained by Didier Stevens and available from his website.
Please use the SHA256 hash to verify the downloaded file.
To run PDFiD, launch the following command:
$ python pdfid.py pdf_file
Below is the result of the previous command run on a test file containing a simple JavaScript code that would run once the file is open:
PDFiD 0.2.5 /home/user/pdf/js@pdf/Sans_nom.pdf
PDF Header: %PDF-1.1
obj 7
endobj 7
stream 1
endstream 1
xref 1
trailer 1
startxref 1
/Page 1
/Encrypt 0
/ObjStm 0
/JS 1
/JavaScript 1
/AA 0
/OpenAction 1
/AcroForm 0
/JBIG2Decode 0
/RichMedia 0
/Launch 0
/EmbeddedFile 0
/XFA 0
/URI 0
/Colors > 2^24 0
Below are the tags and keywords that you should look for. If any of them has a value greater than zero, the file should be considered suspicious:
- /OpenAction and /AA specify the script or action to run automatically.
- /JavaScript and /JS specify JavaScript to run.
- /GoTo changes the view to a specified destination within the PDF or in another PDF file.
- /Launch can launch a program or open a document.
- /URI accesses a resource by its URL.
- /SubmitForm and /GoToR can send data to a URL.
- /RichMedia can be used to embed Flash in a PDF.
- /ObjStm can hide objects inside an object stream.
We see in our example that /JS, /JavaScript, and /OpenAction all have a value of 1. This shows that there is JavaScript code embedded in this PDF file and that an action is triggered automatically.
In a situation like this, and in other situations where the mentioned tags have a value greater than zero, we can consider the file as suspicious and we should proceed to the analysis as shown below.
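When several files need triage, the check against the list above can be scripted. The following is a minimal sketch, assuming the PDFiD output has been saved as text in the format shown in the example (one keyword and its count per line). The set RISKY_KEYWORDS simply mirrors the list in this guide, and the function name flag_risky_keywords is hypothetical.

```python
import re

# Keywords listed in this guide as reasons to treat a file as suspicious.
RISKY_KEYWORDS = {
    "/OpenAction", "/AA", "/JavaScript", "/JS", "/GoTo", "/Launch",
    "/URI", "/SubmitForm", "/GoToR", "/RichMedia", "/ObjStm",
}


def flag_risky_keywords(pdfid_output):
    """Return {keyword: count} for risky keywords whose count is above zero."""
    flagged = {}
    for line in pdfid_output.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0] not in RISKY_KEYWORDS:
            continue
        # Assumption: the count is the second field; only its leading
        # digits are read, so any trailing annotation is ignored.
        count = re.match(r"\d+", parts[1])
        if count and int(count.group()) > 0:
            flagged[parts[0]] = int(count.group())
    return flagged
```

An empty result does not mean the file is safe; it only means that none of the listed keywords were counted.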
2. Analysis of suspicious PDF files
For the analysis, we will use a tool that allows us to perform searches within the objects, extract the embedded code, and know when it could be executed.
For this purpose, we can use a tool called pdf-parser, also written by Didier Stevens, which can be downloaded from his website.
Please use the SHA256 hash to verify the downloaded file.
Launch the following command to search for JavaScript code:
$ python pdf-parser.py --search javascript pdf_file
If the JavaScript is not obfuscated, the command returns the objects in which JavaScript is embedded.
For our example:
$ python pdf-parser.py --search JavaScript ../pdf/js@pdf/Sans_nom.pdf
obj 7 0
Type: /Action
Referencing:
<<
/Type /Action
/S /JavaScript
/JS (app.alert({cMsg: 'Hello from PDF JavaScript', cTitle: 'Testing PDF JavaScript', nIcon: 3});)
>>
To find out how or when this JavaScript code is called, we should look for the objects that reference it, typically through /OpenAction or /AA, using its index value:
$ python pdf-parser.py --reference Obj_Index pdf_file
So for our example:
$ python pdf-parser.py --reference 7 ../pdf/js@pdf/Sans_nom.pdf
The result is:
obj 1 0
Type: /Catalog
Referencing: 2 0 R, 3 0 R, 7 0 R
<<
/Type /Catalog
/Outlines 2 0 R
/Pages 3 0 R
/OpenAction 7 0 R
>>
Here, obj 1 (the document catalog) references object 7 through /OpenAction, which means that this code is called when the file is opened.
Sometimes streams are compressed or encoded with a filter by the author or the writing tool. This is indicated by the presence of a tag such as the one below:
/Filter [ /FlateDecode ]
If this is encountered, you can show the stream with this command:
$ python pdf-parser.py --object Obj_Index --filter --raw pdf_file
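For /FlateDecode, the decoding can be approximated with Python's zlib module. This is a rough sketch rather than a replacement for pdf-parser: it assumes an unencrypted file, streams delimited by the stream and endstream keywords, and a single /FlateDecode filter with no other filter chained before it. The function name inflate_streams is hypothetical.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(data):
    """Return the decompressed content of every stream that zlib can inflate."""
    results = []
    for match in STREAM_RE.finditer(data):
        # decompressobj() tolerates the end-of-line bytes that follow the
        # compressed data; they end up in its unused_data attribute.
        inflater = zlib.decompressobj()
        try:
            results.append(inflater.decompress(match.group(1)))
        except zlib.error:
            # Not zlib data, or damaged: skip it and inspect it with pdf-parser.
            continue
    return results
```

Each returned item is the decoded content of one stream as bytes, which can then be searched for JavaScript in the same way as an unfiltered object.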
Comments
This guide is based on the work of Didier Stevens. You can read more about his work on investigating PDF files on his blog.
The risky tags and other commands and tools can be found in Lenny Zeltser’s Cheat Sheet.
The techniques detailed in this article can be implemented with other tools, such as peepdf.
REMnux is a Linux distribution that includes more malware analysis tools, including PDF analysis tools. You can use it to double check your results.