Analyzing Malicious PDF and Word Documents

This post covers how to handle PDF files and Word documents and extract malicious indicators from them. This comes up nearly every day in a SOC. For example, a user reports a phishing email, and our job as security analysts is to figure out whether the attached files are indeed malicious: safely, quickly and accurately.

The easiest way to contain malicious files is to put them into a virtual machine and then isolate that virtual machine from the rest of your network and your own personal computer. I will be using REMnux, a VM that comes preloaded with reverse-engineering tools.

In this post I will be looking at a PDF file and a Word Document, both commonly used for phishing.

Malicious PDF

For the PDF example I chose a real-world sample from any.run. Under public submissions we can find plenty of files that were submitted by other people. The filter I applied was: File: Adobe PDF, Verdict: Malicious.

Download the archive and unzip it. The password for archives like this is almost always "infected".

As RecentPurchase.pdf is probably malicious, we want to avoid opening it. There's an easy way to extract what might be inside. Many malicious PDFs contain a link and try to trick the user into clicking it. One easy way to find such a link is the strings command: to get a fast answer without a sandbox, we look for the letters http anywhere inside the PDF.

We can do this via the following command:

remnux@remnux:~/Downloads$ strings RecentPurchase.pdf | grep http
<</IsMap false /S /URI /Type /Action /URI ([https]://0x7signin463ewgs[.]nolcarrybackcaresact[.]com/aTcqdFa)>>
remnux@remnux:~/Downloads$

This PDF is probably a phishing attempt meant to get people to go to the above link.

If you don't have a sandbox, don't want to open the file, or have no other way of figuring out whether it is malicious, this is the fast, manual way to extract a link from a PDF. There is no danger involved, because you are only reading the PDF byte by byte and never opening it in a viewer.
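The same idea can be wrapped in a small shell helper. This is a minimal sketch, not a full PDF parser: the function name extract_urls is hypothetical, and it assumes the link is stored uncompressed in the file, as it is in this sample. It runs grep -a over the raw bytes instead of strings, and prints each URL once rather than the whole PDF object.

```shell
# Hypothetical helper: list every unique http(s) URL found in a file's raw bytes.
# The file is only read, never rendered or executed.
extract_urls() {
    # -a: treat binary data as text, -E: extended regex, -o: print only the match.
    # The character class stops at whitespace and at the ( ) < > delimiters
    # that surround a /URI value in a PDF object.
    LC_ALL=C grep -aEo 'https?://[^()<>[:space:]]+' "$1" | sort -u
}
```

Calling extract_urls RecentPurchase.pdf should print just the URL from the /URI action. A PDF that keeps its links inside compressed streams would return nothing here, so an empty result does not prove the file is clean.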

If we want to actually see what the PDF shows when opened, we can do this using any.run. The cool thing about any.run is that it shows us what the PDF would look like.

This looks like a fake Apple App Store receipt. Some kind of Apple credential phishing page trying to get people's iCloud credentials.

Malicious Word Document

For the Word Document I will use a sample from Hybrid Analysis. Searching for files requires an account. I used the Advanced Search with the filter: Filetype: doc, Verdict: Malicious.

Invoice-themed phishing documents are often Word documents. Hybrid Analysis already tells us that the file is malicious, but let's see if we can check this the manual way, since a user might receive a phish that has not yet been submitted anywhere.

Word documents are different from PDFs. They are much more complex, and we generally can't just use strings on them to get meaningful output. However, we can use some tools that are built into REMnux to get indicators of whether we are dealing with a malicious file or not.

One of the tools we can use is called olevba (OLE, Object Linking and Embedding, being the container format of legacy Office documents).

Download the file, unzip it, and then run olevba on it, passing the document as the argument.
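As a sketch of what that invocation can look like: olevba takes the document path as its argument. The file name invoice.doc and the helper summary_flags below are hypothetical, and the filter assumes the summary is printed as a table whose rows start with |AutoExec, |Suspicious or |IOC, which may differ between oletools versions.

```shell
# Hypothetical helper: keep only the AutoExec, Suspicious and IOC rows
# of an olevba summary table read from standard input.
summary_flags() {
    grep -E '^\|[[:space:]]*(AutoExec|Suspicious|IOC)[[:space:]]*\|'
}

# Usage (file name is an assumption; olevba must be installed, as on REMnux):
#   olevba invoice.doc                   # full output: macro source, then summary
#   olevba invoice.doc | summary_flags   # only the flagged summary rows
```

The filter is only a convenience for triage. The full output is still worth reading, because the macro source itself is often the strongest indicator.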

After running this, we will get a summary at the bottom:

AutoExec means the macro runs automatically on Document_open (when the document is opened). That is a bad sign for any kind of Word document.

As seen below, olevba prints the macro source above the summary. One way to tell we're looking at a malicious Word document is a macro that runs automatically on open and is heavily obfuscated, like in this example:

Seeing this, we can quickly go from "Is this bad?" to "This is absolutely bad."

We can go a little deeper, since this does not actually tell us what happens when a user opens the Word document. When we throw the file into a sandbox, we can see the following:

The document runs cmd.exe with a Base64-encoded command. If we want to understand what it does, we can copy the Base64-encoded part and decode it in the terminal like so:

$ echo "<base 64 encoded part>" | base64 -d
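If the decoded text shows up with odd spacing or invisible characters, the blob was probably meant for PowerShell's -EncodedCommand parameter, which expects Base64 of UTF-16LE text. The helper below is a small sketch for that case. The name decode_ps is hypothetical, and whether this particular sample uses UTF-16LE is an assumption.

```shell
# Hypothetical helper: decode a Base64 blob that encodes UTF-16LE text,
# the encoding PowerShell expects for -EncodedCommand.
decode_ps() {
    # printf passes the argument through verbatim, base64 -d decodes it,
    # and iconv converts the UTF-16LE bytes to UTF-8 for the terminal.
    printf '%s' "$1" | base64 -d | iconv -f UTF-16LE -t UTF-8
}
```

For example, decode_ps 'ZABpAHIA' prints dir. The function only decodes text and never executes it, so it is safe to run on the REMnux side.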

Now we have some heavily obfuscated PowerShell, which is not a surprise. Often cmd is used to execute PowerShell commands that download and execute more scripts, files, etc. We can't yet see what is really happening, but we can notice Net.WebClient (marked in red above), a .NET class commonly used for downloading additional content. We could try to deobfuscate this manually, but that would be a huge pain.

What we could do instead is copy the PowerShell part and run the code inside a Windows VM using PowerShell ISE. This way we can see what all the variables of the PowerShell command turn into.

!!! info

Make sure to disconnect the VM from the network and take a snapshot of the VM beforehand. If you don't disconnect the VM from the network, the script can reach its download URLs and infect the machine.

Once the code has run, we can type Get-Variable to see all variables that were created.

In a couple of seconds we took obfuscated PowerShell, deobfuscated it and extracted all the URLs the file tried to reach out to. The next step would be to look in our SIEM, firewall logs and proxy logs and check that no user has gone to any of these URLs. If any user has, we'd need to follow up and see what happened next. We'd also want to block these URLs to make sure no one is able to reach them.
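A quick first pass over exported logs can be done with grep before building a proper SIEM query. The sketch below assumes two hypothetical files: iocs.txt with one extracted URL or domain per line, and proxy.log as a plain-text log export. Real log locations and formats will differ.

```shell
# Hypothetical helper: print every log line that mentions a listed indicator.
#   $1: file with one indicator (URL or domain) per line
#   $2: plain-text log export to search
find_ioc_hits() {
    # -F: match indicators as literal strings (dots are not wildcards)
    # -f: read the patterns from the indicator file
    grep -F -f "$1" "$2"
}

# Usage (file names are assumptions):
#   find_ioc_hits iocs.txt proxy.log
```

An empty line in the indicator file matches every log line, so strip blank lines from it first. Any hit is a starting point for the follow-up described above, not a verdict on its own.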