
Python: Why does opening an XFA PDF file take longer than opening a text file of the same size?

I am currently developing Python code to extract data from 14,000 PDFs (7 MB per PDF). They are dynamic XFA forms made with Adobe LiveCycle Designer 11.0, so they contain streams that need to be decoded later (and therefore some non-ASCII characters, in case that makes a difference).

My problem is that calling open() on those files takes around 1 second each, if not more.

I tried the same operation on 13 MB text files created by copy-pasting a single character, and they take less than 0.01 seconds to open. Where does the extra time come from when I open the dynamic PDFs with open()? Can I avoid this bottleneck?

I got those timings using cProfile like this:

from cProfile import Profile

profiler = Profile()
profiler.enable()
f = open('test.pdf', 'rb')  # the call being measured
f.close()
profiler.disable()
profiler.print_stats('tottime')  # sort by time spent inside each function

The result of print_stats for a given XFA PDF is the following: a single call to io.open() takes around 1 second.

Additional information: I have noticed that opening is around 10x faster when the same PDF file was opened in the last 15 or 30 minutes, even if I delete the __pycache__ directory inside my project. A solution that makes this speedup apply regardless of the elapsed time could be worth it, though I only have 50 GB left on my PC. Also, processing the PDFs in parallel is not an option, since I only have 1 free core to run my implementation.


Answer

To solve this problem, you can do one of the following:

  • In the Windows Defender settings, specify files, directories, or extensions to exclude from real-time scanning.
  • Temporarily turn off real-time protection in Windows Defender.
  • Save the files in an encoded format in which Windows Defender can't detect links to other files or websites, and decode them on read. (I have not tried this.)
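
The third option could look something like the sketch below. It is untested against Windows Defender, so whether it actually avoids the slow scan is an assumption. The choice of zlib as the encoding and the function names encode_file and read_encoded are illustrative, not something from the original setup.

```python
import zlib
from pathlib import Path


def encode_file(src, dst):
    """Write a zlib-compressed copy of `src` to `dst` (one-off conversion)."""
    raw = Path(src).read_bytes()
    Path(dst).write_bytes(zlib.compress(raw))


def read_encoded(path):
    """Return the original bytes of a file written by encode_file()."""
    return zlib.decompress(Path(path).read_bytes())
```

The one-off conversion still has to open every original PDF once, so it pays the scan cost a single time. Later runs would call read_encoded() and pass the returned bytes to the PDF parser.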

As “user2357112 supports monica” said in the comments, the culprit is the antivirus software, which scans the files before making them available to Python.

I was able to verify this by calling open() on a list of files with the Task Manager open. Python used almost 0% of the CPU, while the Microsoft Defender Antivirus Service was maxing out one of my cores.

I compared this to another run of my script in which I opened the same file multiple times: Python was maxing out the core while the antivirus stayed at 0%.

I ran a quick scan of a single PDF file twice with Windows Defender. The first run scanned 800 files in 1 second (hence the 1-second delay of the open() call), and the second run scanned a single file instantly.
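
A comparison like this can also be made with a small timing loop instead of the Task Manager. This is a minimal sketch; the helper name time_opens is hypothetical, and it measures only wall-clock time per open() call, not which process is using the CPU.

```python
import time


def time_opens(path, repeats=5):
    """Return the wall-clock seconds taken by each open()/close() of `path`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb"):
            pass  # open and immediately close; no data is read
        timings.append(time.perf_counter() - start)
    return timings
```

Calling time_opens('test.pdf') on a file that has not been opened recently should show whether the first entry is much larger than the rest, which would match the scan-then-cache behaviour described here.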

Explanation:

Windows Defender scans through all the file and internet links written in the folder, which is why scanning takes so long and why around 800 files were scanned in the first report. Windows Defender keeps a cache of the files it has scanned since the PC was powered on. Files that do not link to the internet don't need to be rescanned. XFAs, however, contain links to websites. Since it is impossible to tell whether a website has been maliciously modified, files that contain such links need to be rescanned periodically to make sure they are still safe.

Here is a link to the official Microsoft forum thread.

User contributions licensed under: CC BY-SA