Question regarding the compression of PDF files

Rosika · May 5, 2024, 2:34pm

Hi all,

no problem here, just a matter of discussion.

I just downloaded an e-book (which is currently offered for free via the how-to-geek newsletter) from here and the downloaded PDF file weighs in at 31.0 MB.

To save some disk space for archiving purposes I tend to compress pdf files of a similar size.

Well, to my astonishment the command I used didn´t make it smaller. It even grew a bit in size: 31.5 MB.

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/prepress -dNOPAUSE -dQUIET -dBATCH -sOutputFile=compressed_PDF_file.pdf w_wile540.pdf

So I reverted to another ghostscript command I used to run in the past when merging 2 pdf files:

gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=combine.pdf -dBATCH w_wile540.pdf

I guess I wouldn´t have needed the “BATCH” part as there was nothing to merge. I applied it anyway, just out of laziness. I guess it didn´t hurt.

Now the result was much better indeed:

The resulting file was only 22.8 MB in size. Much better now.

Out of interest I took a look at the three variants:

ll combine.pdf w_wile540.pdf compressed_PDF_file.pdf
-rw-rw-r-- 1 rosika rosika 22M Mai  5 15:54 combine.pdf  # result of the second command
-rw-rw-r-- 1 rosika rosika 31M Mai  5 15:51 compressed_PDF_file.pdf  # result of the first command
-rw-rw-r-- 1 rosika rosika 30M Mai  5 15:44 w_wile540.pdf  # original file

and:

file combine.pdf w_wile540.pdf compressed_PDF_file.pdf 
combine.pdf:             PDF document, version 1.7
compressed_PDF_file.pdf: PDF document, version 1.4
w_wile540.pdf:           PDF document, version 1.7 (zip deflate encoded)

Interesting. The original file is marked as “zip deflate encoded”. The first command changed the PDF version to 1.4. The second command just stated version 1.7.

Any ideas why the first command didn’t actually make the PDF smaller?

Thanks for your opinions and many greetings from Rosika
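
A possible factor in the size difference, offered as an assumption rather than a diagnosis of this particular file: the first command selects the /prepress preset, which Ghostscript documents as aimed at high-quality print output (images kept at around 300 dpi), so it is not tuned for small files. The second command names no preset and therefore runs with the default settings. The presets /ebook (about 150 dpi) and /screen (about 72 dpi) downsample images more aggressively. A minimal sketch of a wrapper for trying them out; the function name gs_shrink is made up for illustration:

```bash
# Re-distill a PDF with a chosen Ghostscript quality preset.
# Usage: gs_shrink INPUT.pdf OUTPUT.pdf [PRESET]
# PRESET: screen, ebook, printer, prepress or default (falls back to ebook).
# gs_shrink is a hypothetical helper name, not part of Ghostscript.
gs_shrink() {
    local input="$1" output="$2" preset="${3:-ebook}"
    case "$preset" in
        screen|ebook|printer|prepress|default) ;;
        *) echo "unknown preset: $preset" >&2; return 2 ;;
    esac
    # -dNOPAUSE: do not stop after each page
    # -dBATCH:   exit when the input is processed instead of waiting at a prompt
    gs -sDEVICE=pdfwrite -dPDFSETTINGS="/$preset" \
       -dNOPAUSE -dQUIET -dBATCH \
       -sOutputFile="$output" "$input"
}
```

For example, `gs_shrink w_wile540.pdf ebook.pdf ebook` followed by `ls -lh` shows what a stronger preset does to the size. Lower presets trade image quality for size, so the result is worth checking by eye.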

Mina · May 5, 2024, 3:17pm

Usually, what makes PDF files big is the images they include.

In most cases these are already compressed, so not much can be achieved.

Where something can be done is in the text portion (which includes formatting) of the PDF.

To see whether anything can be done at all, I would recommend running zip or gzip on the file. If that doesn’t shrink it much, the PDF is already compressed and there’s no sensible way to decrease its size.
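
The gzip check can be sketched as a small function that compresses to a pipe, so no .gz file is left behind. The name gzip_ratio is made up for illustration:

```bash
# Estimate how much a file shrinks under gzip without writing a .gz file.
# Prints: original bytes, gzipped bytes, gzipped size as a percentage.
# gzip_ratio is a hypothetical helper name.
gzip_ratio() {
    local file="$1" orig packed
    orig=$(wc -c < "$file") || return 1
    packed=$(gzip -9 -c "$file" | wc -c)
    LC_ALL=C awk -v o="$orig" -v p="$packed" \
        'BEGIN { printf "%d %d %.1f%%\n", o, p, (o > 0 ? 100 * p / o : 0) }'
}
```

Running `gzip_ratio w_wile540.pdf` and getting a percentage close to 100% suggests the streams inside the PDF are already compressed. It does not rule out savings from downsampling the images, which is a different kind of reduction from what gzip measures.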

Rosika · May 5, 2024, 3:41pm

Hi Mina,

… and thanks a lot for your reply.
So nice to hear from you again. I hope you´re well.

I see.
That makes sense, of course.

Looking at the file I realized - unlike in many other e-books I downloaded in the past - there are quite a lot of images included, even in colour.
That would account for the relatively big size then.
The book has 339 pages in total.

Well, it´s a good thing then, that one of the ghostscript commands (the 2nd one) still could bring the size down to 22.8 MB.

I could save 8.2 MB after all.

Yes, that´s some good advice.

Thanks a lot, Mina.

Many greetings from Rosika

callpaul.eu · May 5, 2024, 4:49pm

I’m not going to answer your question, as I don’t know enough about compression technology.

But Adobe offers a free tool on its site:

https://www.adobe.com/acrobat/online/compress-pdf.html

I would normally use that rather than a command line, out of lack of knowledge on my part.

Rosika · May 6, 2024, 12:44pm

@callpaul.eu :

Thank you very much, Paul, for providing the adobe link.

So there´s an online tool for compressing PDF files.
Yet I wonder what they would be capable of achieving, taking into account Mina’s point that the included images are usually already compressed.

But it´s good to know there´s an online alternative.

Basically I´m quite happy with the ghostscript commands.
I was just wondering why the first one didn´t achieve any compression at all whereas the second one was able to reduce the file size more or less effectively… .

Never mind.
Thanks and many greetings from Rosika

nevj · May 6, 2024, 12:56pm

Hi Rosika,
There is a program called pdfsandwich
It is for pdf files obtained by scanning… these scanned pdfs
are all image, so they tend to be large.
pdfsandwich will convert these to text, using OCR, and tidy
up scanning issues like black edges.
You get a much smaller file, but it is only for scanned pdfs.
Regards
Neville

Rosika · May 6, 2024, 1:16pm

Hi Neville,

thank you very much for this information.

That was new to me, so I looked it up on ubuntuusers (in German; but can be translated with e.g. “TranslateLocally for Firefox” add-on).

It says:

pdfsandwich is a command line tool for creating searchable PDF files.
PDF files created with word processing programs can easily be searched, unlike with pure image templates, which were created, for example, with scanners for archiving (paperless office, digitisation of old documents, etc.).

Sounds like a beast of a programme. Glancing over the page (although in German) I found it hard to understand all of its intricacies.
But it’s interesting.

O.K. I´ll look into it.

Thanks again and many greetings from Rosika

Rosika · May 14, 2024, 2:29pm

Hi again,

today I downloaded another book via tradepub (offered by How-to-Geek newsletter).

The process one has to follow is:

place an order via the “offer” link provided by the HTG newsletter
receive an e-mail with the download link
enter this link in the browser´s URL bar and hit “Enter”

Then the download started.

But it wasn´t a “regular” download, the result of which you would find in the download folder. Rather it seems to have been a pdf file which was directly displayed in the browser.

To save the file either use “save page as” or “print” it as a pdf file. I used firefox for this purpose.

What struck me as odd is that the file used up over 80 MB to download.

Well, there is quite a variety of coloured images in it (545 pages altogether), but still…

Then I looked at my downloads folder and realized the filesize was considerably smaller than the amount of data used for the download.

After that I applied my 2 ghostscript commands to bring the size down even further.
Either of them would have sufficed. I used both to see what difference they would make.

Here´s a summary of the data involved:

download: over 80 MB
“printing” the downloaded pdf file via firefox: 32.2 MB
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/prepress -dNOPAUSE -dQUIET -dBATCH -sOutputFile=compressed_PDF_file.pdf downloaded.pdf : 16.5 MB
gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=combine.pdf -dBATCH downloaded.pdf : 8.5 MB
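
As a worked example of what these figures mean in relative terms, the saving against the 32.2 MB file from Firefox can be computed like this (percent_saved is a made-up helper name):

```bash
# Percentage saved when going from size A to size B (same unit for both).
# percent_saved is a hypothetical helper name.
percent_saved() {
    LC_ALL=C awk -v a="$1" -v b="$2" \
        'BEGIN { printf "%.1f\n", 100 * (a - b) / a }'
}

percent_saved 32.2 16.5   # first command:  prints 48.8
percent_saved 32.2 8.5    # second command: prints 73.6
```

So the first command roughly halves the file, and the second removes close to three quarters of it.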

I really couldn´t notice any difference regarding quality between the pdf file I got from the browser and either of the ghostscript commands.
I especially looked at the photographs which are contained in the e-book.
No noticeable degradation.

Still I don´t quite get it:

Why on earth did I need to download over 80 MB worth of data in the first place?
The data consumption seems to be a tad bit too high for my taste… .

Many greetings from Rosika

nevj · May 14, 2024, 11:23pm

Hi Rosika,
I have experienced the same thing, but I do not know the
explanation. What you detail is not a one-off.
Regards
Neville

daniel.m.tripp · May 15, 2024, 2:34am

I’ve seen “similar” when scanning documents to PDF…

e.g. a bunch of forms I’ve filled out (with pen and ink) and fed through my Brother MFC (I just save via FTP to my NAS) to PDF.

Ended up with like 30+ MB filesizes - and I’m expected to email that as an attachment and keep it under 5 MB… And the lame organisation didn’t make their PDF files “online forms” - so they had to be printed, then completed by hand.

So - I think I used some pdf2??? command line to convert the images to jpg or png… Then I used imagemagick (mogrify -resize) to downsize the image files to 50% or maybe even 25%, then I think I used imagemagick again to create a new PDF with the resized images (or maybe some other img2-PDF CLI).

I don’t have the pdf2??? or img2-PDF command line tools installed currently - as haven’t needed them for a couple of years…
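
The three steps described above (pages to images, downsize, rebuild the PDF) could look roughly like the sketch below. The tool choices are assumptions and not necessarily what was used back then: pdftoppm from poppler-utils for the page images, ImageMagick's mogrify for the resize, and img2pdf for the reassembly. The function name and the 300 dpi default are also made up for illustration.

```bash
# Shrink an image-only (scanned) PDF by re-rendering and downsizing its pages.
# Usage: shrink_scanned_pdf INPUT.pdf OUTPUT.pdf [SCALE] [DPI]
# Assumed tools: pdftoppm (poppler-utils), mogrify (ImageMagick), img2pdf.
# Any text layer in the input is lost, so this only suits pure scans.
shrink_scanned_pdf() {
    local input="$1" output="$2" scale="${3:-50%}" dpi="${4:-300}"
    local workdir status=0
    workdir=$(mktemp -d) || return 1
    # 1. render every page to a JPEG named page-N.jpg
    # 2. downsize the JPEGs in place
    # 3. bundle the JPEGs into a new PDF, in page order
    pdftoppm -jpeg -r "$dpi" "$input" "$workdir/page" &&
        mogrify -resize "$scale" "$workdir"/page-*.jpg &&
        img2pdf "$workdir"/page-*.jpg -o "$output" || status=$?
    rm -rf "$workdir"
    return "$status"
}
```

For example, `shrink_scanned_pdf document_000556.pdf small.pdf 25%` would try the more aggressive of the two scale factors mentioned above.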

If I was to do it again - I’d probably scan straight to JPG from my Brother MFC, via FTP to my NAS - then verify the images aren’t too huge - then - if the requirement was 100% for PDF - convert them to a single PDF.

Note :
just looked in my CLI history and :

sudo apt install pdftk
pdftk document_000556.pdf cat 1-endwest output pagey000.pdf
pdftk document_000563.pdf cat 1-endwest output pagey001.pdf

I think I was rotating a page or something?

So it was probably pdftk I used…

nevj · May 15, 2024, 3:04am

I wonder whether qpdf would do anything useful?

This whole thing needs a tidy up… it is unworkable

Rosika · May 15, 2024, 1:40pm

Hi again,

thanks @daniel.m.tripp and @nevj for your replies.

@nevj :

O.K., that´s good to know; Neville.

I was just wondering why the download would take up over 80 MB of data. Seems pretty insane to me, especially in view of the fact that saving it with firefox resulted in a 32.2 MB file.
And I´m not talking about my ghostscript commands here.

Oh well, we´ll have to accept the facts then.

No idea, Neville. I´ll have to look into it.

But I´m quite satisfied with my 2nd ghostscript command, which could bring the file size down to 8.5 MB. That´s pretty good for my archiving purposes, I think.

Thanks again, Neville

@daniel.m.tripp :

Oh dear. That’s pretty hefty for a PDF file which contains just one or two forms.

Yes, that´s the problem with e-mail providers. An attachment can´t be indefinitely huge in size.

Thanks for bringing up pdftk.

I also use that quite often.
There are some situations when I have to rotate a bunch of PDFs as well.
This command takes care of it quite effectively:

for f in *.pdf; pdftk $f cat 1-endeast output (string replace '.pdf' '_90.pdf' $f); end

(fish syntax)
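
For readers on bash rather than fish, a roughly equivalent sketch follows. The function name is made up, and the check that skips files already ending in _90.pdf is an addition that the fish one-liner does not have:

```bash
# Rotate every PDF in the current directory by 90 degrees clockwise,
# writing NAME_90.pdf next to each NAME.pdf (bash version of the fish loop).
# rotate_all_east is a hypothetical helper name.
rotate_all_east() {
    local f
    for f in *.pdf; do
        [ -e "$f" ] || continue                   # no PDFs: glob stayed literal
        case "$f" in *_90.pdf) continue ;; esac   # addition: skip earlier outputs
        pdftk "$f" cat 1-endeast output "${f%.pdf}_90.pdf"
    done
}
```

Quoting "$f" keeps file names with spaces intact, which the unquoted form would split in bash.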

Thanks, Dan.

Many greetings to all of you from Rosika

callpaul.eu · May 15, 2024, 6:43pm

I have started getting more PDF files from the bank which I am to sign and then return. I have not checked the size in either direction, but I imagine it must vary with the signatures. On many of them I sign page 1 and the details are then transferred to several other pages.

Along with images.

nevj · May 15, 2024, 11:23pm

Hi Rosika,
I will try and guess.
It is a packet switching network.
It puts small blobs of data in a packet which includes other things like the address it is to go to, the sending address, …
So each packet is larger than the data it sends
The download measuring software measures the total bits transmitted… including packet overhead… ie it measures the ‘traffic’
When the packets are received they are disassembled and the data extracted… it is amazingly complicated, because the packets may not even arrive in order… they have to be
reordered by the receiver.
There may be dropouts, packets may be lost, and may have to be re-sent… that adds to the bits transmitted.
There are messages sent as well as packets… each packet received correctly has to be acknowledged, otherwise it will be re-sent.
Can you see, therefore, that the ‘traffic’ will be larger than the file?
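
To get a feeling for how large the header overhead is, here is a rough calculation. The figures are assumptions, not measurements of this download: 1500-byte packets, a 20-byte IPv4 header, a 20-byte TCP header without options, and no retransmissions. The function name is made up.

```bash
# Rough on-the-wire size of a download, counting only IP and TCP headers.
# Assumed, not measured: 1500-byte packets, 20-byte IPv4 header,
# 20-byte TCP header without options, no retransmissions.
# wire_size_mb is a hypothetical helper name.
wire_size_mb() {
    LC_ALL=C awk -v mb="$1" 'BEGIN {
        mtu = 1500
        headers = 20 + 20
        payload = mtu - headers        # 1460 data bytes per packet
        printf "%.1f\n", mb * mtu / payload
    }'
}

wire_size_mb 32.2   # prints 33.1
```

Under these assumptions the headers add a little under 3%, so a 32.2 MB file would account for roughly 33 MB of traffic. Header overhead on its own would therefore not bridge the gap to 80 MB; retransmissions or other transfers would have to make up the rest.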

Not sure, but my best try. 80 MB still seems a lot.
There may have been a lot of retransmitted packets if some part of the internet connection was not working well.
Regards
Neville

Rosika · May 16, 2024, 12:40pm

Hi all,

@callpaul.eu :

Thanks for the info.

Signing PDFs normally involves printing them, signing them, scanning them and then sending them back, right?
Do you do it this way?

My routine for doing things like that is this:

I scanned my signature, so that I have it available as a JPG file (I only had to do that once)
I import the PDF document which needs to be signed in either gimp or (even better, because simpler) in xournal
I import the scanned signature here as well, put it in the right place and either merge the 2 layers (in gimp) or export the new file directly (in xournal) as a PDF.
Then I send the newly created PDF to wherever it needs to be.

I like doing it with xournal. It´s much simpler and quicker this way.

Saves me a lot of time, ink and nerves.

@nevj :

Thank you so much, Neville.

I certainly lack the expertise for analyzing the scenario in such detail.
Your explanation is very good and informative. It makes a lot of sense to me.

That would certainly explain the higher demand for data.

I certainly can, Neville.
I understand now.
But, like you said, 80 MB seems a lot. God knows what was happening in the background.
But you explained it very well.

I was just curious about a possible explanation because I have to take care of my downloads, data-wise.

But this time there was no way of knowing the download size in advance. In fact I tried:

firejail wget --spider 'download_URL'

to see beforehand how much data would be needed.
This command usually works very well, but this time the URL referenced a page which, when put into the URL bar of firefox, initiated the download and displayed the result as a PDF file.

So the wget command just got me:

HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Remote file exists and could contain further links,
but recursion is disabled -- not retrieving.
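
A similar check can be sketched with curl by asking only for the response headers and looking for Content-Length. The helper name is made up. When the server does not announce a length, as with the text/html response above, it can only report "unknown", so it shares the limitation of wget --spider.

```bash
# Read HTTP response headers on stdin and print the last Content-Length
# value in bytes, or "unknown" when the server did not send one.
# content_length is a hypothetical helper name.
content_length() {
    tr -d '\r' | LC_ALL=C awk -F': *' '
        tolower($1) == "content-length" { len = $2 }
        END { print (len == "" ? "unknown" : len) }'
}
```

Usage would be `curl -sIL 'download_URL' | content_length`, where -I requests headers only and -L follows redirects, so the value printed belongs to the final response.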

Without knowing the filesize in advance I was astonished to quite some extent that 80 MB was needed.

Oh well.
The main thing is I could bring down the filesize to 8.5 MB for archiving purposes.

Thanks a lot to all of you for your help.

Many greetings from Rosika

callpaul.eu · May 16, 2024, 1:03pm

No, I used to get them like that: print, copy, signature etc.

But now in Acrobat you can set up a document for signing, much like a Word form. When you click in the signature area, a toolbar appears with a pencil icon. Click that and a box appears on the form; you click and sign while keeping the mouse button pressed. That signs that spot, and if the field is linked through the document, the signature reappears on every page.

Typical forms here require you sign first and last page then initial every page.
At one stage you also had to write, bonne pour d’accord plus date and sign, but that has almost disappeared thankfully.

https://www.adobe.com/fr/acrobat/online/sign-pdf.html

Before we had another system

But that was more business usage due to costs

Rosika · May 16, 2024, 1:11pm

Hi Paul,

I see.
I didn´t know about the acrobat method. That was new to me, I have to admit.

Thanks for letting us know and thanks for the links as well.

Many greetings from Rosika

nevj · May 16, 2024, 1:49pm

Hi Rosika,

I found this

Apparently pdf files can contain fonts.
Your gs probably removed the fonts.

Regards
Neville

Rosika · May 16, 2024, 2:04pm

Hi Neville,

thank you so much for doing a lot of research on my behalf.

I’ll read through the website you kindly linked to.

I wouldn´t have thought of that, as the result of my final PDF seems to look exactly the same as the original.

Thanks a lot.
Many greetings from Rosika

nevj · May 16, 2024, 2:05pm

You must have the fonts on your system… otherwise it would look different
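
Whether fonts were actually dropped can be checked rather than guessed. A minimal sketch, assuming pdffonts from poppler-utils is installed; the function name is made up. pdffonts lists each font in a PDF, and its "emb" column says whether the font is embedded in the file.

```bash
# Print the font table of each given PDF so the files can be compared.
# Assumes pdffonts (poppler-utils); show_fonts is a hypothetical helper name.
show_fonts() {
    local f
    for f in "$@"; do
        printf '== %s ==\n' "$f"
        pdffonts "$f"
    done
}
```

Running `show_fonts downloaded.pdf combine.pdf` and comparing the "emb" columns would show whether the Ghostscript output still carries its fonts.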