Question about ghostscript command

Hi all, :wave:

On GNU/Linux.ch I recently found an interesting article dealing with AI, “Zum Wochenende: Dream Machines” (in German, though):

Artificial intelligence is the most beautiful and at the same time most threatening concept in the world.
It frightens and impresses at the same time. In principle, it means the simulation of mental processes by all means,
but in general it is one or two forms of computer simulation […]

(translation via “TranslateLocally for Firefox” add-on)

The article also provides a download link for a very interesting booklet (132 pages) as a PDF file.

Here the “fun” begins. :wink:

I downloaded the PDF with Firefox this way: I opened the respective link and the PDF was displayed in a new tab.
Then I “printed” it to a PDF file from within Firefox. That resulted in a 22.8 MB PDF.

Hmm, I know that the direct download with wget often enough gets me a smaller PDF. So I also downloaded it with wget:

wget "http://worrydream.com/refs/Nelson-ComputerLibDreamMachines1975.pdf"

The direct download yielded a PDF of just 15 MB. Well, that's better indeed. :smiley:

From experience I know that using the ghostscript command on a PDF can also reduce the filesize. In the past I used to employ ghostscript for combining two PDFs, like so:

gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=combine.pdf -dBATCH [FILE1].pdf [FILE2].pdf

I also know that reducing the size of a PDF works without combining two files; it works on a single file, too.
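
For reference, a minimal sketch of the single-file case. The function name `shrink_pdf` is hypothetical, and the `-dPDFSETTINGS` preset is an assumption: it is not part of the commands used in this thread. As far as I know, without such a preset pdfwrite keeps its default settings and does not downsample images aggressively, so it may be worth trying.

```shell
# Sketch: rewrite one PDF with Ghostscript's pdfwrite device and a size-oriented preset.
# Assumption: /ebook (images downsampled to roughly 150 dpi) is acceptable quality.
# Other presets include /screen, /printer and /prepress.
shrink_pdf() {
    # $1 = input PDF, $2 = output PDF, $3 = optional preset (default /ebook)
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
       -dPDFSETTINGS="${3:-/ebook}" \
       -sOutputFile="$2" "$1"
}
```

Usage would be something like `shrink_pdf Nelson-ComputerLibDreamMachines1975.pdf small.pdf /screen`; whether the result is actually smaller depends on the content of the PDF.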

So I tried it on the two PDFs from above.

Yet there is a difference:

file alt_Nelson-ComputerLibDreamMachines1975.pdf Nelson-ComputerLibDreamMachines1975.pdf

alt_Nelson-ComputerLibDreamMachines1975.pdf: PDF document, version 1.5  # via print function in firefox
Nelson-ComputerLibDreamMachines1975.pdf:     PDF document, version 1.6 (zip deflate encoded)  # directly via wget

… and there is a difference when trying to shrink them with ghostscript:

The first one (via Firefox's print-to-PDF function) ran through smoothly :+1: :

gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=combine.pdf -dBATCH alt_Nelson-ComputerLibDreamMachines1975.pdf
GPL Ghostscript 9.55.0 (2021-09-27)
Copyright (C) 2021 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 132.
Page 1
Page 2
Page 3
[...]

… but the combine.pdf got bigger instead of becoming smaller.

The second one (directly via wget) yields a huge number of status messages but finally also gets the job done (on the 2nd attempt, though).

gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=2combine.pdf -dBATCH Nelson-ComputerLibDreamMachines1975.pdf
GPL Ghostscript 9.55.0 (2021-09-27)
Copyright (C) 2021 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 132.
[...]
Loading CourierNewPS-BoldMT font from /usr/share/fonts/truetype/msttcorefonts/Courier_New_Bold.ttf... 10056724 8550450 18765336 16743137 4 done.
Can't find (or can't open) font file /usr/share/ghostscript/9.55.0/Resource/Font//usr/share/gho.
Can't find (or can't open) font file Candara-Italic.
Loading Candara-Italic font from /usr/share/fonts/truetype/litefonts/Candarai.ttf... 10076924 8561114 19648940 17571273 4 done.
Substituting font NewCenturySchlbk-Bold for CenturySchoolbook-Bold.
Page 122
Substituting font Helvetica-Bold for SimSun,Bold.
Substituting font Helvetica for SimSun.
Substituting font Helvetica-Oblique for SimSun,Italic.
Substituting font NewCenturySchlbk-BoldItalic for CenturySchoolbook-BoldItalic.
Substituting font NewCenturySchlbk-Roman for CenturySchoolbook.
[...]

Its output also got bigger instead of smaller (17.8 MB).

Summary:

22 MB -> 26 MB (Firefox print version)
15 MB -> 17 MB (wget version)

ll
total 107M
-rw-rw-r-- 1 rosika rosika  17M Apr 17 17:49 2combine.pdf
-rw-rw-r-- 1 rosika rosika  22M Apr 16 17:06 alt_Nelson-ComputerLibDreamMachines1975.pdf
-rw-rw-r-- 1 rosika rosika  26M Apr 17 18:10 combine.pdf
-rw-rw-r-- 1 rosika rosika  15M Aug 10  2013 Nelson-ComputerLibDreamMachines1975.pdf
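
The growth can also be expressed as a percentage. The following is only a small sketch; the helper name `size_change` is made up for illustration, and it assumes the original file is not empty.

```shell
# Sketch: print the size change from an original file to a processed file.
size_change() {
    # $1 = original file, $2 = processed file
    before=$(( $(wc -c < "$1") ))   # arithmetic expansion strips any padding from wc
    after=$(( $(wc -c < "$2") ))
    # Integer arithmetic; the percentage is truncated toward zero.
    change=$(( (after - before) * 100 / before ))
    printf '%s -> %s bytes (%+d%%)\n' "$before" "$after" "$change"
}
```

For example, `size_change Nelson-ComputerLibDreamMachines1975.pdf 2combine.pdf` would compare the wget download with its Ghostscript output.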

I guess this unusual behaviour has something to do with unusual fonts used in the original PDF. O.K., if I accept that as an explanation, this question still remains:
Why is so much font substitution going on in only one PDF but not in the other :question:
I guess the font substitution is reflected in the newly created 2combine.pdf.

Perhaps because of different types of PDFs:

file *
2combine.pdf:                                PDF document, version 1.7
alt_Nelson-ComputerLibDreamMachines1975.pdf: PDF document, version 1.5
combine.pdf:                                 PDF document, version 1.7, 132 pages
Nelson-ComputerLibDreamMachines1975.pdf:     PDF document, version 1.6 (zip deflate encoded)
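
As far as I understand it, the version that `file` reports comes from the header at the very start of each PDF, so it mostly reflects which program wrote the file. A minimal sketch to read that header directly (the function name `pdf_header_version` is hypothetical):

```shell
# Sketch: print the PDF version declared in a file's header.
pdf_header_version() {
    # A PDF starts with a header such as "%PDF-1.6"; only the first 8 bytes are needed.
    head -c 8 "$1" | sed -n 's/^%PDF-\([0-9]\.[0-9]\).*/\1/p'
}
```

Running it over all four files, e.g. `for f in *.pdf; do printf '%s: %s\n' "$f" "$(pdf_header_version "$f")"; done`, should agree with the `file` output above.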

What do you think of it?

Many greetings from Rosika :slightly_smiling_face:

I have experienced that. Printing to PDF always leads to a larger .pdf file.

The file downloaded with wget has zip deflate encoding. That is a form of compression…
So I think when gs processed it, it may have decompressed it, hence it became larger.

I don't see why the print-to-PDF version would have become larger with gs.
gs is an old program and I am not sure how well updated it is. It may produce an old PDF version.

You are right about font substitution, that can change sizes.

Try qpdf --check on the files. It may give you some info.
See also qpdf --linearize.

https://qpdf.readthedocs.io/en/stable/
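
A minimal sketch of how the two qpdf calls could be combined. The wrapper name `check_and_linearize` is made up for illustration. Note that linearizing reorganizes a file for fast web viewing, so it is not guaranteed to make it smaller.

```shell
# Sketch: check a PDF with qpdf, then write a linearized copy.
check_and_linearize() {
    # $1 = input PDF, $2 = output PDF
    # qpdf --check exits non-zero on errors or warnings, so the second
    # step only runs when the check came back clean.
    qpdf --check "$1" && qpdf --linearize "$1" "$2"
}
```

For example: `check_and_linearize Nelson-ComputerLibDreamMachines1975.pdf linearized.pdf`.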

Regards
Neville


I’ve come across insanity like this before (e.g. printing a 25 KB MS Word document to PDF and getting a 10 MB file too large to email) - can’t remember what I did with these.

But when it’s happened to me in the past with PDFs that have lots of images (i.e. EVERY page is a scanned bitmap), I’ve used CLI tools to extract the images (JPG?), downsampled them (more compression, lossy, converted to monochrome) and then converted them back to PDF again. That way I’ve reduced e.g. a 25 MB PDF file to under 5 MB… **

It’s worth doing, there are still email systems out there that barf on 5+ MB attachments…

** I think there are some neat CLI PDF tools - but I mostly used imagemagick (either convert, or mogrify) on the image files…
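
The workflow described above could look roughly like the sketch below. It is not the exact set of commands that was used; it assumes `pdfimages` (from poppler-utils) and ImageMagick 6 are installed, that every page image is stored as a JPEG inside the PDF, and the function name and the 50% / quality 60 values are arbitrary choices for illustration.

```shell
# Sketch: extract page images, downsample them, and rebuild a PDF.
downsample_scanned_pdf() {
    src="$1"
    dst="$2"
    work=$(mktemp -d) || return 1
    # Writes page-000.jpg, page-001.jpg, ... (JPEG-encoded images only).
    pdfimages -j "$src" "$work/page" || return 1
    # Halve the resolution, convert to grayscale, recompress in place.
    # (-monochrome instead of -colorspace Gray would give pure black and white.)
    mogrify -resize 50% -colorspace Gray -quality 60 "$work"/page-*.jpg || return 1
    # Combine the images into one PDF. On some distributions ImageMagick's
    # security policy blocks writing PDFs, so this step may need adjusting.
    convert "$work"/page-*.jpg "$dst"
}
```

A call such as `downsample_scanned_pdf scan.pdf scan-small.pdf` would leave the original untouched. Any text layer in the original PDF is lost this way.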


Hi and thanks for your replies, :wave:

@nevj :

Thanks, Neville, for the confirmation. That's good to know.

I see. That sounds plausible indeed.

Thanks for the suggestion.

  • for the firefox print/PDF version:
qpdf --check alt_Nelson-ComputerLibDreamMachines1975.pdf 
checking alt_Nelson-ComputerLibDreamMachines1975.pdf
PDF Version: 1.5
File is not encrypted
File is not linearized
No syntax or stream encoding errors found; the file may still contain
errors that qpdf cannot detect
  • for the wget version:
qpdf --check Nelson-ComputerLibDreamMachines1975.pdf 
checking Nelson-ComputerLibDreamMachines1975.pdf
PDF Version: 1.6
File is not encrypted
File is not linearized
No syntax or stream encoding errors found; the file may still contain
errors that qpdf cannot detect

They appear to be the same except for the PDF version.

O.K., I still have to do that. Thanks for the link, Neville. :heart:

@daniel.m.tripp :

Thanks, Dan, for your comments as well.

Wow, that's impressive, Dan. :+1:

I also like imagemagick a lot. I think it's also used (by default) in w3m to display images if needed.

@all:

Well, I think the main part is pretty clear to me now. Thanks a lot for your help.

The only thing that still puzzles me is why I get all those font substitution messages only with the wget-downloaded file and not with the Firefox print-to-PDF version… :thinking:

Of course that's not all that important. I was just wondering.
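
One way to look into this might be to list which fonts each PDF references without embedding them, since Ghostscript has to substitute fonts that are not embedded in the file. This is only a sketch: it assumes `pdffonts` from poppler-utils is installed and that its table has the usual columns (name, type, encoding, emb, sub, uni, object ID). The function name is made up.

```shell
# Sketch: print the names of fonts that a PDF references but does not embed.
list_unembedded_fonts() {
    # pdffonts prints two header lines, then one row per font.
    # Counting from the end of each row (the object ID takes two fields),
    # the "emb" column is the fifth field from the right.
    pdffonts "$1" | awk 'NR > 2 && $(NF-4) == "no" { print $1 }'
}
```

If `list_unembedded_fonts Nelson-ComputerLibDreamMachines1975.pdf` prints names like the ones in the substitution messages while the Firefox version prints nothing, that would point to embedding as the difference. That is a guess, though, not something verified here.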

Many greetings from Rosika :slightly_smiling_face:
