Minutely different results when downloading PDF via wget and browser

Rosika · July 30, 2020, 11:55am

Hi all,

I´d very much like to ask your opinion about a curious matter.

I´ve downloaded an e-book from distrowatch “Bash Special Characters” as a pdf-file (via tradepub.com).
You may find it here: https://www.tradepub.com/free/w_howt03/prgm.cgi?a=1 .

For the download I received a link via e-mail which I copied into my browser and the download started.

Out of curiosity I copied the link the browser page provides (for starting the download manually in case the automatic download fails) and started a second download in the terminal with wget.

Both ways worked well but when I performed an md5sum check on both files I got different results:

These are the two PDFs (I renamed them in order to make the whole thing clearer):

-rw-rw-r-- 1 rosika2 rosika2 1,6M Jul 29 16:57 index.pdf  # via wget
-rw-rw-r-- 1 rosika2 rosika2 1,6M Jul 29 16:55 w_howt03.pdf  # via browser

But the command md5sum *.pdf

got me the following results:

47e4dd0b6cded36edd527480774fc6ff  index.pdf
8c5d48f7db53bbf8f8eecb02365b1ea6  w_howt03.pdf

Hmm. They should be identical.
As a next step I extracted the text-part with pdftotext. Now the content of my folder looked like this:

-rw-rw-r-- 1 rosika2 rosika2 1,6M Jul 29 16:57 index.pdf
-rw-rw-r-- 1 rosika2 rosika2  17K Jul 29 18:02 index.txt
-rw-rw-r-- 1 rosika2 rosika2 1,6M Jul 29 16:55 w_howt03.pdf
-rw-rw-r-- 1 rosika2 rosika2  17K Jul 29 18:03 w_howt03.txt

Performing an md5sum check on the resulting text-files got me this:

md5sum *.txt
e4778d644c47203ff7ef31ad6e766701  index.txt
e4778d644c47203ff7ef31ad6e766701  w_howt03.txt

Wow. The text part is completely identical. So the difference must lie somewhere else.

Looking through the PDFs manually with evince showed no visible difference either.

As a last step I resorted to pdfinfo to see if I could spot anything there.
And bang! There´s just one single difference in the output:

File size:      1645832 bytes  # w_howt03.pdf; the one I downloaded via browser
File size:      1645848 bytes  # index.pdf; the one I downloaded manually via wget

It seems that the copy I got via wget is exactly 16 bytes larger than the one I downloaded with the browser.
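The arithmetic behind that figure can be checked in the shell. This is only a small sketch: the two sizes are the ones pdfinfo reported above, while `size_of` is a hypothetical helper (not part of the original post) for reading a file's byte count directly.

```shell
#!/usr/bin/env bash
# Sizes as reported by pdfinfo in the post above.
size_browser=1645832   # w_howt03.pdf (browser download)
size_wget=1645848      # index.pdf (wget download)

# Shell arithmetic: how many bytes larger is the wget copy?
extra=$((size_wget - size_browser))
echo "wget copy is ${extra} bytes larger"   # prints: wget copy is 16 bytes larger

# Hypothetical helper: byte size of any file, read via stdin so that
# wc prints only the number (tr strips padding some wc versions add).
size_of() { wc -c < "$1" | tr -d ' '; }
```

With the real files at hand, `size_of index.pdf` and `size_of w_howt03.pdf` should give the same two numbers without going through pdfinfo.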

It may not be especially important but it seems interesting. Why may that be?
Does anyone have any ideas?

Thanks in advance.
Greetings.
Rosika

Akito · July 30, 2020, 12:23pm

This is very curious, indeed.

I couldn’t find anything regarding this in a quick search. I am used to wrong links resulting in wrong downloads, which I assumed at first here, but both PDFs are actually functional and seem identical, even though they are technically different.

I think only someone who actually knows this tool in and out could answer this question properly. Perhaps open an issue on the wget repository.

01101111 · July 30, 2020, 12:30pm

to further @Akito’s comment about possibly opening an issue, it might be helpful to see if the outcome is reproducible with any other pdf (or multiple pdfs if you want to see if it is more than a one- or two-off occurrence) you can download in the same two ways.

Rosika · July 30, 2020, 12:40pm

@Akito and @01101111:

Thanks a lot to both of you for your comments.

Yes, quite true. I couldn´t find the slightest difference by taking a look at the PDFs.

Good idea. I´ll do that as soon as possible.

Well, it´s not an issue in the real sense.
To be honest I was reluctant to post it here in the first place, because, let´s be honest, it´s not much of a problem…

As soon as I can tell more I´ll post it here.

Thanks a lot.
Greetings.
Rosika

01101111 · July 30, 2020, 12:51pm

i ran a quick test on this pdf: http://www.linuxfromscratch.org/lfs/downloads/stable-systemd/LFS-BOOK-9.1-systemd.pdf and the md5sum was the same when i used wget as the one i had downloaded a few days earlier with my browser.

i realize none of that actually gets at the why. maybe part of the why is in the what? i tried running diff on two entirely different pdf’s and just got “Files pdf1 and pdf2 differ” so perhaps it doesn’t work on pdf’s the way it does with regular text files by listing the lines that differ and their content, but was wondering if you might get some hint as to what the difference was.

as i was typing that, i figured there was probably already a specific program for that: diffpdf.
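For binary files, `cmp` may be a better fit than `diff`, since it works byte by byte. The sketch below uses two small stand-in files (an assumption, not the real PDFs), where the second is the first plus 16 trailing bytes. Whether the real 16 bytes sit at the end of the wget copy is not known from this thread.

```shell
#!/usr/bin/env bash
# Stand-in files (assumption): b.bin is a.bin plus 16 extra trailing bytes.
workdir=$(mktemp -d)
printf 'same leading content' > "$workdir/a.bin"
{ cat "$workdir/a.bin"; printf '0123456789abcdef'; } > "$workdir/b.bin"

# cmp -s prints nothing; exit status 0 means identical, 1 means different.
if cmp -s "$workdir/a.bin" "$workdir/b.bin"; then
    verdict="identical"
else
    verdict="different"
fi
echo "$verdict"

# cmp -l lists every differing byte (offset, octal value in file 1,
# octal value in file 2). If one file is merely a prefix of the other,
# it lists nothing and only reports the early EOF on stderr.
differing=$(cmp -l "$workdir/a.bin" "$workdir/b.bin" 2>/dev/null | wc -l | tr -d ' ')
echo "differing bytes within the common length: $differing"

# In that prefix case the surplus is simply the tail of the longer file.
extra_tail=$(tail -c 16 "$workdir/b.bin")
echo "$extra_tail"
```

Run against the two real downloads, `cmp -l index.pdf w_howt03.pdf | head` would show where the first differing bytes are, which might hint at what was added.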

Rosika · July 30, 2020, 1:06pm

Hi and thanks,

Interesting. So you got the exact same copy of the PDF.

I found yet another way to compare two PDFs (see “Diff of two pdf files?” on Ask Ubuntu):

The author suggests the following syntax:

diff <(pdftotext -layout old.pdf /dev/stdout) <(pdftotext -layout new.pdf /dev/stdout)

This by the way is another thing I wanted to ask.
I don´t quite comprehend the syntax.
Any ideas how the /dev/stdout-part fits into the whole expression?

Greetings.
Rosika

01101111 · July 30, 2020, 1:30pm

i am not saying that command won’t work, but just looking at it seems to suggest that you would be diffing the same (depending on if you used -layout previously) output as the text files you created with pdftotext. my supposition is that the mystery 16 bytes are somewhere (somehow) in the layout.

i was unsure of the use of < which i found here:

python hello.py < foo.txt      # feed foo.txt to stdin for python

i have seen stdin and stdout mentioned previously, but never really understood the use or meaning. this article calls them streams (in the case of stdout and stderr) that accept input from the command shell.

my read on that part is the command is putting the text of the two pdf’s into stdout so it can then diff the output without having to create the actual text file in say a tmp directory and then diff those.
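That reading matches how bash process substitution works, as far as the bash manual describes it: `<(command)` runs the command and expands to a file name from which the command's stdout can be read. pdftotext takes the output file as its second argument, and `/dev/stdout` is a file name that refers to the process's own stdout, so the extracted text ends up in the substitution rather than in a .txt file. A minimal sketch with printf standing in for pdftotext (an assumption, so it runs without any PDFs):

```shell
#!/usr/bin/env bash
# <(command) expands to a file name; the exact name depends on the system.
fd_path=$(echo <(true))
echo "process substitution expands to: $fd_path"

# diff receives two such file names and reads each command's output from
# them, so no temporary text files are needed.
result=$(diff -s <(printf 'line1\nline2\n') <(printf 'line1\nline2\n'))
echo "$result"

# A plain "<" is different: it connects an existing file to stdin.
tmpfile=$(mktemp)
printf 'a\nb\nc\n' > "$tmpfile"
line_count=$(wc -l < "$tmpfile" | tr -d ' ')
echo "lines read from stdin: $line_count"   # prints: lines read from stdin: 3
```

The pdftotext man page also documents `-` as the output name for stdout, so `pdftotext -layout old.pdf -` should behave like the `/dev/stdout` form.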

of course a lot of that would depend on whether or not the command works

Akito · July 30, 2020, 1:57pm

I prefer interesting discussions over fixing someone’s “I can’t install Ubuntu LTS, please help, but I give you no more information than the distro name” for the 100th time.

Rosika · July 30, 2020, 2:13pm

Thanks again,

I just tried out the command once more (already did it yesterday). Yet this time I added the “-s” parameter which simply says if the two files are the same:

rosika2@rosika2-Standard-PC-i440FX-PIIX-1996:~/Desktop/kgw$ diff -s <(pdftotext -layout index.pdf /dev/stdout) <(pdftotext -layout w_howt03.pdf /dev/stdout)
Files /dev/fd/63 and /dev/fd/62 are identical
rosika2@rosika2-Standard-PC-i440FX-PIIX-1996:~/Desktop/kgw$ echo $?
0

So the command definitely works (as suggested).

BUT: It seems the “/dev/stdout”-part isn´t necessary after all:

rosika2@rosika2-Standard-PC-i440FX-PIIX-1996:~/Desktop/kgw$ diff -s <(pdftotext -layout index.pdf) <(pdftotext -layout w_howt03.pdf)
Files /dev/fd/63 and /dev/fd/62 are identical
rosika2@rosika2-Standard-PC-i440FX-PIIX-1996:~/Desktop/kgw$ echo $?
0

This got me the same result.
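One caveat about the shorter form, offered tentatively: according to the pdftotext man page, when no output file is given the text is written to a .txt file next to the PDF (index.txt, w_howt03.txt) rather than to stdout. If that is what happened here, both substitutions were empty, and diff would report them as identical whatever the PDFs contain. The sketch below uses `true`, which prints nothing, as a stand-in for such a call.

```shell
#!/usr/bin/env bash
# Stand-in (assumption): "true" prints nothing, like a command that
# writes its result to a file instead of stdout.
result=$(diff -s <(true) <(true))
echo "$result"

# Two empty streams always compare as identical, so the verdict says
# nothing about the inputs. Checking that a stream is non-empty first
# avoids this trap.
text=$(true)
if [ -z "$text" ]; then
    stream_state="empty"
else
    stream_state="non-empty"
fi
echo "stream is $stream_state"   # prints: stream is empty
```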

I understand that the command basically does the same as my manual diff-ing of the two text-files.
I was just curious about the syntax.

O.K. That makes sense. I guess we´ll have to leave it at that. But thanks so much for your insights.

Greetings.
Rosika

Rosika · July 30, 2020, 2:15pm

Hi Akito,

that´s really kind of you.
I already feared I was getting on everybody´s nerves with my question.

Many greetings.
Rosika

01101111 · July 30, 2020, 2:27pm

i also enjoy a well-presented and reasoned discussion from time to time

of course what you do with your own time is absolutely your prerogative, but did you try diffpdf? i ask purely out of curiosity to see if it might have shown anything.

further down the article about stdout there was a possibly relevant section about how the same command could present the information requested differently depending on whether or not it was piped through another command:

$ ls
Desktop Downloads Pictures screengrabs test vms
Documents Music Public Templates Videos

vs.

$ ls|cat
Desktop
Documents
Downloads
Music
Pictures
Public
screengrabs
Templates
test
Videos
vms

i’m not suggesting that wget piped the pdf through anything, but merely that it might be possible that how it dealt with the pdf could have been just slightly different than whatever command your browser used to do the same
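The ls behaviour comes from the program checking whether its stdout is a terminal. A script can make the same check with `[ -t 1 ]`. The function name below is hypothetical and only illustrates the mechanism. It says nothing about what wget or a browser actually do with a download.

```shell
#!/usr/bin/env bash
# Hypothetical function: report what stdout (file descriptor 1) is
# connected to. [ -t 1 ] is true only when fd 1 is a terminal.
describe_stdout() {
    if [ -t 1 ]; then
        echo "terminal"
    else
        echo "not a terminal"
    fi
}

# Called directly in an interactive shell this would print "terminal".
# Inside a pipe, stdout is a pipe instead, just like with "ls | cat".
piped=$(describe_stdout | cat)
echo "$piped"   # prints: not a terminal
```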

thank you as well for another interesting discussion.

Rosika · July 30, 2020, 2:41pm

Hi,

Actually I didn´t.
As soon as I found out about the above discussed lengthy command I used that.
But you´re right. I will try it and report back.

Wow, what a profound insight. That might really be an explanation.
I wouldn´t have come to this conclusion myself. Thanks a lot.

Greetings.
Rosika

P.S.:

I just wanted to thank you for the link to the Bash scripting cheatsheet. It seems awfully interesting.
As I write a few scripts myself from time to time (yet more or less at a beginner level) it certainly will come in handy.

Rosika · July 30, 2020, 3:20pm

Hi all,

I´ve now installed diffpdf and performed the comparison on the two PDFs with the command
diffpdf index.pdf w_howt03.pdf .

As a result I can say that diffpdf couldn´t find any differences either.

So my best guess is that going along with @01101111´s explanation makes the most sense.

Once again: thanks to all of you for discussing that matter with me.

Greetings.
Rosika

Akito · July 30, 2020, 4:35pm

That, however, would mean that wget or the browser needs to interpret the file as something, instead of just downloading a byte array or byte stream. This could lead to issues if a weird PDF is not interpreted correctly. I am speculating here, but I don’t see why a download program would need to interpret what it is downloading.

I think it’d be really helpful to ask the experts on the wget repository. I want to know the solution to this curiosity, as well.

Rosika · July 30, 2020, 4:59pm

Hi,

Yes, good point. Thanks.

I´d also like to submit the “issue” to the people in charge.
I looked a bit around and found two sites:

and
Bugs : wget package : Ubuntu .
Not quite sure which to use. Perhaps both of them.

I´ll keep you posted.

Greetings.
Rosika

01101111 · July 30, 2020, 5:10pm

part of the reason i poke my head in to discussions like this, where i don’t know if i know anything about the subject, is in hopes of finding something new to learn. try as i might to make it through a dry bash scripting tutorial or pdf, my mind wanders and i find myself doing something entirely different. situations like this where i can run commands i hadn’t seen or used before (like diff and pdftotext) stick in my brain much better.

thank you for taking the time to satisfy that one tiny curiosity.

my addition about wget and the piped vs stdout was purely speculation as well. my understanding of networking in general is fairly basic, but i suppose i was seeing a situation in which the whole document was spread across different packets and at some point had to be “reassembled” into the actual pdf.

or that could be the star trek talking?

1crazypj · July 30, 2020, 6:56pm

I would imagine it’s a form of reference for tradepubs so you can read it on their site?
Either that or an NSA reference in case you started looking for bomb making instructions (or something?)
Yep, still paranoid, and, I have nothing to hide ;o)

daniel.m.tripp · July 31, 2020, 9:14am

Fascinating topic… (no I’m not being facetious)…

I’d like to see more interesting things like this posted here…

I went to download that PDF, but got stymied by the whole “what’s your mother’s maiden name” interrogation (only kidding - but I am being facetious)…

I was going to see if I could repeat this behaviour - and - compare browser download (Google Chrome) vs wget, and compare these two to curl (which I kinda prefer - but I probably use wget and curl interchangeably - but “curl icanhazip.com” is one of my favourite things you can do on a UNIX or Linux server)…
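A three-way comparison like that could be scripted roughly as follows. This is only a sketch: `distinct_digests` is a hypothetical helper, and the files are tiny stand-ins created on the spot (an assumption), not real downloads made with a browser, wget and curl.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many distinct MD5 digests occur among
# the given files. 1 means all copies are byte-for-byte identical.
distinct_digests() {
    md5sum "$@" | awk '{print $1}' | sort -u | wc -l | tr -d ' '
}

# Stand-in "downloads" (assumption): two identical copies, one that differs.
workdir=$(mktemp -d)
printf 'payload' > "$workdir/browser.pdf"
printf 'payload' > "$workdir/wget.pdf"
printf 'payload plus extra' > "$workdir/curl.pdf"

distinct_digests "$workdir/browser.pdf" "$workdir/wget.pdf"   # prints: 1
distinct_digests "$workdir"/*.pdf                             # prints: 2
```

If all three real downloads gave a count of 1 for some other PDF, that would suggest the difference seen above comes from the server or the link rather than from the download tool.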

Rosika · July 31, 2020, 12:28pm

Hi,

The same with me.

You´re welcome. I also wanted to know if this to me hitherto unknown command produces any results which point to differences between the two PDFs.
Yet we´ve seen that this isn´t the case…

Rosika · July 31, 2020, 12:30pm

Hi,

thanks a lot. I´m glad that this topic is of interest to anyone else besides me.