Minutely different results when downloading PDF via wget and browser

@all:

As this topic keeps bugging me quite a lot (in an interesting sense, not an annoying one), I did the following:
To see how the PDF http://www.linuxfromscratch.org/lfs/downloads/stable-systemd/LFS-BOOK-9.1-systemd.pdf suggested by @01101111 behaves for me, I repeated his experiment.

I downloaded it via the browser (Chromium, BTW) and via wget. And just as @01101111 reported, the md5sums are exactly the same.
So I suppose there's little point in reporting a “bug” to the wget maintainers, as wget doesn't seem to be the culprit.

BUT:

As an additional step I thought: give the comparison another try by downloading another book from tradepub, in order to see if I could replicate the initial behaviour.

So I downloaded the book “Linux from Scratch” (version 7.4), which seems to be an earlier version of what @01101111 downloaded himself.

The results are the following:

rosika@rosika-Lenovo-H520e /m/r/f/D/D/p/prov> 
md5sum wget_Linux_from_Scratch.pdf browser_Linux_from_Scratch.pdf 
da65d66d0dfd995d7fd4f7e7327506b3  wget_Linux_from_Scratch.pdf
6ec4ff88e8884c61587e124af2e6181d  browser_Linux_from_Scratch.pdf

Once again: they aren't identical!

Furthermore, I looked into the file size:

rosika@rosika-Lenovo-H520e /m/r/f/D/D/p/prov> 
pdfinfo wget_Linux_from_Scratch.pdf | grep 'File size'
File size:      959330 bytes
rosika@rosika-Lenovo-H520e /m/r/f/D/D/p/prov> 
pdfinfo browser_Linux_from_Scratch.pdf | grep 'File size'
File size:      959314 bytes

And once again: the otherwise identical copy of the PDF is exactly 16 bytes larger when downloaded via wget.
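For what it's worth, `cmp` can pin down where two downloads actually diverge. A minimal sketch with stand-in files (not the real PDFs; the 16 extra bytes here are made up just to mirror the observed size difference):

```shell
# Create a stand-in "browser" copy and a "wget" copy that carries 16
# extra bytes, mirroring the observed 959314 vs. 959330 difference.
printf 'shared pdf content' > browser_copy.bin
{ printf 'shared pdf content'; printf '0123456789abcdef'; } > wget_copy.bin

# stat -c%s prints the size in bytes (GNU coreutils).
echo "$(( $(stat -c%s wget_copy.bin) - $(stat -c%s browser_copy.bin) )) bytes extra"
# prints: 16 bytes extra

# cmp -n limits the comparison to the shorter file's length; exit
# status 0 means the files are identical up to the appended tail.
cmp -n "$(stat -c%s browser_copy.bin)" wget_copy.bin browser_copy.bin && echo 'same prefix'
```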

So to sum up: yes, the described phenomenon can be replicated. With the same results. Fascinating. :wink:

Yet this seems to be restricted to downloads from tradepub (in the browser); a counterexample was provided by @01101111.

What a story. :smiley:

Greetings to you all.
Rosika :slightly_smiling_face:

i resisted downloading the pdf you previously mentioned for the same reason as @daniel.m.tripp: the registration questions just seemed to ask for a bit more than i wanted to part with. since it seemed possible the pdf was licensed or copyrighted, i didn't ask for the link you used, but it seems unlikely that the lfs 7.4 pdf you downloaded would be. if you don't mind sharing that link, i would give the process a go with firefox and wget.

of course it also gets interesting (or so it would seem) if you add in curl as suggested previously by @daniel.m.tripp and other browsers.

Hi,

O.K. No problem. Thanks for the effort.

Just to be clear: which link exactly do you need?

if i was reading correctly that you were able to download it from tradepub, that was the link/pdf i was referring to :slight_smile:

Hi,

here's the link I got from tradepub.

If you click on the link, you'll be forwarded directly to the tradepub site and the download begins automatically.
That's the browser download.

On the same page it says: “Linux from Scratch should download immediately. If it doesn’t, please click here to force the download.”

Copy the link behind the “click here” part to the clipboard to use it for the wget download.

Thanks and good luck.
Greetings.
Rosika :slightly_smiling_face:

Hi again,

it seems to be getting more and more interesting.
Now I tried curl.
Here are the results:

md5sum curl_Linux_from_Scratch.pdf wget_Linux_from_Scratch.pdf browser_Linux_from_Scratch.pdf 
da65d66d0dfd995d7fd4f7e7327506b3  curl_Linux_from_Scratch.pdf
da65d66d0dfd995d7fd4f7e7327506b3  wget_Linux_from_Scratch.pdf
6ec4ff88e8884c61587e124af2e6181d  browser_Linux_from_Scratch.pdf

Just look at that: curl behaves the same way as wget does. Both copies are identical.

And:

pdfinfo curl_Linux_from_Scratch.pdf | grep 'File size'
File size:      959330 bytes
pdfinfo wget_Linux_from_Scratch.pdf | grep 'File size'
File size:      959330 bytes
rosika@rosika-Lenovo-H520e /m/r/f/D/D/p/prov> 
pdfinfo browser_Linux_from_Scratch.pdf | grep 'File size'
File size:      959314 bytes

The same goes for the file size.
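Matching md5sums strongly suggest the copies are byte-identical, and `cmp` can confirm that directly. A sketch with stand-in files (the real PDF names above are from my machine):

```shell
# Two byte-identical stand-ins for the curl and wget downloads.
printf 'identical payload' > curl_copy.bin
printf 'identical payload' > wget_copy.bin

# cmp prints nothing and exits 0 when the files match byte for byte.
cmp curl_copy.bin wget_copy.bin && echo 'byte-identical'
# prints: byte-identical
```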

Greetings.
Rosika :slightly_smiling_face:

I was actually suggesting to “misuse” the bug tracker for a question regarding this topic. The problem is that if you asked in a normal forum, as is the case now, the probability is almost zero that you would find someone who knows wget deeply enough to answer the question well.

Hi,

thanks for the clarification.
I see.
Taking the latest findings into account, it seems that doing the same for curl might (or might not) yield results as well,
'cos it turned out that wget and curl actually fetched the exact same PDFs.

Greetings.
Rosika :slightly_smiling_face:

looks like there is a timestamp or timeout included in the link/email. i was able to get a copy with firefox, but epiphany took me to another download page where it asked me to register, and wget grabbed an html doc that led to the same or a similar page. i tried the initial link again and got:

The link to ‘Linux from Scratch’ has expired. Please register below to get your free download.

Hi,

I'm sorry to hear that.
It's the same for me. So I just re-requested the PDF and got the following link:
https://www.tradepub.com/?p=w_linu01&w=d&email=3bernhard@tempr.email&key=SZEc7dk8qx98XkSfHUBD&ts=7245&u=0821130991791596207244&e=M2Jlcm5oYXJkQHRlbXByLmVtYWls&s=myacct

It says “open now”.
I haven't clicked on it yet, in order to give you a chance in case you want to try it again.

Cheers.
Rosika :roll_eyes:

no apologies necessary. it is just an interesting diversion to look into :slight_smile:

i am getting a similar html doc from wget. there is no mention of link expiration this time, just an offer to download another copy. i admit that i hadn't used it or curl before. are you adding any options or just wget plus the link?

Well, I always use wget this way:

firejail wget "[link]"

Of course the firejail part isn't really necessary. I tend to firejail almost everything. A bit paranoid, perhaps. :blush:
Yet even the developer provides a dedicated profile for it (wget.profile), so why not use it…

For your purposes I think this command should do (hopefully):

wget “https://www.tradepub.com/?p=w_linu01&w=d&email=3bernhard@tempr.email&key=SZEc7dk8qx98XkSfHUBD&ts=7245&u=0821130991791596207244&e=M2Jlcm5oYXJkQHRlbXByLmVtYWls&s=myacct

It's best to use the inverted commas (in case of any weird characters in the link that the shell would otherwise interpret).
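To illustrate why the quoting matters: unquoted, the shell treats each & in the link as a command separator that backgrounds whatever came before it, so wget would only ever see the URL up to the first &. A small sketch (with a made-up URL, not the real tradepub link):

```shell
# Quoting passes the URL through untouched: the '&' characters no
# longer split the command line into background jobs.
url='https://example.com/?p=w_linu01&w=d&key=abc123'
echo "$url"
# prints the full URL, including everything after the first '&'
```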

Good luck.
Rosika :slightly_smiling_face:

i was able to grab another copy with epiphany (browser, aka GNOME Web) when wget wouldn't work. after i got your response i tried to copy and paste with the quotes you used. i believe that yielded the same html doc. curl refused to work with those quotes. eventually wget with single quotes gave different terminal output, but the same html doc. curl also ran with single quotes, but by that point it appears the link had timed out again.

i remember sending something to myself with firefox send and one of the options was to set either a time limit or number of tries. it is possible this link is using a similar method or methods. while frustrating in this particular instance, it makes sense to not give away bandwidth when the same user should just be able to make copies of the document :slight_smile:

in spite of all of that, i did run md5sum on the copies i got from firefox and epiphany with an interesting difference:

$ md5sum *.pdf
6ec4ff88e8884c61587e124af2e6181d epiphany-w_linu01.pdf
da65d66d0dfd995d7fd4f7e7327506b3 ff-document.pdf

my firefox (ff) result matches yours from wget and curl

da65d66d0dfd995d7fd4f7e7327506b3 ff-document.pdf

but differs from my epiphany result which matches yours from chromium:

6ec4ff88e8884c61587e124af2e6181d epiphany-w_linu01.pdf

the file sizes from pdfinfo also match across those two different sets.

it would have been interesting to see what my wget and curl sums were, but i think this throws at least a minor monkey wrench into the working hypothesis that it is just a difference between browser- and terminal-fetched documents.

Hi again,

I'm so sorry that neither wget nor curl would work the way we intended.
After all, you put so much effort into trying. It's a real shame. :slightly_frowning_face:

O.K. Some success at last. Phew.

Your findings are very interesting indeed.

First of all: there's a difference when downloading with different devices/browsers. That confirms my findings. Great.

And then:

Wow. Not sure what to make of that but it´s very interesting.

It gets better and better. :wink:

O.K. We have verified that the PDFs are (slightly) different with your method as well. That's really great.
At least it's safe to say we have done as much as we could.

Thank you so much for your time and effort in investigating this matter.
And: so sorry that my links didn't work the way they should have.

Many greetings.
Rosika :slightly_smiling_face:

I forgot that you can actually look at it from a Base16 (hexadecimal) perspective.
Since it is almost certain that something gets either pre- or appended to the PDF file by wget and curl, you can look at the difference between the first and last 16 bytes of the PDFs. There should be your answer.
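For example, with a hex viewer such as `xxd` it could look like this (a stand-in file below, since I don't have the PDFs; substitute the real file names):

```shell
# Build a stand-in file that ends, like the suspect PDF, with %%EOF
# followed by an HTML tail (this mirrors the hypothesis, not real data).
printf 'fake pdf body...%%%%EOF\n</body>\n</html>' > sample.bin

# Dump the first and last 16 bytes in hex; anything pre- or appended
# to the document shows up here.
head -c 16 sample.bin | xxd
tail -c 16 sample.bin | xxd
```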

nice call on the hex viewer. interesting tool :toolbox:

the copy i downloaded with epiphany:

000ea340│ 78 72 65 66 0d 0a 31 31 ┊ 36 0d 0a 25 25 45 4f 46 │xref__11┊6__%%EOF│
000ea350│ 0d 0a                   ┊                         │__      ┊        │

the one i got with firefox:

000ea340│ 78 72 65 66 0d 0a 31 31 ┊ 36 0d 0a 25 25 45 4f 46 │xref__11┊6__%%EOF│
000ea350│ 0d 0a 0a 3c 2f 62 6f 64 ┊ 79 3e 0a 3c 2f 68 74 6d │___</bod┊y>_</htm│
000ea360│ 6c 3e                   ┊                         │l>      ┊        │

so basically: the extra bytes are the closing tags "</body>" and "</html>" of an html document (if i type the raw tags here, the forum swallows everything between < and >).

Hi @Akito,
what a wonderful idea.
Lacking the underlying knowledge (I hadn't used a hex viewer so far :blush:), it would never have occurred to me to use one.

Yes, I thought so, too.
Thanks a lot for the hint.

Greetings.
Rosika :slightly_smiling_face:

Hi,

Following your example, I could successfully replicate your findings with my on-board hex viewer xxd.
I did the following:

xxd browser_Linux_from_Scratch.pdf output1.txt
xxd wget_Linux_from_Scratch.pdf output2.txt

and then:

diff output1.txt output2.txt 
59958c59958,59959
< 000ea350: 0d0a                                     ..
---
> 000ea350: 0d0a 0a3c 2f62 6f64 793e 0a3c 2f68 746d  ...</body>.</htm
> 000ea360: 6c3e                                     l>

Wonderful. :smiley:

There´s our answer at last.

I have no idea why that difference exists or what effect it has, however. At least it doesn't seem to serve any crucial function.
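One way to double-check that those trailing bytes are the only difference: cut the longer file down to the shorter one's size and compare the checksums again. A sketch with stand-in files (not the real PDFs from my runs above):

```shell
# 'short.bin' stands in for the browser copy; 'long.bin' for the wget
# copy with the observed "\n</body>\n</html>" tail (16 bytes) appended.
printf 'shared pdf bytes' > short.bin
{ cat short.bin; printf '\n</body>\n</html>'; } > long.bin

# Truncate the long copy to the short copy's size...
head -c "$(stat -c%s short.bin)" long.bin > truncated.bin

# ...and the md5sums now agree.
md5sum short.bin truncated.bin
```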

Many thanks for your help as well.

Greetings.
Rosika :slightly_smiling_face:

I still want to know the answer to the “why” of this situation. :laughing:

Hi,

me too. That was my original endeavour.
But all the things I learnt by investigating this matter with all of you made it worthwhile. :+1:

Greetings.
Rosika :slightly_smiling_face:
