Well, I then employed lynx for that matter (I'd like to stay within the terminal), and all I needed to do was hit the “d” key from the lynx history;
I immediately got a nice HTML document of the respective page which I could easily save.
So lynx doesn't seem to have any problems downloading the page, but wget doesn't have the right permissions.
That's odd, as I normally employ wget for downloading almost anything, and to the best of my recollection I never ran into any difficulties.
Also: thanks for the link. I read through the article. Although I think I already knew a bit about the topic, I could still learn something new.
That got me thinking…
… and suddenly it occurred to me to use another user-agent (Firefox's) with wget, as Firefox is perfectly able to access and display the page.
And now it worked:
Here's the terminal output:
firejail wget --debug --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/83.0.4103.61 Chrome/83.0.4103.61 Safari/537.36" "https://www.stationx.net/linux-command-line-cheat-sheet/"
Setting --user-agent (useragent) to Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/83.0.4103.61 Chrome/83.0.4103.61 Safari/537.36
DEBUG output created by Wget 1.21.2 on linux-gnu.
Reading HSTS entries from /home/rosika/.wget-hsts
URI encoding = ‘UTF-8’
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2023-04-12 15:29:40-- https://www.stationx.net/linux-command-line-cheat-sheet/
Resolving www.stationx.net (www.stationx.net)... 194.1.147.28, 194.1.147.68
Caching www.stationx.net => 194.1.147.28 194.1.147.68
Connecting to www.stationx.net (www.stationx.net)|194.1.147.28|:443... connected.
Created socket 3.
Releasing 0x0000559baa4b3e90 (new refcount 1).
Initiating SSL handshake.
Handshake successful; connected socket 3 to SSL handle 0x0000559baa4b5df0
certificate:
subject: CN=stationx.net
issuer: CN=R3,O=Let's Encrypt,C=US
X509 certificate successfully verified and matches host www.stationx.net
---request begin---
GET /linux-command-line-cheat-sheet/ HTTP/1.1
Host: www.stationx.net
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/83.0.4103.61 Chrome/83.0.4103.61 Safari/537.36
Accept: */*
Accept-Encoding: identity
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Date: Wed, 12 Apr 2023 13:29:40 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Vary: Accept-Encoding
x-powered-by: PHP/8.1.17
last-modified: Wed, 12 Apr 2023 07:39:26 GMT
cache-control: public, max-age=0, s-maxage=3600, stale-while-revalidate=21600
expires: Wed, 12 Apr 2023 13:29:40 GMT
vary: Accept-Encoding,Origin
wpx: 1
x-turbo-charged-by: LiteSpeed
X-Edge-Location: WPX CLOUD/FF
Server: WPX CLOUD/FF
X-Cache-Status: MISS
---response end---
200 OK
Registered socket 3 for persistent reuse.
URI content encoding = ‘UTF-8’
Length: unspecified [text/html]
Saving to: ‘index.html’
index.html [ <=> ] 617,71K 423KB/s in 1,5s
2023-04-12 15:29:42 (423 KB/s) - ‘index.html’ saved [632535]
rosika@rosika-Lenovo-H520e ~> echo $status
0
So that's it then. If for any reason wget isn't allowed to download something, try setting another user-agent.
I don't know if it works all the time, but it should be worth a try.
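In case it helps: instead of typing the long string every time, the user-agent can also be set permanently in wget's startup file. Just a small sketch using the wgetrc user_agent option (the string itself is only an example, of course):

# append a default user-agent to wget's per-user startup file
echo 'user_agent = Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36' >> ~/.wgetrc
# from now on a plain "wget <URL>" sends that string without needing --user-agent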
Hi Rosika,
Great thinking.
I am not up on user-agents… could you explain what they do?
From your output it seems they handle negotiation between your wget or browser and the website?
Regards
Neville
What is the default user-agent for wget?
I tried it.
On my server I started tail -f /var/log/nginx/access.log, ran a wget on my desktop against my server, and watched: "GET /webmail/ HTTP/1.1" 200 5462 "-" "Wget/1.21"
So it seems the default user agent is Wget/<version>.
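If you don't have a server log handy, there are also services that simply echo the request headers back, so something like this should show the same thing (httpbin.org is just one well-known example of such a service):

# the response body contains the user-agent the server received
wget -qO- https://httpbin.org/user-agent
# prints something like: {"user-agent": "Wget/1.21"}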
They are there to introduce your browser to the server. Based on that, the server may decide to serve different content: say, if there's a feature in Mozilla which Internet Explorer does not have, the server will render a page for Mozilla that benefits from that feature, but will not try to use that feature for Internet Explorer.
This requires proper cooperation between the website's developers, the hosting server's maintenance team, and of course the developers of that browser…
Sometimes this is all misused, and website owners sniff the user-agent string just to restrict access to content for no real reason. For example, my bank considered the browser I was using at the time, Firefox ESR, too old, insecure and incompatible (which is a wrong statement), so they served me a page reminding me to update my browser. Of course, after I changed the user-agent string, everything worked smoothly, which clearly proves that "incompatible" was a wrong statement. They sniffed the agent string but checked only the version number, and indeed that was way lower than a current Firefox version; they did not check any further, such as whether it runs on Windows, or whether it is a "normal" version or the ESR, etc.
Sometimes the website owner tries to sniff the OS from the user-agent string and serves the good content only to Windows users. One site (I can't tell now which it was; something entertaining/streaming) worked well with Windows browsers but not on Linux: changing the user-agent string to lie about the browser (pretending to run on Windows) solved the problem, and it then worked without any issue.
Of course, if there's a real reason (for example the site requires a Widevine level which is currently not supported on Linux), the site will not work in a Linux-based browser regardless of the user-agent string.
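Just to make the trick concrete from the command line (the string below is only an illustrative Windows Firefox user agent, not necessarily what I used back then; in the browser itself one would rather use an extension or the general.useragent.override preference):

# pretend to be Firefox on Windows; the UA string and URL are only examples
wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/112.0" -O page.html "https://example.com/"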
@kovacslt
Hi Laszlo,
Thanks for the effort you put into that.
It looks like my webpage education needs an update.
You have given me a good start
Regards
Neville
You are all welcome!
…and meanwhile I could dig up from my memories that it was Disney+ which did that user-agent abuse. There was a debate in a Hungarian forum where I got a little bit angry at D+, and looking at the "solution" worldwide proved I wasn't the only one thinking the same way:
“It’s looking more like a high level political problem than a technical one.”
Looking at a possible implementation, I could do this too with my NGINX instance:
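Roughly something like this; the rule and paths are only a sketch, not my actual configuration:

# inside the relevant server { } block one could add, for example:
#
#   if ($http_user_agent ~* "wget|curl") {
#       return 403;   # turn away clients based on the user-agent alone
#   }
#
# then test the configuration and reload nginx (on a systemd system):
sudo nginx -t && sudo systemctl reload nginx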
I'm amazed that you could "dig up" your memories. I myself am pretty much at a loss with this kind of thing. I'll have to "dig" on my computer to find anything I dealt with in the past.
I'll try to make (and keep) notes of those things.
It's interesting to look at our web servers' logs and see the mix of OSes, browsers and versions. There are lots and lots of bots out there too: Googlebot, Bingbot, Slurp (Yahoo) and Yandex all identify themselves with a user-agent string. We use Pingdom to monitor our web sites, and they have a user-agent string too.
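A quick way to get that overview from a log (assuming the usual combined log format, where the user agent is the last quoted field):

# count the user-agent strings seen in an nginx/Apache combined-format access log
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head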
Some people will be freaked out that companies are gathering information, but I really think this is justified. You have to build websites with your audience in mind. The freaky thing would be tying it to you personally.
Utilities like curl and wget have their own user agent but that can be customized, obviously.
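For example (the string and URL are just placeholders; curl's -A is short for --user-agent):

# the same spoofing with curl
curl -A "SomeCustomAgent/1.0" -o page.html "https://example.com/"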