Well, I then employed lynx for that matter (I'd like to stay within the terminal), and all I needed to do was hit the “d” key from the lynx history;
I immediately got a nice HTML document of the respective page which I could easily save.
So lynx doesn't seem to have any problems downloading the page, but wget doesn't have the right permissions.
That's odd, as I normally employ wget for downloading almost anything, and to the best of my recollection I never ran into any difficulties.
Also: thanks for the link. I read through the article. Although I think I already knew a bit about the topic, I could still learn something new.
That got me thinking…
… and suddenly it occurred to me to use another user-agent (Firefox's) with wget, as Firefox is perfectly able to access and display the page.
And now it worked:
Here's the terminal output:
firejail wget --debug --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/83.0.4103.61 Chrome/83.0.4103.61 Safari/537.36" "https://www.stationx.net/linux-command-line-cheat-sheet/"
Setting --user-agent (useragent) to Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/83.0.4103.61 Chrome/83.0.4103.61 Safari/537.36
DEBUG output created by Wget 1.21.2 on linux-gnu.
Reading HSTS entries from /home/rosika/.wget-hsts
URI encoding = ‘UTF-8’
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2023-04-12 15:29:40-- https://www.stationx.net/linux-command-line-cheat-sheet/
Resolving www.stationx.net (www.stationx.net)... 194.1.147.28, 194.1.147.68
Caching www.stationx.net => 194.1.147.28 194.1.147.68
Connecting to www.stationx.net (www.stationx.net)|194.1.147.28|:443... connected.
Created socket 3.
Releasing 0x0000559baa4b3e90 (new refcount 1).
Initiating SSL handshake.
Handshake successful; connected socket 3 to SSL handle 0x0000559baa4b5df0
certificate:
subject: CN=stationx.net
issuer: CN=R3,O=Let's Encrypt,C=US
X509 certificate successfully verified and matches host www.stationx.net
---request begin---
GET /linux-command-line-cheat-sheet/ HTTP/1.1
Host: www.stationx.net
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/83.0.4103.61 Chrome/83.0.4103.61 Safari/537.36
Accept: */*
Accept-Encoding: identity
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Date: Wed, 12 Apr 2023 13:29:40 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Vary: Accept-Encoding
x-powered-by: PHP/8.1.17
last-modified: Wed, 12 Apr 2023 07:39:26 GMT
cache-control: public, max-age=0, s-maxage=3600, stale-while-revalidate=21600
expires: Wed, 12 Apr 2023 13:29:40 GMT
vary: Accept-Encoding,Origin
wpx: 1
x-turbo-charged-by: LiteSpeed
X-Edge-Location: WPX CLOUD/FF
Server: WPX CLOUD/FF
X-Cache-Status: MISS
---response end---
200 OK
Registered socket 3 for persistent reuse.
URI content encoding = ‘UTF-8’
Length: unspecified [text/html]
Saving to: ‘index.html’
index.html [ <=> ] 617,71K 423KB/s in 1,5s
2023-04-12 15:29:42 (423 KB/s) - ‘index.html’ saved [632535]
rosika@rosika-Lenovo-H520e ~> echo $status
0
So that's it then. If for any reason wget isn't allowed to download something, try setting another user-agent.
I don't know if it works all the time, but it should be worth a try.
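In case it helps: instead of typing the long string every time, the user-agent can also be set permanently in wget's startup file. Just a small sketch using the wgetrc user_agent option (the string itself is only an example, of course):

# append a default user-agent to wget's per-user startup file
echo 'user_agent = Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36' >> ~/.wgetrc
# from now on a plain "wget <URL>" sends that string without needing --user-agent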
Hi Rosika,
Great thinking.
I am not up on user-agents… could you explain what they do?
From your output it seems they handle negotiation between your wget or browser and the website?
Regards
Neville
What is the default user-agent for wget?
I tried it.
On my server I started tail -f /var/log/nginx/access.log, ran a wget on my desktop against my server, and watched: "GET /webmail/ HTTP/1.1" 200 5462 "-" "Wget/1.21"
So it seems the default user agent is Wget/<version>.
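If you don't have a server log handy, there are also services that simply echo the request headers back, so something like this should show the same thing (httpbin.org is just one well-known example of such a service):

# the response body contains the user-agent the server received
wget -qO- https://httpbin.org/user-agent
# prints something like: {"user-agent": "Wget/1.21"}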
They are there to introduce your browser to the server. Based on that, the server may decide to serve different content: say, if there's a feature in Mozilla which Internet Explorer does not have, the server will render a page for Mozilla that benefits from that feature, but will not try to use that feature for Internet Explorer.
This requires proper cooperation between the website's developers, the hosting server's maintenance team, and of course the developers of that browser…
Sometimes this is all misused, and website owners sniff the user-agent string just to restrict access to content for no real reason. For example, my bank considered the browser I was using at the time, Firefox ESR, too old, insecure and incompatible (which is a wrong statement), so they served me a page reminding me to update my browser. Of course, after I changed the user-agent string, everything worked smoothly, which clearly proves that "incompatible" was a wrong statement. They sniffed the agent string but checked only the version number, and indeed that was way lower than a current Firefox version; they did not check any further, such as whether it runs on Windows, or whether it is a "normal" version or the ESR, etc.
Sometimes the website owner tries to sniff the OS from the user-agent string and serves the good content only to Windows users. One site (I can't tell now which it was; something entertaining/streaming) worked well with Windows browsers but not on Linux: changing the user-agent string to lie about the browser (pretending to run on Windows) solved the problem, and it then worked without any issue.
Of course, if there's a real reason (for example the site requires a Widevine level which is currently not supported on Linux), the site will not work in a Linux-based browser regardless of the user-agent string.
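Just to make the trick concrete from the command line (the string below is only an illustrative Windows Firefox user agent, not necessarily what I used back then; in the browser itself one would rather use an extension or the general.useragent.override preference):

# pretend to be Firefox on Windows; the UA string and URL are only examples
wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/112.0" -O page.html "https://example.com/"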
@kovacslt
Hi Laszlo,
Thanks for the effort you put into that.
It looks like my webpage education needs an update.
You have given me a good start
Regards
Neville
You are all welcome!
…and meanwhile I could dig up from my memories that it was Disney+ which did that user-agent abuse. There was a debate in a Hungarian forum where I got a little bit angry at D+, and looking at the "solution" worldwide proved I wasn't the only one thinking the same way:
“It’s looking more like a high level political problem than a technical one.”
Looking at a possible implementation, I could do this too with my NGINX instance:
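Roughly something like this; the rule and paths are only a sketch, not my actual configuration:

# inside the relevant server { } block one could add, for example:
#
#   if ($http_user_agent ~* "wget|curl") {
#       return 403;   # turn away clients based on the user-agent alone
#   }
#
# then test the configuration and reload nginx (on a systemd system):
sudo nginx -t && sudo systemctl reload nginx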
I'm amazed that you could "dig up" your memories. I myself am pretty much at a loss with this kind of thing. I'll have to "dig" on my computer to find anything I dealt with in the past.
I'll try to make (and keep) notes of those things.
It's interesting to look at our web servers' logs and see the mix of OSes, browsers and versions. There are lots and lots of bots out there too: Googlebot, Bingbot, Slurp (Yahoo) and Yandex all identify themselves with a user-agent string. We use Pingdom to monitor our web sites, and they have a user-agent string too.
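A quick way to get that overview from a log (assuming the usual combined log format, where the user agent is the last quoted field):

# count the user-agent strings seen in an nginx/Apache combined-format access log
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head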
Some people will be freaked out that companies are gathering information, but I really think this is justified. You have to build websites with your audience in mind. The freaky thing would be tying it to you personally.
Utilities like curl and wget have their own user agent but that can be customized, obviously.
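For example (the string and URL are just placeholders; curl's -A is short for --user-agent):

# the same spoofing with curl
curl -A "SomeCustomAgent/1.0" -o page.html "https://example.com/"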