Hi all,
if you're anything like me you might want to download some selected pages from the itsfoss forum for archiving purposes (like tutorials or what-have-you).
Normally I do it with the help of the wget command, like so:
wget -k -E "https://itsfoss.community/t/looking-for-a-special-linux-distro-for-installation-on-a-laptop/11273"
(-k converts the links in the downloaded page for local viewing, and -E makes sure the saved file gets an .html extension.)
It works well if the content is somewhat limited. For very long discussions, however, the URL changes after a while. The first page would be:
https://itsfoss.community/t/looking-for-a-special-linux-distro-for-installation-on-a-laptop/11273
but the next page would be:
https://itsfoss.community/t/looking-for-a-special-linux-distro-for-installation-on-a-laptop/11273?page=2
See the difference? The last part, ?page=2, is added.
If the conversation gets even longer, '2' becomes '3' and so on. But it's still the same URL, in principle.
Now I was asking myself: how would I get the whole thread without having to initiate 2, 3 or more separate downloads…
After a lot of thinking and researching I finally managed to come up with this script:
#!/bin/bash

echo "please enter URL for itsfoss thread"
read -r topic

URL="$topic"
OUTPUT_DIR="itsfoss_thread"
MAX_PAGES=40   # initial maximum number of pages to check

# Create the output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Loop to download pages
for page in $(seq 1 "$MAX_PAGES")
do
    PAGE_URL="$URL?page=$page"

    # Check if the page exists, and if not, break the loop
    if ! wget -q --spider "$PAGE_URL"; then
        break
    fi

    # Download the page
    wget -k -E -P "$OUTPUT_DIR" "$PAGE_URL"
done

# Sleep for 2 seconds
sleep 2

# Change to the output directory
cd "$OUTPUT_DIR" || exit

# Sort and concatenate the HTML files
cat $(ls -v *.html) > together.html
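For reference, a run might look like this (assuming you saved the script as, say, get_thread.sh; the file name is just an example):

chmod +x get_thread.sh
./get_thread.sh
please enter URL for itsfoss thread
https://itsfoss.community/t/looking-for-a-special-linux-distro-for-installation-on-a-laptop/11273

Afterwards you should find the individual pages plus together.html in the itsfoss_thread directory.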
This script will automate the process and allow you to download multiple pages of the discussion without manually changing the URL for each page.
- The MAX_PAGES variable sets an initial maximum number of pages to check. You can set it to a generous value; the script will keep checking for pages and stop as soon as it reaches a non-existing page (i.e. the forum's last page).
- The script uses wget --spider to check whether a page exists before attempting to download it. If the page does not exist, the loop breaks and the script ends (see the small demo after this list).
- So the script should be able to determine the actual number of pages and download them without manual adjustments, even if you don't know in advance how many pages the discussion thread has.
- cat $(ls -v *.html) > together.html: here ls -v sorts the HTML files in natural numerical order, ensuring that they are concatenated correctly even when there are both one- and two-digit numbers in the filenames (also shown in the demo below).
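Here is the small demo mentioned above. It is only an illustration with made-up file names, run in a scratch directory, but it shows both the exit status of wget --spider and the difference between plain ls and ls -v:

# check whether a page exists (wget --spider checks without downloading):
wget -q --spider "https://itsfoss.community/t/looking-for-a-special-linux-distro-for-installation-on-a-laptop/11273?page=2" && echo "page exists" || echo "no such page"

# natural sort vs. plain sort (made-up file names):
touch page1.html page2.html page10.html
ls *.html      # -> page1.html page10.html page2.html  (10 sorts before 2)
ls -v *.html   # -> page1.html page2.html page10.html  (numerical order)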
I have to admit I've been discussing this particular point with ChatGPT. I doubt I would've come up with it all by myself.
I hope you may find the script useful.
For me it was a good exercise in bash scripting after a while.
Many greetings from Rosika
P.S.:
The script is applicable to itsfoss.community content which is publicly available.
If you want to download private discussions it would fail, I guess, as you'd have to provide your credentials to the wget command.
This may prove to be a problem…
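A possible workaround which I haven't tested: log in with your browser, export the forum cookies to a cookies.txt file (there are browser extensions for that), and hand that file to wget with its --load-cookies option. Roughly like this (the thread URL is just a placeholder):

wget --load-cookies cookies.txt -k -E "https://itsfoss.community/t/some-private-thread/12345"

Whether that plays nicely with the forum software is another question, so please take it as an idea rather than a tested recipe.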