Hi all,
if you're anything like me you might want to download some selected pages from the itsfoss forum for archiving purposes (like tutorials or what-have-you).
Normally I do it with the help of the wget command, like so:
wget -k -E "https://itsfoss.community/t/looking-for-a-special-linux-distro-for-installation-on-a-laptop/11273"
(-k converts the links in the downloaded page for local viewing, and -E makes sure the saved file gets an .html extension.)
It works well if the content is somewhat limited. For very long discussions, however, the URL changes after a while. The first page would be:
https://itsfoss.community/t/looking-for-a-special-linux-distro-for-installation-on-a-laptop/11273
but the next page would be:
https://itsfoss.community/t/looking-for-a-special-linux-distro-for-installation-on-a-laptop/11273?page=2
See the difference? The last part, ?page=2, is added.
If the conversation gets even longer, '2' becomes '3' and so on. But it's still the same URL, in principle.
Now I was asking myself: how would I get the whole thread without having to initiate 2, 3 or more separate downloads…
After a lot of thinking and researching I finally managed to come up with this script:
#!/bin/bash

echo "please enter URL for itsfoss thread"
read -r topic

URL="$topic"
OUTPUT_DIR="itsfoss_thread"
MAX_PAGES=40   # initial maximum number of pages to check

# Create the output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Loop to download pages
for page in $(seq 1 "$MAX_PAGES")
do
    PAGE_URL="$URL?page=$page"

    # Check if the page exists, and if not, break the loop
    if ! wget -q --spider "$PAGE_URL"; then
        break
    fi

    # Download the page
    wget -k -E -P "$OUTPUT_DIR" "$PAGE_URL"
done

# Sleep for 2 seconds
sleep 2

# Change to the output directory
cd "$OUTPUT_DIR" || exit

# Sort and concatenate the HTML files
cat $(ls -v *.html) > together.html
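For reference, a run might look like this (assuming you saved the script as, say, get_thread.sh; the file name is just an example):

chmod +x get_thread.sh
./get_thread.sh
please enter URL for itsfoss thread
https://itsfoss.community/t/looking-for-a-special-linux-distro-for-installation-on-a-laptop/11273

Afterwards you should find the individual pages plus together.html in the itsfoss_thread directory.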
This script will automate the process and allow you to download multiple pages of the discussion without manually changing the URL for each page.
- The MAX_PAGES variable sets an initial maximum number of pages to check. You can set it to a generous value; the script will keep checking for pages and stop as soon as it reaches a non-existing page (i.e. the forum's last page).
- The script uses wget --spider to check whether a page exists before attempting to download it. If the page does not exist, the loop breaks and the script ends (see the small demo after this list).
- So the script should be able to determine the actual number of pages and download them without manual adjustments, even if you don't know in advance how many pages the discussion thread has.
- cat $(ls -v *.html) > together.html: here ls -v sorts the HTML files in natural numerical order, ensuring that they are concatenated correctly even when there are both one- and two-digit numbers in the filenames (also shown in the demo below).
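Here is the small demo mentioned above. It is only an illustration with made-up file names, run in a scratch directory, but it shows both the exit status of wget --spider and the difference between plain ls and ls -v:

# check whether a page exists (wget --spider checks without downloading):
wget -q --spider "https://itsfoss.community/t/looking-for-a-special-linux-distro-for-installation-on-a-laptop/11273?page=2" && echo "page exists" || echo "no such page"

# natural sort vs. plain sort (made-up file names):
touch page1.html page2.html page10.html
ls *.html      # -> page1.html page10.html page2.html  (10 sorts before 2)
ls -v *.html   # -> page1.html page2.html page10.html  (numerical order)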
I have to admit I've been discussing this particular point with ChatGPT. I doubt I would've come up with it all by myself.
I hope you may find the script useful.
For me it was a good exercise in bash scripting after a while.
Many greetings from Rosika
P.S.:
The script is applicable to itsfoss.community content which is publicly available.
If you want to download private discussions it would fail, I guess, as you'd have to provide your credentials to the wget command.
This may prove to be a problem…
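A possible workaround which I haven't tested: log in with your browser, export the forum cookies to a cookies.txt file (there are browser extensions for that), and hand that file to wget with its --load-cookies option. Roughly like this (the thread URL is just a placeholder):

wget --load-cookies cookies.txt -k -E "https://itsfoss.community/t/some-private-thread/12345"

Whether that plays nicely with the forum software is another question, so please take it as an idea rather than a tested recipe.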