Enhanced Download Script for It's FOSS Community Topics

Hi all, :waving_hand:

this is a follow-up tutorial to my earlier post
"I wrote a download script for itsfoss community content", which I published last November.

I've been working together with ChatGPT for 3 days (well, afternoons, actually) to concoct a script that caters for bulk-downloading selected ITSFoss forum content.
It's the umpteenth version of the script, and it seems to work perfectly now. :wink:

In case anyone is interested, I thought it would be a good idea to publish it here. Perhaps it might be of use to somebody else besides me.

Preliminary Notes:

  1. I created a dedicated bookmarks folder “att_läsa” in my waterfox browser. This is where all the ITSFoss forum links go. Currently it holds 194 links. Therefore the batch-download script comes in handy. :wink:

I saved the URL collection into a text-file by following this procedure:
In waterfox / firefox: Settings —> Bookmarks —> Manage Bookmarks —> “att_läsa”
—> copy.

I saved the resulting text-file as zu_sichern.txt in a dedicated folder. Please adapt the path to your personal needs.

The script is adapted to my personal setup, i.e. paths and settings are for my system.
But you can easily adapt it to your own needs (see point 12, “Configurable paths and limits”, in the section below).

Aims and capabilities of the script:

:compass: Overview

This Bash script automates downloading, merging, and saving public ITSFoss Community forum threads (Discourse-powered) from a list of URLs.
It handles multiple links, cleans inputs, detects accessibility, merges paginated discussions, skips duplicates, and reports results neatly.

:gear: Key Features (Version 2 — “Stable Skip + Summary” Edition)

:receipt: 1. Batch processing

  • Reads a list of URLs from a plain text file ($INPUT_FILE).
  • Ignores blank lines and comments (# ...).
  • Processes each URL sequentially with numbered progress display.
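
For illustration, a url-list file could look like this (the URLs here are just made-up placeholders; real entries are simply the forum links copied from the bookmarks folder):

# tutorials I want to archive
https://itsfoss.community/t/example-topic/1234

https://itsfoss.community/t/another-example-topic/5678/3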

:globe_showing_europe_africa: 2. Automatic public-access detection

  • Uses wget --spider to check if a page is publicly reachable before downloading.
  • If it’s a private or lounge-restricted topic, it’s safely skipped and logged in the “failed/unreachable” list.
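
The check itself is nothing more than a quick spider request; as a stand-alone sketch (with a made-up URL):

wget -q --spider "https://itsfoss.community/t/example-topic/1234" && echo "publicly reachable" || echo "private or unreachable"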

:link: 3. URL cleanup

  • Automatically removes trailing post numbers such as /5 or /42.
    → Ensures it always downloads the entire thread, not just one post.
  • Extracts the “slug” (thread title) and numeric ID to generate safe filenames.
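
For quick reference, these are the corresponding lines from the script further down (e.g. .../t/installing-a-home-server-any-advice/11860/11 yields the slug installing-a-home-server-any-advice and the ID 11860):

# strip a trailing post number such as /5 or /42
CLEAN_URL=$(echo "$URL" | sed -E 's|/[0-9]+$||')
# extract the slug (thread title) and the numeric topic ID
SLUG=$(echo "$CLEAN_URL" | sed -nE 's|.*/t/([^/]+)(/.*)?$|\1|p')
ID=$(echo "$CLEAN_URL" | sed -nE 's|.*/([0-9]+)$|\1|p')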

:brick: 4. Filename generation and sanitizing

  • Creates human-readable filenames like:
    installing-a-home-server-any-advice.html
  • Replaces unsafe characters with underscores.
  • Ensures all output files are saved under the chosen destination directory.
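
The sanitizing itself is a single sed substitution (taken from the script below); everything outside a-z, A-Z, 0-9, dot, underscore and hyphen becomes an underscore:

SAFE_NAME=$(echo "$BASE_NAME" | sed 's/[^a-zA-Z0-9._-]/_/g')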

:package: 5. Duplicate detection & skipping

  • Before downloading, checks whether the final output file already exists.
  • If so:
    • It skips downloading again (to save bandwidth and time).
    • Logs that event in the summary under “Skipped duplicates.”
  • Keeps existing good copies untouched — ideal for incremental re-runs.
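
For reference, the duplicate check in the script below boils down to a simple file-existence test:

OUT_PATH="$OUTPUT_BASE/${FINAL_NAME}.html"
if [[ -f "$OUT_PATH" ]]; then
    echo "⚠️  Skipping duplicate (already exists): ${FINAL_NAME}.html"
    continue
fi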

:page_facing_up: 6. Paginated thread handling

  • Supports multi-page discussions automatically.
  • Iteratively downloads up to $MAX_PAGES pages (default = 40).
  • Stops when it encounters a non-existent page (clean stop condition).
  • Each page is fetched with the correct ?page=N query parameter.
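
In condensed form the pagination loop looks like this (the full script below additionally treats a missing page 1 as an error):

for page in $(seq 1 $MAX_PAGES); do
    PAGE_URL="${CLEAN_URL}?page=${page}"
    # stop at the first page that does not exist
    wget -q --spider "$PAGE_URL" || break
    wget -q -k -E -P "$TMP_DIR" "$PAGE_URL"
done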

:puzzle_piece: 7. Automatic page merging

  • After all pages are downloaded:
    • Sorts them in natural order (ls -v).
    • Concatenates them into one combined file zusammen.html (“zusammen” is German for “together”)
    • Moves that merged file to the output directory with the final name.
  • Deletes temporary files afterward.
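
The merge itself (run inside the temp dir) is just a natural-sort concatenation, as in the script below:

cat $(ls -v *.html) > zusammen.html   # page 1, page 2, ... in the right order
cp zusammen.html "$OUT_PATH"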

:broom: 8. Temporary directory management

  • Creates a unique subdirectory for each thread under /tmp.
  • Cleans it up automatically after successful (or failed) processing.

:stopwatch: 9. Progress display

  • Shows a clear header:
🔹 Processing 5 of 8
URL: https://itsfoss.community/t/example/1234
  • Displays per-page progress:
- Downloading page 1...
- Downloading page 2...
- No page 3 (end detected).
  • Keeps the process transparent and easy to follow.

:abacus: 10. Comprehensive summary report

At the end of the run, the script prints:

✅ Saved files (5):
   - <list of saved .html files>

⚠️ Skipped duplicates (1):
   - <URL → filename>

⚠️ Failed / private / unreachable (2):
   - <list of URLs>

This gives you a concise record of everything that succeeded, was skipped, or failed.

:fire_extinguisher: 11. Safe to interrupt

  • You can abort anytime with Ctrl+C.
  • Partial temp data (in /tmp) is auto-cleaned; no risk to your system or other files.

:toolbox: 12. Configurable paths and limits

Variables at the top can easily be adjusted:

INPUT_FILE="/path/to/url-list.txt"
OUTPUT_BASE="/path/to/save/htmls"
TMP_ROOT="/tmp"
MAX_PAGES=40

So you can change where URLs come from, where output goes, and how many pages to check.

:brain: 13. Clean exit codes

  • Returns 0 when everything runs normally (even if some pages were private).
  • Any critical failure in setup (like missing input file) returns non-zero for easy scripting.
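
So the script can be dropped into other automation; a minimal sketch, using the script name from the sample run further down:

if ./download_from_itsfoss-c_from_list_version2.sh; then
    echo "Batch run finished."
else
    echo "Setup problem, e.g. missing input file." >&2
fi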

And here's the script:

#!/bin/bash
# ==================================================================
#  ITSFoss batch download — stable version with duplicate-skip + summary
#  Based on the previously tested working script (uses ?page=N)
# ==================================================================

INPUT_FILE="/media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/zu_erledigen/zu_sichern.txt"
OUTPUT_BASE="/media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov"
TMP_ROOT="/tmp"
MAX_PAGES=40

# --- initial checks ---
if [[ ! -f "$INPUT_FILE" ]]; then
    echo "❌ Error: Input file not found: $INPUT_FILE"
    exit 1
fi
mkdir -p "$OUTPUT_BASE"

TOTAL_URLS=$(grep -E -v '^(#|$)' "$INPUT_FILE" | wc -l)
COUNTER=0
declare -a FAILED_URLS
declare -a SKIPPED_DUPLICATES
declare -a SAVED_FILES

echo "=============================================================="
echo "📘 Starting batch download of $TOTAL_URLS ITSFoss threads..."
echo "=============================================================="
sleep 1

while IFS= read -r raw_line || [ -n "$raw_line" ]; do
    # Trim CR and spaces
    URL=$(echo "$raw_line" | tr -d '\r' | xargs)

    # Skip empties / comments
    [[ -z "$URL" || "$URL" =~ ^# ]] && continue

    ((COUNTER++))
    echo
    echo "--------------------------------------------------------------"
    echo "🔹 Processing $COUNTER of $TOTAL_URLS"
    echo "URL: $URL"

    # Clean trailing /<postnumber> like /5
    CLEAN_URL=$(echo "$URL" | sed -E 's|/[0-9]+$||')

    # Try to extract slug and id:
    SLUG=$(echo "$CLEAN_URL" | sed -nE 's|.*/t/([^/]+)(/.*)?$|\1|p')
    ID=$(echo "$CLEAN_URL" | sed -nE 's|.*/([0-9]+)$|\1|p')

    # Determine base name (prefer slug)
    if [[ -z "$SLUG" || "$SLUG" =~ ^[0-9]+$ ]]; then
        if [[ -n "$ID" ]]; then
            BASE_NAME="$ID"
        else
            BASE_NAME=$(basename "$CLEAN_URL")
        fi
    else
        BASE_NAME="$SLUG"
    fi

    # Sanitize filename
    SAFE_NAME=$(echo "$BASE_NAME" | sed 's/[^a-zA-Z0-9._-]/_/g')
    FINAL_NAME="$SAFE_NAME"

    # If file already exists -> skip (duplicate)
    OUT_PATH="$OUTPUT_BASE/${FINAL_NAME}.html"
    if [[ -f "$OUT_PATH" ]]; then
        echo "⚠️  Skipping duplicate (already exists): ${FINAL_NAME}.html"
        SKIPPED_DUPLICATES+=("$URL -> ${FINAL_NAME}.html")
        continue
    fi

    echo "➡️  Using filename: ${FINAL_NAME}.html"

    # Check public accessibility
    if ! wget -q --spider "$CLEAN_URL"; then
        echo "⚠️  This page requires login or is not publicly accessible. Skipping..."
        FAILED_URLS+=("$URL")
        continue
    fi

    # Prepare temp dir for this thread
    TMP_DIR="$TMP_ROOT/itfsoss_thread_${FINAL_NAME}_$$"
    rm -rf "$TMP_DIR"
    mkdir -p "$TMP_DIR"

    echo "⬇️  Downloading thread pages..."
    for page in $(seq 1 $MAX_PAGES); do
        PAGE_URL="${CLEAN_URL}?page=${page}"

        # If page doesn't exist, break
        if ! wget -q --spider "$PAGE_URL"; then
            if [[ $page -eq 1 ]]; then
                # Unexpected: first page not found (shouldn't happen because we spider-tested CLEAN_URL)
                echo "❌ Unexpected: page 1 not found. Skipping this URL."
                FAILED_URLS+=("$URL")
            else
                echo "   - No page $page (end detected)."
            fi
            break
        fi

        # Download page into temp dir
        echo "   - Downloading page $page..."
        wget -q -k -E -P "$TMP_DIR" "$PAGE_URL"
    done

    # Merge pages (if any)
    cd "$TMP_DIR" || { echo "❌ Could not access temp dir. Skipping."; FAILED_URLS+=("$URL"); rm -rf "$TMP_DIR"; continue; }
    if ls *.html >/dev/null 2>&1; then
        echo "🔁 Merging pages..."
        cat $(ls -v *.html) > "zusammen.html"
        cp "zusammen.html" "$OUT_PATH"
        SAVED_FILES+=("$OUT_PATH")
        echo "✅ Saved as: $OUT_PATH"
    else
        echo "⚠️  No HTML files were downloaded for this thread (possibly private)."
        FAILED_URLS+=("$URL")
    fi

    # cleanup
    cd ~
    rm -rf "$TMP_DIR"
    sleep 1

done < "$INPUT_FILE"

# === SUMMARY ===
echo
echo "=============================================================="
echo "🎉 Run complete. Processed $COUNTER of $TOTAL_URLS URLs."
echo "Files saved in: $OUTPUT_BASE"
echo "=============================================================="

if [[ ${#SAVED_FILES[@]} -gt 0 ]]; then
    echo
    echo "✅ Saved files (${#SAVED_FILES[@]}):"
    for f in "${SAVED_FILES[@]}"; do
        echo "   - $f"
    done
fi

if [[ ${#SKIPPED_DUPLICATES[@]} -gt 0 ]]; then
    echo
    echo "⚠️ Skipped duplicates (${#SKIPPED_DUPLICATES[@]}):"
    for d in "${SKIPPED_DUPLICATES[@]}"; do
        echo "   - $d"
    done
fi

if [[ ${#FAILED_URLS[@]} -gt 0 ]]; then
    echo
    echo "⚠️ Failed / private / unreachable (${#FAILED_URLS[@]}):"
    for u in "${FAILED_URLS[@]}"; do
        echo "   - $u"
    done
fi

echo
echo "=============================================================="

Here is a sample output I got from running the script taking a reduced URL-file as an experiment:

./download_from_itsfoss-c_from_list_version2.sh 
==============================================================
📘 Starting batch download of 8 ITSFoss threads...
==============================================================

--------------------------------------------------------------
🔹 Processing 1 of 8
URL: https://itsfoss.community/t/issues-with-devuan4-to-devuan5-upgrade/11421
➡️  Using filename: issues-with-devuan4-to-devuan5-upgrade.html
⬇️  Downloading thread pages...
   - Downloading page 1...
   - No page 2 (end detected).
🔁 Merging pages...
✅ Saved as: /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/issues-with-devuan4-to-devuan5-upgrade.html

--------------------------------------------------------------
🔹 Processing 2 of 8
URL: https://itsfoss.community/t/any-experience-with-lxle/4734
➡️  Using filename: any-experience-with-lxle.html
⬇️  Downloading thread pages...
   - Downloading page 1...
   - Downloading page 2...
   - No page 3 (end detected).
🔁 Merging pages...
✅ Saved as: /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/any-experience-with-lxle.html

--------------------------------------------------------------
🔹 Processing 3 of 8
URL: https://itsfoss.community/t/kernel-panic/11447
➡️  Using filename: kernel-panic.html
⚠️  This page requires login or is not publicly accessible. Skipping...

--------------------------------------------------------------
🔹 Processing 4 of 8
URL: https://itsfoss.community/t/about-kernel-panic/11449
➡️  Using filename: about-kernel-panic.html
⚠️  This page requires login or is not publicly accessible. Skipping...

--------------------------------------------------------------
🔹 Processing 5 of 8
URL: https://itsfoss.community/t/installing-a-home-server-any-advice/11860/11
➡️  Using filename: installing-a-home-server-any-advice.html
⬇️  Downloading thread pages...
   - Downloading page 1...
   - Downloading page 2...
   - Downloading page 3...
   - No page 4 (end detected).
🔁 Merging pages...
✅ Saved as: /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/installing-a-home-server-any-advice.html

--------------------------------------------------------------
🔹 Processing 6 of 8
URL: https://itsfoss.community/t/mini-pc-of-your-recommendation-to-work-with-linux/11452/2
➡️  Using filename: mini-pc-of-your-recommendation-to-work-with-linux.html
⬇️  Downloading thread pages...
   - Downloading page 1...
   - Downloading page 2...
   - Downloading page 3...
   - No page 4 (end detected).
🔁 Merging pages...
✅ Saved as: /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/mini-pc-of-your-recommendation-to-work-with-linux.html

--------------------------------------------------------------
🔹 Processing 7 of 8
URL: https://itsfoss.community/t/peppermint-devuan-install-issues/11462/4
➡️  Using filename: peppermint-devuan-install-issues.html
⬇️  Downloading thread pages...
   - Downloading page 1...
   - Downloading page 2...
   - No page 3 (end detected).
🔁 Merging pages...
✅ Saved as: /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/peppermint-devuan-install-issues.html

--------------------------------------------------------------
🔹 Processing 8 of 8
URL: https://itsfoss.community/t/installing-a-home-server-any-advice/11860/36
⚠️  Skipping duplicate (already exists): installing-a-home-server-any-advice.html

==============================================================
🎉 Run complete. Processed 8 of 8 URLs.
Files saved in: /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov
==============================================================

✅ Saved files (5):
   - /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/issues-with-devuan4-to-devuan5-upgrade.html
   - /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/any-experience-with-lxle.html
   - /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/installing-a-home-server-any-advice.html
   - /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/mini-pc-of-your-recommendation-to-work-with-linux.html
   - /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/peppermint-devuan-install-issues.html

⚠️ Skipped duplicates (1):
   - https://itsfoss.community/t/installing-a-home-server-any-advice/11860/36 -> installing-a-home-server-any-advice.html

⚠️ Failed / private / unreachable (2):
   - https://itsfoss.community/t/kernel-panic/11447
   - https://itsfoss.community/t/about-kernel-panic/11449

==============================================================
rosika@rosika-Lenovo-H520e ~/D/K/prov2> echo $status
0

Notes:

  • As you can see, the summary says that 5 files were saved, as they are publicly available.
  • There was one duplicate in my list. Only one occurrence was downloaded and saved.
  • 2 items weren't dealt with, as they are either “lounge” or “private”.

That's it. :wink:
It was a very interesting project to work on in cooperation with ChatGPT.

Many greetings from Rosika :slightly_smiling_face:

11 Likes

Hi Rosika,
That is a huge effort.
I shall be giving it a try, and I hope others do too.
The forum is indeed becoming an information source … people will want to store bits that are important to them.

And a reminder to everyone… don't forget the search ability of Discourse… it functions well.

Regards
Neville

3 Likes

Hi Neville, :waving_hand:

thank you very much for your kind feedback. :heart:

I'm really glad you find the topic interesting.
By changing the configurable paths and limits at the beginning of the script, it should work for you and everyone else.

Thanks for pointing it out.

Many greetings from Rosika :slightly_smiling_face:

3 Likes

Disclaimer: This is not a rant.

@Rosika Sorry, but I didn’t get the point, even if I admire your great script!

What is the purpose (or advantage) of grabbing entire threads from a forum? This reminds me of the past, when internet connection time was expensive and limited, and content was downloaded for offline reading. Couldn’t something like this be realized with RSS feeds?

BTW, the forum owners might dislike the script-driven grabbing. Generally, a forum is made for interactive use.

3 Likes

@abu :

Hi Alfred, :waving_hand:

thanks for your feedback.

Well, as I have pointed out in my preceding post which I referred to earlier:

if you're anything like me, you might want to download some selected pages from the itsfoss forum for archiving purposes (like tutorials or what-have-you).

It's for selected threads that you may take a particular interest in.
I guess I should have made it clear in the beginning:

As my internet connection is established via a web-stick (mobile network), I'm on a metered connection. So saving a bit of data when possible would be nice.

Apart from that, having a personal archive of what's of interest to you may be a bit of a time saver.

I use my falkon browser in a firejailed sandbox without internet connection and created a dedicated context-menu entry for it in thunar.
This way, accessing those archived threads is very easy.
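
Just to illustrate the idea (details may differ): such a Thunar custom action can call something along the lines of

firejail --net=none falkon %f

where firejail's --net=none switch disables all networking for the sandboxed browser and %f is Thunar's placeholder for the selected file.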

Really :red_question_mark: :astonished_face:

Does anyone else feel that way? That would make me a bit sad. :pensive_face:
It might be a discussion in its own right.

All the pages my script deals with are publicly available anyway.
Private and lounge content isn't catered for by the script.

To sum up:

the script is for personal archiving only and I hope nobody would be offended by it.

Many greetings from Rosika :slightly_smiling_face:

2 Likes

I’m in no way offended by it, and I understand your goals more clearly now. Thank you!

About the grabbing: It might depend on how often you run your script. My remark on this was of a more general nature.

4 Likes

@abu :

Hi Alfred, :waving_hand:

thanks for your reply.

I'm relieved you're not offended by me or the script in any way. :heart:

Yes, quite so.
As pointed out at the beginning, I have 194 links saved in my dedicated waterfox bookmarks folder, representing all those topics that are of special interest to me.
I never found the time to archive them manually, so the bookmark folder kept filling up. :wink:

When I eventually wanted to deal with them, it seemed to be a huge task. So I was looking for some sort of automation.

The script aims to deal with this situation.

Many greetings from Rosika :slightly_smiling_face:

2 Likes

Hey Rosika,
very cool script here!
I hope you don't mind if I post an updated version of the script here; I added saving to the home folder and wget checking/installing.

#!/bin/bash
# ==================================================================
#  ITSFoss batch download — stable version with duplicate-skip + summary
#  Based on the previously tested working script (uses ?page=N)
# ==================================================================

INPUT_FILE="$HOME/itsfoss-download.txt"
OUTPUT_BASE="$HOME/Downloads"
TMP_ROOT="/tmp"
MAX_PAGES=40

if ! which wget >/dev/null; then
    echo "wget is not installed. The script will now try to install wget using a package manager."
    if command -v apt >/dev/null; then
        echo "Installing wget using apt..."
        sudo apt update
        sudo apt install -y wget
    elif command -v dnf >/dev/null; then
        echo "Installing wget using dnf..."
        sudo dnf install -y wget
    elif command -v zypper >/dev/null; then
        echo "Installing wget using zypper..."
        sudo zypper install wget
    elif command -v emerge >/dev/null; then
        echo "Installing wget using portage/emerge..."
        sudo emerge wget
    elif command -v pacman >/dev/null; then
        echo "Installing wget using pacman..."
        sudo pacman -Sy wget
    elif command -v apk >/dev/null; then
        echo "Installing wget using Alpine apk..."
        sudo apk add wget
    else
        echo "No package manager found. Install or build wget first before rerunning."
        exit 1
    fi
fi
	    
# --- initial checks ---
if [[ ! -f "$INPUT_FILE" ]]; then
    echo "❌ Error: Input file not found: $INPUT_FILE"
    exit 1
fi
mkdir -p "$OUTPUT_BASE"

TOTAL_URLS=$(grep -E -v '^(#|$)' "$INPUT_FILE" | wc -l)
COUNTER=0
declare -a FAILED_URLS
declare -a SKIPPED_DUPLICATES
declare -a SAVED_FILES

echo "=============================================================="
echo "📘 Starting batch download of $TOTAL_URLS It's FOSS threads..."
echo "=============================================================="
sleep 1

while IFS= read -r raw_line || [ -n "$raw_line" ]; do
    # Trim CR and spaces
    URL=$(echo "$raw_line" | tr -d '\r' | xargs)

    # Skip empties / comments
    [[ -z "$URL" || "$URL" =~ ^# ]] && continue

    ((COUNTER++))
    echo
    echo "--------------------------------------------------------------"
    echo "🔹 Processing $COUNTER of $TOTAL_URLS"
    echo "URL: $URL"

    # Clean trailing /<postnumber> like /5
    CLEAN_URL=$(echo "$URL" | sed -E 's|/[0-9]+$||')

    # Try to extract slug and id:
    SLUG=$(echo "$CLEAN_URL" | sed -nE 's|.*/t/([^/]+)(/.*)?$|\1|p')
    ID=$(echo "$CLEAN_URL" | sed -nE 's|.*/([0-9]+)$|\1|p')

    # Determine base name (prefer slug)
    if [[ -z "$SLUG" || "$SLUG" =~ ^[0-9]+$ ]]; then
        if [[ -n "$ID" ]]; then
            BASE_NAME="$ID"
        else
            BASE_NAME=$(basename "$CLEAN_URL")
        fi
    else
        BASE_NAME="$SLUG"
    fi

    # Sanitize filename
    SAFE_NAME=$(echo "$BASE_NAME" | sed 's/[^a-zA-Z0-9._-]/_/g')
    FINAL_NAME="$SAFE_NAME"

    # If file already exists -> skip (duplicate)
    OUT_PATH="$OUTPUT_BASE/${FINAL_NAME}.html"
    if [[ -f "$OUT_PATH" ]]; then
        echo "⚠️  Skipping duplicate (already exists): ${FINAL_NAME}.html"
        SKIPPED_DUPLICATES+=("$URL -> ${FINAL_NAME}.html")
        continue
    fi

    echo "➡️  Using filename: ${FINAL_NAME}.html"

    # Check public accessibility
    if ! /usr/bin/wget -q --spider "$CLEAN_URL"; then
        echo "⚠️  This page requires login or is not publicly accessible. Skipping..."
        FAILED_URLS+=("$URL")
        continue
    fi

    # Prepare temp dir for this thread
    TMP_DIR="$TMP_ROOT/itfsoss_thread_${FINAL_NAME}_$$"
    rm -rf "$TMP_DIR"
    mkdir -p "$TMP_DIR"

    echo "⬇️  Downloading thread pages..."
    for page in $(seq 1 $MAX_PAGES); do
        PAGE_URL="${CLEAN_URL}?page=${page}"

        # If page doesn't exist, break
        if ! /usr/bin/wget -q --spider "$PAGE_URL"; then
            if [[ $page -eq 1 ]]; then
                # Unexpected: first page not found (shouldn't happen because we spider-tested CLEAN_URL)
                echo "❌ Unexpected error: page 1 not found. Skipping this URL."
                FAILED_URLS+=("$URL")
            else
                echo "   - No page $page (end detected)."
            fi
            break
        fi

        # Download page into temp dir
        echo "   - Downloading page $page..."
        wget -q -k -E -P "$TMP_DIR" "$PAGE_URL"
    done

    # Merge pages (if any)
    cd "$TMP_DIR" || { echo "❌ Could not access temp dir. Skipping."; FAILED_URLS+=("$URL"); rm -rf "$TMP_DIR"; continue; }
    if ls *.html >/dev/null 2>&1; then
        echo "🔁 Merging pages..."
        cat $(ls -v *.html) > "zusammen.html"
        cp "zusammen.html" "$OUT_PATH"
        SAVED_FILES+=("$OUT_PATH")
        echo "✅ Saved as: $OUT_PATH"
    else
        echo "⚠️  No HTML files were downloaded for this thread (possibly private)."
        FAILED_URLS+=("$URL")
    fi

    # cleanup
    cd ~
    rm -rf "$TMP_DIR"
    sleep 1

done < "$INPUT_FILE"

# === SUMMARY ===
echo
echo "=============================================================="
echo "🎉 Run complete. Processed $COUNTER of $TOTAL_URLS URLs."
echo "Files saved in: $OUTPUT_BASE"
echo "=============================================================="

if [[ ${#SAVED_FILES[@]} -gt 0 ]]; then
    echo
    echo "✅ Saved files (${#SAVED_FILES[@]}):"
    for f in "${SAVED_FILES[@]}"; do
        echo "   - $f"
    done
fi

if [[ ${#SKIPPED_DUPLICATES[@]} -gt 0 ]]; then
    echo
    echo "⚠️ Skipped duplicates (${#SKIPPED_DUPLICATES[@]}):"
    for d in "${SKIPPED_DUPLICATES[@]}"; do
        echo "   - $d"
    done
fi

if [[ ${#FAILED_URLS[@]} -gt 0 ]]; then
    echo
    echo "⚠️ Failed / private / unreachable (${#FAILED_URLS[@]}):"
    for u in "${FAILED_URLS[@]}"; do
        echo "   - $u"
    done
fi

echo
echo "=============================================================="

3 Likes

I archive my stuff on my self-hosted Wallabag instance; thus, my bookmark folder isn’t involved.
But my process could sometimes really benefit from some automation. :wink:

3 Likes

SCNR: Installing the required toolset doesn’t belong in this script. This disregards the single-responsibility principle.
Sorry, just my 2 cents.

2 Likes

That is unfortunately still reality for many parts of the world. People still obtain Linux on CDs or on flash drives.

2 Likes

I really wasn’t aware of this fact. You have to do things differently when all you have is mobile internet.

3 Likes

I used to be a member of another Discourse forum 10 years ago… it was focussed on NextThing CHIP and PocketCHIP… But 'cause they ran Debian Jessie - Linux was often a focus of that forum…

When the company folded - the Discourse forum went away about 3-6 months later…

I think someone managed to grab an archive of it… But even that has had to re-host a few times and I don’t recall if it’s still available as an archive…

But as you can see from my example - scraping content from a forum does serve a use case.

BTW - I’ve never used “RSS Feeds” and I don’t really understand the whole concept of that…

Great script @Rosika - one of these days I’ll have a look at it…

2 Likes

That sort of thing used to be the story of my life. Every time a mainframe changed there was a massive file conversion activity. Inevitably some things were lost.
Storage media were forever changing… cards, 7-track tape, 9-track tape, cassette tape, Exabyte tape, CD, DVD, Blu-ray, flash drive, USB drive, pluggable SATA, internet, …
You can't rely on any internet site or any technology staying put forever. You have to hedge.

I keep some of my more informative topics/replies on my GitHub site. I keep them in markdown… because I have the source. I don't think you can recover markdown from the itsfoss forum, so I think @Rosika 's HTML is a reasonable choice.

2 Likes

Yes, for keeping particular posts containing essential information. But entire threads…

I won’t do it without them. I’m reading tons of blogs and news sites via RSS. You might give it a try.

1 Like

Hi everybody, :waving_hand:

thank you so much for your kind feedback. I was delighted to see so many responses. :heart:

@George1 :

Of course. Feel free to alter anything to your liking.

Anyone using the script would have to adapt the INPUT_FILE and
OUTPUT_BASE variables.
Your approach:

INPUT_FILE="$HOME/itsfoss-download.txt"
OUTPUT_BASE="$HOME/Downloads"

makes a lot of sense.

I have to admit I haven't heard of the command command before. :wink:
I looked it up in the man pages.
Amazing. You never stop learning. Thanks George.
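
For anyone else who hasn't come across it: command -v is the portable (POSIX) way to test whether a program is available, for example:

if command -v wget >/dev/null 2>&1; then
    echo "wget is available"
else
    echo "wget is missing" >&2
fi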

@abu :

Sounds cool, but it was new to me. I searched for it and found:

Wallabag is a self-hosted application that allows users to save web pages for later reading. It is open source and free, enabling users to classify and access their saved articles on various devices.

O.K., this sounds like a very professional tool. You can certainly work well with it. Thanks for the info.

@nevj :

Thank you for bringing it to our attention, Neville.

Thanks. I'm glad you see it this way.

@daniel.m.tripp :

Thanks for your praise and for the additional information.

Yes, that was a good example, taken from real life. :+1:

Thanks again to all.

Many greetings from Rosika :slightly_smiling_face:

3 Likes

Yes, it’s a typical LAMP application that can store entire pages, including pics, but without any clutter (ads). Similar to Pocket or ReadItLater. There is also a mobile app for reading on the road.

2 Likes

I did some vibe coding with Copilot the last couple weekends and whipped up an app that uses RSS feeds. It shows them in a grid with a clickable link to the original article. It uses a responsive design to change the grid depending on screen size and works well on mobile. It has a dark or light theme button, a refresh button, and a pulldown to filter the RSS feeds shown.

One of the reasons I’ve wanted to do this for a while is to filter the Tech News I read. There are so many stories I am not interested in seeing but am hit with constantly. I added a filter for “Apple” and “iPhone” but can see other things I might want to filter too.

It is running on my old Dell XPS Studio with the original Intel Core i7 processor and exposed to the web through a Tailscale funnel, so it might not be up and running all the time.

Please check it out here:

3 Likes

All feed readers I know are capable of filtering. So there is no need to write my own.

2 Likes

Where’s the fun in that? :wink:

4 Likes