Hi all,
this is a follow-up tutorial to "I wrote a download script for itsfoss community content", which I published last November.
I've been working together with ChatGPT for three days (well, afternoons, actually) to concoct a script that caters for bulk-downloading selected ITSFoss forum content.
It's the umpteenth version of the script, and it seems to work reliably now.
In case anyone might be interested, I thought it would be a good idea to publish it here. Perhaps it will be of use to somebody else besides me.
Preliminary Notes:
- I created a dedicated bookmarks folder “att_läsa” in my Waterfox browser. This is where all the ITSFoss forum links go. Currently it holds 194 links, so the batch-download script comes in handy.
I saved the URL collection into a text file by following this procedure:
In Waterfox / Firefox: Settings —> Bookmarks —> Manage Bookmarks —> “att_läsa” —> copy.
I saved the resulting text file as zu_sichern.txt in a dedicated folder. Please adapt the path to your personal needs.
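For illustration, here is what a small zu_sichern.txt could look like (the URLs are taken from the sample run further below; blank lines and lines starting with # are ignored by the script):
# ITSFoss threads to save
https://itsfoss.community/t/issues-with-devuan4-to-devuan5-upgrade/11421
https://itsfoss.community/t/any-experience-with-lxle/4734

https://itsfoss.community/t/installing-a-home-server-any-advice/11860/11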
The script is adapted to my personal setup, i.e. paths and settings are for my system.
But you can easily adapt it to your own needs (see point 12, “Configurable paths and limits”, in the section below).
Aims and capabilities of the script:
Overview
This Bash script automates downloading, merging, and saving public ITSFoss Community forum threads (Discourse-powered) from a list of URLs.
It handles multiple links, cleans inputs, detects accessibility, merges paginated discussions, skips duplicates, and reports results neatly.
Key Features (Version 2 — “Stable Skip + Summary” Edition)
1. Batch processing
- Reads a list of URLs from a plain text file ($INPUT_FILE).
- Ignores blank lines and comments (# ...).
- Processes each URL sequentially with numbered progress display.
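In essence, the reading/filtering part of the full script below boils down to this small loop:
INPUT_FILE="/path/to/url-list.txt"   # adjust to your setup
while IFS= read -r raw_line || [ -n "$raw_line" ]; do
    url=$(echo "$raw_line" | tr -d '\r' | xargs)   # strip CR and surrounding spaces
    [[ -z "$url" || "$url" =~ ^# ]] && continue    # skip blank lines and comments
    echo "would process: $url"
done < "$INPUT_FILE"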
2. Automatic public-access detection
- Uses wget --spider to check if a page is publicly reachable before downloading.
- If it’s a private or lounge-restricted topic, it’s safely skipped and logged in the “failed/unreachable” list.
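The check itself is just a one-liner around wget --spider; for a single URL (one from the sample run below) it looks like this:
URL="https://itsfoss.community/t/any-experience-with-lxle/4734"
if wget -q --spider "$URL"; then
    echo "publicly reachable"
else
    echo "private or unreachable"   # the batch script logs such URLs and moves on
fi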
3. URL cleanup
- Automatically removes trailing post numbers such as /5 or /42.
→ Ensures it always downloads the entire thread, not just one post.
- Extracts the “slug” (thread title) and numeric ID to generate safe filenames.
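These are the same sed expressions the script uses; run on one of the sample URLs they give:
URL="https://itsfoss.community/t/installing-a-home-server-any-advice/11860/11"
CLEAN_URL=$(echo "$URL"  | sed -E 's|/[0-9]+$||')                  # drops the trailing /11
SLUG=$(echo "$CLEAN_URL" | sed -nE 's|.*/t/([^/]+)(/.*)?$|\1|p')   # installing-a-home-server-any-advice
ID=$(echo "$CLEAN_URL"   | sed -nE 's|.*/([0-9]+)$|\1|p')          # 11860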
4. Filename generation and sanitizing
- Creates human-readable filenames like: installing-a-home-server-any-advice.html
- Replaces unsafe characters with underscores.
- Ensures all output files are saved under the chosen destination directory.
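The sanitizing step is the same sed substitution used in the script; here is a hypothetical example with an awkward name (Discourse slugs are normally already URL-safe, so this is mainly a safety net):
BASE_NAME="installing a home-server: any advice?"
SAFE_NAME=$(echo "$BASE_NAME" | sed 's/[^a-zA-Z0-9._-]/_/g')
echo "$SAFE_NAME"   # installing_a_home-server__any_advice_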
5. Duplicate detection & skipping
- Before downloading, checks whether the final output file already exists.
- If so:
- It skips downloading again (to save bandwidth and time).
- Logs that event in the summary under “Skipped duplicates.”
- Keeps existing good copies untouched — ideal for incremental re-runs.
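The duplicate check is just a test on the target filename, as in this excerpt from the script (the variables are set earlier in the loop):
OUT_PATH="$OUTPUT_BASE/${FINAL_NAME}.html"
if [[ -f "$OUT_PATH" ]]; then
    echo "⚠️ Skipping duplicate (already exists): ${FINAL_NAME}.html"
fi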
6. Paginated thread handling
- Supports multi-page discussions automatically.
- Iteratively downloads up to $MAX_PAGES pages (default = 40).
- Stops when it encounters a non-existent page (clean stop condition).
- Each page is fetched with the correct ?page=N query parameter.
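Stripped down to its core (the full script additionally handles the case where even page 1 is missing), the pagination loop works like this, here with demo values:
CLEAN_URL="https://itsfoss.community/t/any-experience-with-lxle/4734"
TMP_DIR="/tmp/demo_thread"; mkdir -p "$TMP_DIR"
MAX_PAGES=40
for page in $(seq 1 "$MAX_PAGES"); do
    PAGE_URL="${CLEAN_URL}?page=${page}"
    wget -q --spider "$PAGE_URL" || break      # first missing page ends the loop
    echo " - Downloading page $page..."
    wget -q -k -E -P "$TMP_DIR" "$PAGE_URL"    # -E appends .html, -k converts links for offline viewing
done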
7. Automatic page merging
- After all pages are downloaded:
  - Sorts them in natural order (ls -v).
  - Concatenates them into one combined file zusammen.html (i.e. together.html).
  - Moves that merged file to the output directory with the final name.
- Deletes temporary files afterward.
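The merge step corresponds to these lines from the script ($TMP_DIR and $OUT_PATH are set earlier in the loop); it relies on the wget-generated filenames containing no spaces:
cd "$TMP_DIR" || exit 1
cat $(ls -v *.html) > zusammen.html   # ls -v keeps natural page order: 1, 2, ... 10
cp zusammen.html "$OUT_PATH"          # final name, e.g. any-experience-with-lxle.html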
8. Temporary directory management
- Creates a unique subdirectory for each thread under /tmp.
- Cleans it up automatically after successful (or failed) processing.
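The script derives the temp-dir name from the thread name plus the shell's PID ($$). If you prefer, mktemp -d achieves the same thing and guarantees a unique name; this is just an optional alternative, not what the script below does:
TMP_DIR=$(mktemp -d "${TMP_ROOT:-/tmp}/itsfoss_thread_XXXXXX")
echo "working in $TMP_DIR"
rm -rf "$TMP_DIR"   # clean up when done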
9. Progress display
- Shows a clear header:
🔹 Processing 5 of 8
URL: https://itsfoss.community/t/example/1234
- Displays per-page progress:
- Downloading page 1...
- Downloading page 2...
- No page 3 (end detected).
- Keeps the process transparent and easy to follow.
10. Comprehensive summary report
At the end of the run, the script prints:
✅ Saved files (5):
- <list of saved .html files>
⚠️ Skipped duplicates (1):
- <URL → filename>
⚠️ Failed / private / unreachable (2):
- <list of URLs>
This gives you a concise record of everything that succeeded, was skipped, or failed.
11. Safe to interrupt
- You can abort anytime with Ctrl+C.
- Partial temp data (in /tmp) is auto-cleaned; no risk to your system or other files.
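If you want to be extra sure the temp directory is removed even when you press Ctrl+C mid-download, a trap could be added near the top of the script (an optional extension, not part of the version below):
cleanup() { [[ -n "$TMP_DIR" && -d "$TMP_DIR" ]] && rm -rf "$TMP_DIR"; }
trap cleanup EXIT INT TERM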
12. Configurable paths and limits
Variables at the top can easily be adjusted:
INPUT_FILE="/path/to/url-list.txt"
OUTPUT_BASE="/path/to/save/htmls"
TMP_ROOT="/tmp"
MAX_PAGES=40
So you can change where URLs come from, where output goes, and how many pages to check.
13. Clean exit codes
- Returns 0 when everything runs normally (even if some pages were private).
- Any critical failure in setup (like a missing input file) returns non-zero for easy scripting.
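That makes the script easy to use from other automation, e.g. a small wrapper (notify-send is just an example and assumes libnotify is installed):
if ./download_from_itsfoss-c_from_list_version2.sh; then
    notify-send "ITSFoss backup finished"
else
    echo "Setup problem - check INPUT_FILE and OUTPUT_BASE" >&2
fi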
And here's the script:
#!/bin/bash
# ==================================================================
# ITSFoss batch download — stable version with duplicate-skip + summary
# Based on the previously tested working script (uses ?page=N)
# ==================================================================
INPUT_FILE="/media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/zu_erledigen/zu_sichern.txt"
OUTPUT_BASE="/media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov"
TMP_ROOT="/tmp"
MAX_PAGES=40
# --- initial checks ---
if [[ ! -f "$INPUT_FILE" ]]; then
echo "❌ Error: Input file not found: $INPUT_FILE"
exit 1
fi
mkdir -p "$OUTPUT_BASE"
TOTAL_URLS=$(grep -E -v '^(#|$)' "$INPUT_FILE" | wc -l)
COUNTER=0
declare -a FAILED_URLS
declare -a SKIPPED_DUPLICATES
declare -a SAVED_FILES
echo "=============================================================="
echo "📘 Starting batch download of $TOTAL_URLS ITSFoss threads..."
echo "=============================================================="
sleep 1
while IFS= read -r raw_line || [ -n "$raw_line" ]; do
# Trim CR and spaces
URL=$(echo "$raw_line" | tr -d '\r' | xargs)
# Skip empties / comments
[[ -z "$URL" || "$URL" =~ ^# ]] && continue
((COUNTER++))
echo
echo "--------------------------------------------------------------"
echo "🔹 Processing $COUNTER of $TOTAL_URLS"
echo "URL: $URL"
# Clean trailing /<postnumber> like /5
CLEAN_URL=$(echo "$URL" | sed -E 's|/[0-9]+$||')
# Try to extract slug and id:
SLUG=$(echo "$CLEAN_URL" | sed -nE 's|.*/t/([^/]+)(/.*)?$|\1|p')
ID=$(echo "$CLEAN_URL" | sed -nE 's|.*/([0-9]+)$|\1|p')
# Determine base name (prefer slug)
if [[ -z "$SLUG" || "$SLUG" =~ ^[0-9]+$ ]]; then
if [[ -n "$ID" ]]; then
BASE_NAME="$ID"
else
BASE_NAME=$(basename "$CLEAN_URL")
fi
else
BASE_NAME="$SLUG"
fi
# Sanitize filename
SAFE_NAME=$(echo "$BASE_NAME" | sed 's/[^a-zA-Z0-9._-]/_/g')
FINAL_NAME="$SAFE_NAME"
# If file already exists -> skip (duplicate)
OUT_PATH="$OUTPUT_BASE/${FINAL_NAME}.html"
if [[ -f "$OUT_PATH" ]]; then
echo "⚠️ Skipping duplicate (already exists): ${FINAL_NAME}.html"
SKIPPED_DUPLICATES+=("$URL -> ${FINAL_NAME}.html")
continue
fi
echo "➡️ Using filename: ${FINAL_NAME}.html"
# Check public accessibility
if ! wget -q --spider "$CLEAN_URL"; then
echo "⚠️ This page requires login or is not publicly accessible. Skipping..."
FAILED_URLS+=("$URL")
continue
fi
# Prepare temp dir for this thread
TMP_DIR="$TMP_ROOT/itfsoss_thread_${FINAL_NAME}_$$"
rm -rf "$TMP_DIR"
mkdir -p "$TMP_DIR"
echo "⬇️ Downloading thread pages..."
for page in $(seq 1 $MAX_PAGES); do
PAGE_URL="${CLEAN_URL}?page=${page}"
# If page doesn't exist, break
if ! wget -q --spider "$PAGE_URL"; then
if [[ $page -eq 1 ]]; then
# Unexpected: first page not found (shouldn't happen because we spider-tested CLEAN_URL)
echo "❌ Unexpected: page 1 not found. Skipping this URL."
FAILED_URLS+=("$URL")
else
echo " - No page $page (end detected)."
fi
break
fi
# Download page into temp dir
echo " - Downloading page $page..."
wget -q -k -E -P "$TMP_DIR" "$PAGE_URL"
done
# Merge pages (if any)
cd "$TMP_DIR" || { echo "❌ Could not access temp dir. Skipping."; FAILED_URLS+=("$URL"); rm -rf "$TMP_DIR"; continue; }
if ls *.html >/dev/null 2>&1; then
echo "🔁 Merging pages..."
cat $(ls -v *.html) > "zusammen.html"
cp "zusammen.html" "$OUT_PATH"
SAVED_FILES+=("$OUT_PATH")
echo "✅ Saved as: $OUT_PATH"
else
echo "⚠️ No HTML files were downloaded for this thread (possibly private)."
FAILED_URLS+=("$URL")
fi
# cleanup
cd ~
rm -rf "$TMP_DIR"
sleep 1
done < "$INPUT_FILE"
# === SUMMARY ===
echo
echo "=============================================================="
echo "🎉 Run complete. Processed $COUNTER of $TOTAL_URLS URLs."
echo "Files saved in: $OUTPUT_BASE"
echo "=============================================================="
if [[ ${#SAVED_FILES[@]} -gt 0 ]]; then
echo
echo "✅ Saved files (${#SAVED_FILES[@]}):"
for f in "${SAVED_FILES[@]}"; do
echo " - $f"
done
fi
if [[ ${#SKIPPED_DUPLICATES[@]} -gt 0 ]]; then
echo
echo "⚠️ Skipped duplicates (${#SKIPPED_DUPLICATES[@]}):"
for d in "${SKIPPED_DUPLICATES[@]}"; do
echo " - $d"
done
fi
if [[ ${#FAILED_URLS[@]} -gt 0 ]]; then
echo
echo "⚠️ Failed / private / unreachable (${#FAILED_URLS[@]}):"
for u in "${FAILED_URLS[@]}"; do
echo " - $u"
done
fi
echo
echo "=============================================================="
Here is a sample output I got from running the script with a reduced URL file, as an experiment:
./download_from_itsfoss-c_from_list_version2.sh
==============================================================
📘 Starting batch download of 8 ITSFoss threads...
==============================================================
--------------------------------------------------------------
🔹 Processing 1 of 8
URL: https://itsfoss.community/t/issues-with-devuan4-to-devuan5-upgrade/11421
➡️ Using filename: issues-with-devuan4-to-devuan5-upgrade.html
⬇️ Downloading thread pages...
- Downloading page 1...
- No page 2 (end detected).
🔁 Merging pages...
✅ Saved as: /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/issues-with-devuan4-to-devuan5-upgrade.html
--------------------------------------------------------------
🔹 Processing 2 of 8
URL: https://itsfoss.community/t/any-experience-with-lxle/4734
➡️ Using filename: any-experience-with-lxle.html
⬇️ Downloading thread pages...
- Downloading page 1...
- Downloading page 2...
- No page 3 (end detected).
🔁 Merging pages...
✅ Saved as: /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/any-experience-with-lxle.html
--------------------------------------------------------------
🔹 Processing 3 of 8
URL: https://itsfoss.community/t/kernel-panic/11447
➡️ Using filename: kernel-panic.html
⚠️ This page requires login or is not publicly accessible. Skipping...
--------------------------------------------------------------
🔹 Processing 4 of 8
URL: https://itsfoss.community/t/about-kernel-panic/11449
➡️ Using filename: about-kernel-panic.html
⚠️ This page requires login or is not publicly accessible. Skipping...
--------------------------------------------------------------
🔹 Processing 5 of 8
URL: https://itsfoss.community/t/installing-a-home-server-any-advice/11860/11
➡️ Using filename: installing-a-home-server-any-advice.html
⬇️ Downloading thread pages...
- Downloading page 1...
- Downloading page 2...
- Downloading page 3...
- No page 4 (end detected).
🔁 Merging pages...
✅ Saved as: /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/installing-a-home-server-any-advice.html
--------------------------------------------------------------
🔹 Processing 6 of 8
URL: https://itsfoss.community/t/mini-pc-of-your-recommendation-to-work-with-linux/11452/2
➡️ Using filename: mini-pc-of-your-recommendation-to-work-with-linux.html
⬇️ Downloading thread pages...
- Downloading page 1...
- Downloading page 2...
- Downloading page 3...
- No page 4 (end detected).
🔁 Merging pages...
✅ Saved as: /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/mini-pc-of-your-recommendation-to-work-with-linux.html
--------------------------------------------------------------
🔹 Processing 7 of 8
URL: https://itsfoss.community/t/peppermint-devuan-install-issues/11462/4
➡️ Using filename: peppermint-devuan-install-issues.html
⬇️ Downloading thread pages...
- Downloading page 1...
- Downloading page 2...
- No page 3 (end detected).
🔁 Merging pages...
✅ Saved as: /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/peppermint-devuan-install-issues.html
--------------------------------------------------------------
🔹 Processing 8 of 8
URL: https://itsfoss.community/t/installing-a-home-server-any-advice/11860/36
⚠️ Skipping duplicate (already exists): installing-a-home-server-any-advice.html
==============================================================
🎉 Run complete. Processed 8 of 8 URLs.
Files saved in: /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov
==============================================================
✅ Saved files (5):
- /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/issues-with-devuan4-to-devuan5-upgrade.html
- /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/any-experience-with-lxle.html
- /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/installing-a-home-server-any-advice.html
- /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/mini-pc-of-your-recommendation-to-work-with-linux.html
- /media/rosika/f14a27c2-0b49-4607-94ea-2e56bbf76fe1/DATEN-PARTITION/Dokumente/Ergänzungen_zu_Programmen/für_lynx/CLEAN/PRIVATE/DIR/prov/peppermint-devuan-install-issues.html
⚠️ Skipped duplicates (1):
- https://itsfoss.community/t/installing-a-home-server-any-advice/11860/36 -> installing-a-home-server-any-advice.html
⚠️ Failed / private / unreachable (2):
- https://itsfoss.community/t/kernel-panic/11447
- https://itsfoss.community/t/about-kernel-panic/11449
==============================================================
rosika@rosika-Lenovo-H520e ~/D/K/prov2> echo $status
0
Notes:
- As you can see, the summary says that 5 files were saved, as they are publicly available.
- There was one duplicate in my list. Only one occurrence was downloaded and saved.
- 2 items weren't dealt with, as they are either “lounge” or “private”.
That's it.
It was a very interesting project to work on in cooperation with ChatGPT.
Many greetings from Rosika