Recommendation for good duplicate finder / cleaner

I am looking for a good duplicate file finder that will run in Linux, and that can read and locate duplicate files on external drives formatted with FAT32, NTFS, or Apple's HFS+ file systems.
I can do this job under Windows using Duplicate Cleaner, but I have not been able to find a good Linux equivalent.
I need the duplicate finder to be able to export the list of duplicate files that it finds as a csv/txt file.
It also needs to be able to handle filenames in HFS+ that have commas, slashes, colons, semicolons and other odd (normally illegal) characters in them.
I don’t use these crazy characters but I often have to clean up drives and delete duplicates for other people who do not obey good filename etiquette.
It needs to be able to handle filesystems that contain up to a million or so files.
There are some excellent duplicate file finders in Windows that do this job. One example is Duplicate Cleaner, available from here:
https://www.digitalvolcano.co.uk/index.html
However, I would like to be able to do this job in Linux Mint if it can be done as easily as in Windows. It is tasks like this that have prevented me from swapping entirely to Linux.

I use Windows and Linux, but there is one job that I do in Windows that I have not been able to find a good equivalent for in Linux.

Could you describe what features the duplicate finders you found for Linux lack?

No. Sorry. It would not be a good use of our time to describe what features the various Linux file finders lack.
What I am looking for is a recommendation for a Linux file finder that is able to do the things that I have described above.
If you can recommend one (or two) I will gladly try them out.

I was trying to understand which features you need that those duplicate finders do not offer. The features you requested seem pretty standard and should be handled by any duplicate finder. That’s why I asked what was lacking in the software you found.

https://www.tecmint.com/find-and-delete-duplicate-files-in-linux/


Thanks. I had a look at that link and I’ll try DupeGuru and FSLint, however I’m pretty sure I tried them both about a year ago and they did not satisfy my needs. I don’t remember now what their limitations were, but I think they could not read all of the files in an HFS+ file system if the filenames or folder names contained commas or slashes. I run into this problem often when I’m cleaning up old Mac Time Machine backups from Macs that have been disposed of, or different backups on different drives of the same Mac. They are difficult to deal with.
The most common need I have is when I have 8-10 old external USB drives that have various chaotic backup sets on them and I just need to check if all the files on (for example) drive A are also present on drive B, and if so then delete all files on Drive A if there is a copy of them on Drive B.
I need the duplicate finder to locate the duplicates and output the results as a csv or txt file.
I then use a macro that I have written to sort through the list of duplicates and delete the files from Drive A if there is a copy of the file on Drive B.
It is common for me to have results where I need to delete 500,000 files or more from one of the drives. It would take an impossibly long time to do that manually, even if I could delete one every 5 seconds or so. Therefore I need an absolutely bullet-proof, reliable, automatic method of looking through each group of files and selecting which files need to be deleted. My macro does this, even though it might take an hour or two.
I’ll try DupeGuru and FSLint again but I don’t think they were adequate for my needs.

I use a VBA macro that runs in Excel to generate a batch file of commands to delete the files that need to be deleted.
I use Excel and a VBA macro because I can write the results of every step of the comparison into cells on the Excel sheet and I can easily track that the macro is working properly.
Or if it goes off the rails I can troubleshoot it easily and find which filename caused it to fail. Failures can be caused, for example, by filenames that are entirely numeric with no extension, or folder names that contain commas, slashes, or backslashes.
Or sometimes I need to examine large filesystems on old RAID arrays from dead NT4 servers. It is tricky and they often go awry.
When I swap to Linux I’ll have to work out an alternative way of doing this job.

Thank you for the suggestion.

I think I tried FSLint about a year ago but ran into a problem with it, though I don’t remember exactly what the problem was.

Are you offering FSLint just as a possibility, or are you absolutely sure that it will satisfy the requirements I have described above?

I don’t mean to be obstructive or rude, but I have to ask this question because too many times I have had well-intentioned people suggest software that, when I try it, I find that it does not satisfy the basic needs that I have described, and it just wastes time. Then I wonder why that person even suggested it in the first place.

The whole point of asking a community of people is so that I can benefit from their experience and avoid a lot of trial and error myself.

So again, thank you for the suggestion, and are you sure that FSLint will do the things I described above?

Thank you

Michael


This is very interesting to me. I get update errors (‘repositories are configured multiple times’).
Tried rdfind first but couldn’t get along with it, so I’m going to work my way through the list and see if I can ‘find’ duplicates. Also looks like WINE has caused some issues, but that’s for a different thread.

Thanks for your reply.
A typical example of my needs is as follows.
I took over as manager of an organisation.
As I was looking around in the computer room I saw the main server running in its rack, but in the corner of the room, on a table, I saw two old NT4 servers that were not powered up, not plugged in to anything and not running. I asked our IT support guys about them but the IT guys said they were redundant, and that I did not need to be concerned about them because all the files on the old servers had been copied to the new server. They advised that the old servers should just be dismantled and thrown away.
And besides that, they said that they did not have the old NT4 Admin passwords, so they could not get the old servers to run even if they wanted to.

But I am a skeptical type and I don’t just “accept” answers like that without checking.
I powered up the newer of the two old servers, bypassed the NT4 password security system and copied all the files on it to an external drive. Then I ran a duplicate check program to list all duplicate files on that old server AND the new server.
Then I ran my own macro, which sorts through the groups of duplicates and, for every file on oldserver, deletes the copy on oldserver if a copy of it exists on newserver. Oldserver should then only have files on it that are NOT on newserver. Oldserver had about 600,000 files on it and newserver had about 750,000 files.
Therefore, after the deletion macro runs there should be no files still on oldserver except older copies of recently-edited files.
But that was not what I found. I found about 20,000 files left on oldserver that were not on newserver!!
I weeded out about 12,000 as unneeded junk (application install files, tmps, device drivers, etc.), and another 5,000 as old copies of recently edited files, but there were still about 3,000 files left. They turned out to be extremely important financial and operating data and reports that had not been copied from oldserver to newserver, despite the confident assurances from my IT support people that all files had been copied across.
These uncopied reports and financial files contained critical information that the financial future of this $200 M/year company depended on.
Hence my skepticism was justified, as was my need for a robust and reliable duplicate finder.
Duplicate finders are NOT just for liberating file space. They are extremely important forensic tools for examining large numbers of files.
