Check Duplicates on lots of files

Discussion & Support for xplorer² professional

Moderators: fgagnon, nikos, Site Mods

tutils
Member
Posts: 41
Joined: 2005 Jan 25, 13:15

Check Duplicates on lots of files

Post by tutils »

I'm adding about 215k files to the scrap container and running the Duplicates Checker, based on content. After a (long) while, the operation ends, but no files are selected, although I know for sure there are some duplicates. Running the same operation on fewer files works successfully.
nikos
Site Admin
Posts: 16401
Joined: 2002 Feb 07, 15:57
Location: UK

Post by nikos »

215K files? nice one!
what's the free memory then?

unless you are running out of memory i can't think of any reason why the dupechecker would fail. What's the message you're getting upon completion (if you miss it you can double-check with Help | Last error)

Note that duplicates aren't selected unless you check the relevant option in the dialog
tutils
Member
Posts: 41
Joined: 2005 Jan 25, 13:15

Post by tutils »

Thanks for the fast reply. I did what you said; here are my findings.

While the operation is running, x2 takes up to 417MB of memory (7 threads). I have 1GB installed on this computer, and overall about 850MB are used by all applications (including x2 when it is inflated to 417MB). When the operation ends, no error message appears. The Help -> Last Error option does nothing. It's like the thread that does the comparison just dies unexpectedly.

I also noticed that this same behavior occurs when doing size-based comparison, so the issue must be the large number of files.

I found a workaround: I add files to the scrap in batches using a size filter, so fewer files are checked at a time, and repeat until all files have passed through the checker. So the problem is no longer blocking me, though the bug still exists. I'll gladly help you debug it if needed.
tutils
Member
Posts: 41
Joined: 2005 Jan 25, 13:15

Post by tutils »

If I may, I would like to add a suggestion. I noticed that the comparison operation runs in two stages: first it calculates a hash signature for all the files, and then it compares the signatures.

It would be nice, and a real time-saver, to add an option to cache the hashing results. That way, if I run the dupcheck on the same files again, the first stage can be skipped.
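A minimal sketch of the kind of cache meant here, in C++ (all names are illustrative, not from x2): a hash is reused only while the file's size and last-modified time are unchanged, which is the usual way to invalidate stale entries.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical cache entry: the stored hash is valid only while the
// file's size and last-modified time match what was recorded.
struct CacheEntry {
    uint64_t size;
    uint64_t mtime;
    uint32_t hash;
};

// Hypothetical hash cache keyed by full path.
class HashCache {
    std::map<std::string, CacheEntry> entries;
public:
    // Returns true and fills 'hash' on a cache hit; a changed size or
    // mtime counts as a miss, forcing a re-hash.
    bool Lookup(const std::string& path, uint64_t size, uint64_t mtime,
                uint32_t& hash) const {
        auto it = entries.find(path);
        if (it == entries.end()) return false;
        if (it->second.size != size || it->second.mtime != mtime) return false;
        hash = it->second.hash;
        return true;
    }
    void Store(const std::string& path, uint64_t size, uint64_t mtime,
               uint32_t hash) {
        entries[path] = CacheEntry{size, mtime, hash};
    }
};
```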

Another idea, not sure if you're already doing it: when running a content-based comparison, first run a size-based comparison, and then calculate the hash only for files of equal size. This would spare calculating hashes for files that obviously can't match.
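The pre-filter idea can be sketched in a few lines of C++ (the `FileInfo` record and function names are made up for illustration): group files by size and keep only groups with two or more members, since a duplicate must share its size with at least one other file.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Illustrative record: path plus size; hashes would be computed later,
// and only for the surviving candidates.
struct FileInfo {
    std::string path;
    uint64_t size;
};

// Bucket files by size and return only the buckets with two or more
// members -- the only files worth hashing.
std::vector<std::vector<FileInfo>>
CandidateGroups(const std::vector<FileInfo>& files) {
    std::map<uint64_t, std::vector<FileInfo>> bySize;
    for (const auto& f : files)
        bySize[f.size].push_back(f);
    std::vector<std::vector<FileInfo>> groups;
    for (auto& [size, group] : bySize)
        if (group.size() > 1)
            groups.push_back(std::move(group));
    return groups;
}
```

With 215k mostly distinct-sized files, this throws away the bulk of the hashing work up front.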

While we're at it, the scrap window seems to hang while calculating the hash signatures (at least with a lot of files), taking the main window with it.

BTW, my operating system is XP SP2.

x2 Rocks! 8)
nikos
Site Admin
Posts: 16401
Joined: 2002 Feb 07, 15:57
Location: UK

Post by nikos »

When the operation ends, no error message appears
so what's the state of the window then? Do you see any of these bands that group duplicate files? (also check for the filter indicator on the status bar)
add an option to cache the hashing process result
this would most definitely be the case if x2 were a tool dedicated to finding duplicate files, but i don't think it fits within the present file manager viewpoint, where checksums are merely another kind of file attribute
first run a size-based comparison, and then calculate the hash only of the files with similar size
i thought i was doing that but it turns out that i don't!
now i've changed the program so that when you check "content" it also checks "size" so this optimization is done automatically
thanks for the tip!


ps the duplicates checker is running at full foreground mode, no threads are involved. If you want to continue with file management you can clone another instance (main window menu) before you start the check for duplicates
narayan
Platinum Member
Posts: 1430
Joined: 2002 Jun 04, 07:01

Post by narayan »

Tutils, does the DupChecker work with a small collection having known duplicates? Try creating some duplicates (select some items and use the Edit | Duplicate menu, then rename the duplicates if you want).

Sometimes the failure to detect is due to a wrong method or setting, so it is better to rule that out first!
tutils
Member
Posts: 41
Joined: 2005 Jan 25, 13:15

Post by tutils »

nikos wrote:what's the state of the window then? Do you see any of these bands that group duplicate files? (also check for the filter indicator on the status bar)
No bands, no filter indicator.
nikos wrote:i don't think it fits within the present file manager viewpoint where checksums are merely another kind of file attribute
Are you using a plain checksum for the comparison, or a more advanced hashing algorithm (like MD5, SHA-1, etc.)?
nikos wrote:i thought i was doing that but it turns out that i don't!
now i've changed the program so that when you check "content" it also checks "size" so this optimization is done automatically
thanks for the tip!
Excellent! I'm glad to help such a wonderful creation. :D
nikos wrote:the duplicates checker is running at full foreground mode, no threads are involved. If you want to continue with file management you can clone another instance (main window menu) before you start the check for duplicates
In my opinion, you should use more threads (more than just one "helper" thread) -- this is the true strength of Windows as a multitasking operating system, so why not take advantage of it?
narayan wrote:does the DupChecker work with small collection having known duplicates?
Yes, as I posted earlier, cutting down the number of files to compare results in a successful operation.
nikos
Site Admin
Posts: 16401
Joined: 2002 Feb 07, 15:57
Location: UK

Post by nikos »

the checksum is a plain one, just an addition
less representative but faster to calculate!
In my opinion, you should use more threads


<lame_excuse>
with all these shell com threading models, and being just the single hand in the programming "team", sometimes i opt for cutting corners
</lame_excuse>

but in essence a cloned window is a fresh thread; only you have to remember to clone it beforehand!
tutils
Member
Posts: 41
Joined: 2005 Jan 25, 13:15

Post by tutils »

nikos wrote:the checksum is a plain one, just an addition
less representative but faster to calculate!
The problem with a checksum is that it's not foolproof. Take two identical files, then increase the first character of one by one and decrease its second character by one: a checksum comparison will claim they still match, although they obviously don't.

However, I understand the speed concern you mentioned. Perhaps offer it as an option? Even CRC-16/32 is much more robust, and still very fast.
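The collision described above is easy to demonstrate. Here is a hedged C++ sketch (a plain sum-of-bytes checksum as stand-in for whatever x2 actually computes, next to a compact table-less CRC-32): swapping +1/-1 between two characters leaves the sum unchanged but changes the CRC.

```cpp
#include <cstdint>
#include <string>

// Plain additive checksum: the sum of all bytes, as discussed above.
uint32_t AdditiveChecksum(const std::string& data) {
    uint32_t sum = 0;
    for (unsigned char c : data) sum += c;
    return sum;
}

// Bitwise CRC-32 (IEEE polynomial, reflected form), no lookup table --
// slower than a table-driven version but compact enough to show the
// contrast.
uint32_t Crc32(const std::string& data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char c : data) {
        crc ^= c;
        for (int i = 0; i < 8; ++i)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

"AC" and "BB" are the minimal case of the +1/-1 swap: identical byte sums, different CRC-32.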
nikos wrote:<lame_excuse>
with all these shell com threading models, and being just the single hand in the programming "team", sometimes i opt for cutting corners
</lame_excuse>
No need to apologize! You did a great job so far, and I'm sure you will keep improving this baby. :D
nikos
Site Admin
Posts: 16401
Joined: 2002 Feb 07, 15:57
Location: UK

Post by nikos »

i'll have a look at these simpler-yet-fast checksums you mentioned, do you have a URL handy?

still the dupechecker uses the checksum only as a quick (negative) check
if checksums are identical then it will do a byte-for-byte cross-check
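That scheme -- checksum as a quick negative test, byte-for-byte comparison as the final word -- can be sketched like so (a hypothetical illustration with strings standing in for file contents, not x2's actual code):

```cpp
#include <cstdint>
#include <string>

// Plain additive checksum, used only to rule matches out cheaply.
uint32_t Checksum(const std::string& data) {
    uint32_t sum = 0;
    for (unsigned char c : data) sum += c;
    return sum;
}

// Strings stand in for file contents read from disk.
bool SameContent(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;        // cheapest negative check
    if (Checksum(a) != Checksum(b)) return false;  // quick negative check
    return a == b;                                 // byte-for-byte cross-check
}
```

With the final byte-for-byte pass in place, a weak checksum can only cost extra comparisons, never produce a false duplicate.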
tutils
Member
Posts: 41
Joined: 2005 Jan 25, 13:15

Post by tutils »

Sure, assuming you want a C++ implementation, here are a couple of links I found on Google:

http://www.codeproject.com/cpp/crc32.asp
http://www.createwindow.com/programming/crc32/

However, I didn't know you run a full byte-for-byte comparison when the checksums match -- that is the most reliable method possible, so in that case, don't bother changing it. If you ever want to skip the full check, though, use a hashing algorithm (MD5 is my recommendation), or replace the checksum with a CRC, if it's fast enough.
sabadell
New Member
Posts: 3
Joined: 2005 Mar 02, 18:01

Post by sabadell »

If I can add my 2 cents to this topic: I would prefer the CRC-32 algorithm to be implemented, because it's the same one that WinZip and WinRAR use. That way you can easily check whether a file in a folder is the same as one included in a ZIP or RAR archive.