Finding duplicate files based on content (2)

Post by **Robert2** » 2010 Aug 15, 01:15

Hi,
I am running xplorer² Pro 1.8.0.12 [Unicode].
I have 2 folders with plenty of HTML files each. Most of the files have identical content from one folder to the other.
If I use the Synch Wizard to compare the 2 folders by “content”, only a few files are reported as identical even though most of them are.
Now if I invert the selection to highlight only the files which are supposed to be different, copy those files to an empty folder, and make a new comparison with the Synch Wizard, all the copied files are reported as identical by content with the original files. This is as expected. But why on earth should xplorer² report these same files as different by content when they were located in their original folders?
At some point in the past (in 2004, see posting.php?mode=quote&p=11699), a similar bug was found and fixed. It seems the bug is back...

Post by **nikos** » 2010 Aug 15, 07:33

the sync wizard first tries to match folders and names. See if ticking 'loose name matching' box in Ctrl+F9 helps

Post by **Robert2** » 2010 Aug 15, 07:54

nikos wrote: See if ticking 'loose name matching' box in Ctrl+F9 helps

That option is greyed-out (unavailable) on my system (XP SP3), whichever other option is ticked in Ctrl+F9.
What gives?

Post by **nikos** » 2010 Aug 15, 10:37

you won't see this enabled for single folder comparison, only in scrap windows

what's the smallest test case that will reproduce this bug? Also enable the checksum column, some files may differ in whitespace only

Post by **Robert2** » 2010 Aug 15, 13:47

Maybe I should have given more details.
The 2 folders (folder panes in xplorer²) contain files with identical names. Only the content of a few files might be different from one pane to the next because of modifications. To speed work up I need to know (at one glance if possible) which files were modified (if any).
Note that the file date/time stamps are irrelevant here because one folder contains purely local files, while the other contains files downloaded from a server. Unfortunately, uploading files to that server automatically changes their date/time stamps. So I immediately end up with files whose content is identical on the server and on my local hard drive, but whose date/time stamps differ. After some time I might make changes to some of the local files. This is why I need file comparison based on content, and it is not currently working in xplorer² Pro. Note that these files do not differ by white space only. They are either completely identical or differ by some text strings (and/or HTML tags too).

I still have a fall-back solution: CSDiff makes folder comparison perfectly well, with the added bonus that it shows the textual differences between files with identical names.

But it would be more practical if I could use xplorer² Pro directly for folder comparison based on file content.

By the way, I did not see how I should go about enabling “loose name matching” in scrap windows. Should I send all files from both folders to a single Scrap Window? It seems Ctrl+F9 is not available then.

Now if I enable the Checksum column I can see that the files xplorer² reports as identical by content are those which have identical checksum numbers. Other identical files have different checksums and are reported as different in content, but they aren’t, not even by white space! I have 3 different file comparison utilities (ExamDiff, CSDiff, and an old editor with file comparison capability): all 3 give me identical content (white space included) for files that xplorer² reports as different in content…

If white space was the culprit, why would the content of files which are reported as different when placed in the original folders become identical simply when I copy these same files to a different (empty) folder? Why would they be different in one folder and not in the other if content is the basis for comparison?

Post by **Robert2** » 2010 Aug 15, 15:29

I was wrong. I must have compared the wrong folders. The files are analyzed in the same way, whether they are copied or left in the original folders. The difference must be with the checksum.

Post by **fgagnon** » 2010 Aug 15, 16:59

But if the checksums are different, the files are different.
Maybe an odd format difference made automatically by the storage system protocol
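
A minimal sketch of why two files that look the same can still get different checksums. The algorithm behind the xplorer² checksum column is not stated in this thread, so CRC-32 and SHA-256 are used here purely as stand-ins, and the sample HTML is made up:

```python
import hashlib
import zlib


def checksums(data: bytes) -> tuple:
    """Return (CRC-32, SHA-256 hex digest) for a byte string."""
    return zlib.crc32(data), hashlib.sha256(data).hexdigest()


# Hypothetical sample: the same HTML saved with Windows and Unix line endings.
windows_copy = b"<html>\r\n<body>Hello</body>\r\n</html>\r\n"
unix_copy = windows_copy.replace(b"\r\n", b"\n")

# The visible text is the same, but the bytes (and so the checksums) are not.
print(len(windows_copy), len(unix_copy))
print(checksums(windows_copy) == checksums(unix_copy))
```

Any checksum works on raw bytes, so an invisible difference such as a changed line ending is enough to change it.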

Post by **Robert2** » 2010 Aug 15, 19:12

That the files are different in some way does not mean that their content is different. Unless we understand different things by “content”… HTML files are universally regarded as “pure text” files.

Post by **fgagnon** » 2010 Aug 15, 19:44

I agree with respect to human-interpreted content.
Unfortunately, we are asking for a machine to make the comparison, and what it ignores as irrelevant is only as good as the human programmer has specified.

Post by **Robert2** » 2010 Aug 15, 21:17

I have downloaded, installed and run “Compare Suite light” (http://www.comparesuite.com/csfree.exe) on the same 2 folders. The comparison based on “content” gave correct results: all files were reported as identical (which they were) except for the one file which I had purposefully modified for test purposes.

I wish xplorer² Pro would behave in a similar way. I have sent 2 of these “different” files to Nikos. Maybe he’ll be able to determine what makes them so “different” in xplorer² eyes…

Post by **nikos** » 2010 Aug 16, 05:01

xplorer2 compares files byte by byte. Even one byte amiss means that files are different. That could be formatting for example or a newline
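
For illustration, a byte-by-byte comparison can be sketched as below. This is not xplorer²'s actual code, only a rough equivalent; the function name `files_identical` and the chunk size are assumptions made for the example:

```python
import os


def files_identical(path_a, path_b, chunk_size=64 * 1024):
    """Return True only if both files contain exactly the same bytes."""
    # Files of different sizes cannot be byte-identical, so skip the read.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # a single differing byte is enough
            if not chunk_a:
                return True  # both files ended without any mismatch
```

With a check like this, a file saved with 0D 0A line endings and the same file saved with 0A line endings are reported as different, because their sizes already disagree.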

Post by **Gandolf** » 2010 Aug 16, 06:20

Have you tried a binary compare using any of your file compare programs?

I've had a situation where the EOL characters are different: CR/LF (DOS) as opposed to just a CR (Mac) or an LF (Unix) character. This can happen if the editor used changes the saved format.
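
One quick way to check for this is to count each kind of line ending in the raw bytes. A small sketch; the helper name `count_line_endings` is made up for the example:

```python
def count_line_endings(data: bytes) -> dict:
    """Count CR/LF pairs, lone LF and lone CR characters in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,                      # DOS/Windows: 0D 0A
        "LF": data.count(b"\n") - crlf,    # Unix: 0A on its own
        "CR": data.count(b"\r") - crlf,    # old Mac: 0D on its own
    }


def line_endings_of(path) -> dict:
    """Read a file in binary mode so no newline translation takes place."""
    with open(path, "rb") as handle:
        return count_line_endings(handle.read())
```

If two files that look the same in an editor give different counts, the line endings are the likely reason a binary compare flags them.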

Post by **Robert2** » 2010 Aug 16, 15:03

Hi everybody,
Let me make one thing absolutely clear. Most of these files are identical. If they are reported as “different” by xplorer², it is only because of some change made during the upload process, or by the server as it stores them.

Nikos has had a look at my sample files. They are reported as different because they are “NOT binary identical”. The new line characters are coded differently from one file to the next: one uses the Unix style 0A and the other one the Windows style 0D 0A. Nikos suggests using software that doesn't alter the new line character or that I transfer files in binary mode.

I am using FileZilla which serves my purposes very well. I have had a look at the FileZilla options. By default, FileZilla treats “.htm” and “.html” files as “ASCII” files, which makes sense (to me at least). This is the probable cause of the change in coding between my local Windows hard drive and the Unix server. So I have now removed “.htm” and “.html” files from the FileZilla list of “ASCII” files, and have set it up so that all uploads are made in “binary mode”. I assume this solves my problem.

This said, I do wish xplorer² Pro had an option to compare ASCII files by textual content, not byte by byte. All the file/folder comparison utilities I have do so. Comparing text files byte by byte is completely beside the point.
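
Until such an option exists, a comparison that ignores line-ending style can be sketched outside xplorer². This is only a rough illustration: the function names are hypothetical, it assumes both folders hold files with matching names, and it treats 0D 0A, 0D and 0A as equivalent while still comparing every other byte:

```python
import os


def normalize_newlines(data: bytes) -> bytes:
    """Rewrite Windows (0D 0A) and old Mac (0D) line endings as Unix (0A)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def same_text(path_a, path_b) -> bool:
    """Compare two files while ignoring differences in line-ending style."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return normalize_newlines(fa.read()) == normalize_newlines(fb.read())


def modified_files(folder_a, folder_b) -> list:
    """List names found in both folders whose text differs beyond line endings."""
    common = sorted(set(os.listdir(folder_a)) & set(os.listdir(folder_b)))
    changed = []
    for name in common:
        path_a = os.path.join(folder_a, name)
        path_b = os.path.join(folder_b, name)
        if os.path.isfile(path_a) and os.path.isfile(path_b):
            if not same_text(path_a, path_b):
                changed.append(name)
    return changed
```

Calling `modified_files` on the local folder and the downloaded folder would return only the names whose tags or text strings actually changed. Each file is read fully into memory, which is fine for HTML pages but not for very large files.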

Post by **Kilmatead** » 2010 Aug 16, 16:28

A little history about carriage returns and line feeds.

I remember being flummoxed by this when learning C some quarter century ago. Pascal and Fortran didn't behave the same (at the time), nor did Ada (which no one knew anyway). Though as terminals were still populated by the DECWriter variety (basically printers-as-screens) such concepts were important in the labs - never mind the attendant confusion over Character Deletion and Backspace (and the resulting position of the carriage).

It appears that even the dominance of one and the death of the others still hasn't quelled the curse.

Wikipedia wrote: The most commonly used character encoding on the World Wide Web was US-ASCII until December 2007, when it was surpassed by UTF-8

Robert2 wrote: Comparing text files byte by byte is completely beside the point.

Not really.

Considering all the above and that many of the non-printing codes in ASCII have been treated as obsolete (and therefore "open to interpretation") for years, it's no longer really a "standard" as such. So if forced to "pick one and stick with it" - binary is the only logical choice. HTML is sort of the anti-christ of the WYSIWYG generation anyway, isn't it?

Post by **Robert2** » 2010 Aug 16, 17:54

Christ or Antichrist, I would not know. I am not concerned with theology or esoterics here, only with very basic practical matters.

As I said, all the file/folder comparison utilities I have compare HTML files as “text” files, not as “binary” files. This approach might have become “non-standard” or “obsolete”, but from the end-user’s standpoint, it is the only approach that makes sense. I only want to determine one thing: has any change been made to the text in these HTML files? By “text” I mean the HTML tags and/or the associated text strings. All other concerns are completely irrelevant to my purpose.

I don’t see why this should not be offered at least as an option in xplorer² Pro. It seems all other file/folder comparison utilities do offer it, or automatically compare HTML files in that natural way.

Besides, if the “non-printing codes” in ASCII were such a problem, the Unix server would not “swallow” my “Windows style” HTML files with such ease… This is another reason why xplorer² should not bother with whatever “non-printing codes” Unix servers use to store HTML files.

Note that “non-printing codes” is a misnomer here: HTML tags are “non-printing” but they are an integral part of the “textual content” of the files…