Finding duplicate files based on content (2)

Discussion & Support for xplorer² professional


Robert2
Gold Member
Posts: 673
Joined: 2004 Jun 17, 15:39

Post by Robert2 »

Hi,
I am running xplorer² Pro 1.8.0.12 [Unicode].
I have 2 folders with plenty of HTML files each. Most of the files have identical content from one folder to the other.
If I use the Synch Wizard to compare the 2 folders by “content”, only a few files are reported as identical even though most of them are.
Now if I invert the selection to highlight only the files which are supposed to be different, copy those files to an empty folder, and make a new comparison with the Synch Wizard, all the copied files are reported as identical by content with the original files. This is as expected. But why on earth should xplorer² report these same files as different by content when they were located in their original folders?
At some point in the past (in 2004, see posting.php?mode=quote&p=11699), a similar bug was found and fixed. It seems the bug is back...
nikos
Site Admin
Posts: 15794
Joined: 2002 Feb 07, 15:57
Location: UK

Post by nikos »

the sync wizard first tries to match folders and names. See if ticking 'loose name matching' box in Ctrl+F9 helps
Robert2
Gold Member
Posts: 673
Joined: 2004 Jun 17, 15:39

Post by Robert2 »

nikos wrote: See if ticking 'loose name matching' box in Ctrl+F9 helps
That option is greyed-out (unavailable) on my system (XP SP3), whichever other option is ticked in Ctrl+F9.
What gives?
nikos
Site Admin
Posts: 15794
Joined: 2002 Feb 07, 15:57
Location: UK

Post by nikos »

you won't see this enabled for single folder comparison, only in scrap windows

what's the smallest test case that will reproduce this bug? Also enable the checksum column, some files may differ in whitespace only
Robert2
Gold Member
Posts: 673
Joined: 2004 Jun 17, 15:39

Post by Robert2 »

Maybe I should have given more details.
The 2 folders (folder panes in xplorer²) contain files with identical names. Only the content of a few files might be different from one pane to the next because of modifications. To speed work up I need to know (at one glance if possible) which files were modified (if any).
Note that the file date/time stamps are irrelevant here, because one folder contains purely local files while the other contains files downloaded from a server. Unfortunately, uploading files to that server automatically changes their date/time stamp, so I immediately end up with files whose content is identical on the server and on my local hard drive but whose date/time stamps differ. After some time I might make changes to some of the local files. This is why I need file comparison based on content, and it is not currently working in xplorer² Pro. Note that these files do not differ by white space only: they are either completely identical or differ by some text strings (and/or HTML tags too).
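For illustration only, the kind of check described here (same file names in two folders, time stamps ignored, content compared) can be sketched in Python with the standard `filecmp` module. The function name `compare_folders_by_content` and the folder arguments are hypothetical; this is not how xplorer² itself works, just a rough stand-alone equivalent.

```python
import filecmp
from pathlib import Path

def compare_folders_by_content(local_dir, server_dir):
    """Compare files that exist under the same name in both folders.

    Returns three lists of file names: identical, different, and
    files that could not be compared.
    """
    local_dir, server_dir = Path(local_dir), Path(server_dir)
    local_names = {p.name for p in local_dir.iterdir() if p.is_file()}
    server_names = {p.name for p in server_dir.iterdir() if p.is_file()}
    common = sorted(local_names & server_names)
    # shallow=False makes filecmp read the file contents instead of
    # trusting size and modification time, so time stamps do not matter.
    return filecmp.cmpfiles(local_dir, server_dir, common, shallow=False)
```

Note that this comparison is byte-exact, so two copies that differ only in their line-ending bytes end up in the "different" list.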

I still have a fall-back solution: CSDiff makes folder comparison perfectly well, with the added bonus that it shows the textual differences between files with identical names.

But it would be more practical if I could use xplorer² Pro directly for folder comparison based on file content.

By the way, I did not see how I should go about enabling “loose name matching” in scrap windows. Should I send all files from both folders to a single Scrap Window? It seems Ctrl+F9 is not available then.

Now if I enable the Checksum column I can see that the files xplorer² reports as identical by content are those which have identical checksum numbers. Other identical files have different checksums and are reported as different in content, but they aren’t, not even by white space! I have 3 different file comparison utilities (ExamDiff, CSDiff, and an old editor with file comparison capability): all 3 give me identical content (white space included) for files that xplorer² reports as different in content…

If white space was the culprit, why would the content of files which are reported as different when placed in the original folders become identical simply when I copy these same files to a different (empty) folder? Why would they be different in one folder and not in the other if content is the basis for comparison?
Robert2
Gold Member
Posts: 673
Joined: 2004 Jun 17, 15:39

Post by Robert2 »

I was wrong. I must have compared the wrong folders. The files are analyzed in the same way, whether they are copied or left in the original folders. The difference must be with the checksum.
fgagnon
Site Admin
Posts: 3737
Joined: 2003 Sep 08, 19:56
Location: Springfield

Post by fgagnon »

But if the checksums are different, the files are different.
Maybe an odd format difference made automatically by the storage system protocol :crazy:
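A small Python sketch of why two files that look the same in a text viewer can still show different checksums. CRC-32 is used here purely as an assumption; the thread does not say which checksum xplorer² displays. The sample HTML bytes are made up.

```python
import zlib

# Two made-up copies of the same page, differing only in line endings.
windows_copy = b"<html>\r\n<body>Hello</body>\r\n</html>\r\n"
unix_copy = windows_copy.replace(b"\r\n", b"\n")

def crc32_hex(data):
    """Return the CRC-32 of a bytes object as 8 hex digits."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")

# The visible lines are the same in both copies...
same_lines = windows_copy.decode("ascii").splitlines() == unix_copy.decode("ascii").splitlines()
# ...but the checksum is computed over the raw bytes, which differ.
same_checksum = crc32_hex(windows_copy) == crc32_hex(unix_copy)
print(same_lines, same_checksum)
```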
Robert2
Gold Member
Posts: 673
Joined: 2004 Jun 17, 15:39

Post by Robert2 »

That the files are different in some way does not mean that their content is different. Unless we understand different things by “content”… HTML files are universally regarded as “pure text” files.
fgagnon
Site Admin
Posts: 3737
Joined: 2003 Sep 08, 19:56
Location: Springfield

Post by fgagnon »

I agree wrt human interpreted content.
Unfortunately, we are asking for a machine to make the comparison, and what it ignores as irrelevant is only as good as the human programmer has specified.  :shrug:
Robert2
Gold Member
Posts: 673
Joined: 2004 Jun 17, 15:39

Post by Robert2 »

I have downloaded, installed and run “Compare Suite light” (http://www.comparesuite.com/csfree.exe) on the same 2 folders. The comparison based on “content” gave correct results: all files were reported as identical (which they were) except for the one file which I had purposefully modified for test purposes.

I wish xplorer² Pro would behave in a similar way. I have sent 2 of these “different” files to Nikos. Maybe he’ll be able to determine what makes them so “different” in xplorer²'s eyes…
nikos
Site Admin
Posts: 15794
Joined: 2002 Feb 07, 15:57
Location: UK

Post by nikos »

xplorer2 compares files byte by byte. Even one byte amiss means that the files are different. That could be formatting, for example, or a newline.
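As a rough illustration of what a byte-by-byte comparison means (a sketch, not xplorer2's actual code), the following Python function reports the offset of the first byte at which two files differ. The name `first_difference` and the chunk size are arbitrary choices.

```python
def first_difference(path_a, path_b, chunk_size=64 * 1024):
    """Return the offset of the first differing byte, or None if the
    two files are binary identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                # Find the exact position inside the mismatching chunk.
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                # One file is a prefix of the other: they differ
                # where the shorter one ends.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                return None  # both files ended without a mismatch
            offset += len(chunk_a)
```

Looking at the reported offset in a hex viewer shows which byte is "amiss", for example a 0D that is present in one copy and missing in the other.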
Gandolf
Gold Member
Posts: 470
Joined: 2004 Jun 12, 10:47

Post by Gandolf »

Have you tried a binary compare using any of your file compare programs?

I've had a situation where the EOL characters are different: CR/LF (DOS) as opposed to just a CR (Mac) or an LF (Unix) character. This can happen if the editor used changes the saved format.
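To check which EOL convention a file actually uses, something along these lines could work. It is a minimal Python sketch; `count_line_endings` is a made-up helper name, and it expects the raw bytes of the file (read in binary mode).

```python
import re

def count_line_endings(data):
    """Count CR LF pairs, lone LF and lone CR bytes in a bytes object."""
    counts = {"CRLF": 0, "LF": 0, "CR": 0}
    names = {b"\r\n": "CRLF", b"\n": "LF", b"\r": "CR"}
    # "\r\n" is tried first so that a CR LF pair is counted once,
    # not as a separate CR and LF.
    for match in re.finditer(rb"\r\n|\n|\r", data):
        counts[names[match.group()]] += 1
    return counts
```

A file with only "CRLF" counts was saved DOS/Windows style, only "LF" means Unix style, and a mix usually points to an editor or transfer tool that rewrote some of the lines.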
Robert2
Gold Member
Posts: 673
Joined: 2004 Jun 17, 15:39

Post by Robert2 »

Hi everybody,
Let me make one thing absolutely clear. Most of these files are identical. If they are reported as “different” by xplorer², it is only because of some change made during the upload process, or by the server as it stores them.

Nikos has had a look at my sample files. They are reported as different because they are “NOT binary identical”. The new line characters are coded differently from one file to the next: one uses the Unix style 0A and the other one the Windows style 0D 0A. Nikos suggests using software that doesn't alter the new line character or that I transfer files in binary mode.

I am using FileZilla which serves my purposes very well. I have had a look at the FileZilla options. By default, FileZilla treats “.htm” and “.html” files as “ASCII” files, which makes sense (to me at least). This is the probable cause of the change in coding between my local Windows hard drive and the Unix server. So I have now removed “.htm” and “.html” files from the FileZilla list of “ASCII” files, and have set it up so that all uploads are made in “binary mode”. I assume this solves my problem.

This said, I do wish xplorer² Pro had an option to compare ASCII files by textual content, not byte by byte. All the file/folder comparison utilities I have do so. Comparing text files byte by byte is completely beside the point.
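For what it is worth, a comparison that ignores only the line-ending style is easy to sketch outside xplorer². The Python below is one possible approach under that assumption; `normalize_newlines` and `same_text` are hypothetical names, and every other byte (tags, text, spaces) still has to match exactly.

```python
def normalize_newlines(data):
    """Convert Windows (0D 0A) and old Mac (0D) line endings to Unix (0A)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

def same_text(path_a, path_b):
    """True if the two files are identical once line endings are unified."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return normalize_newlines(fa.read()) == normalize_newlines(fb.read())
```

This reads each file fully into memory, which should be acceptable for typical HTML pages but is a deliberate simplification.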
Kilmatead
Platinum Member
Posts: 4578
Joined: 2008 Sep 30, 06:52
Location: Dublin

Post by Kilmatead »

A little history about carriage returns and line feeds.

I remember being flummoxed by this when learning C some quarter century ago.  Pascal and Fortran didn't behave the same (at the time), nor did Ada (which no one knew anyway).  Though as terminals were still populated by the DECWriter variety (basically printers-as-screens) such concepts were important in the labs - never mind the attendant confusion over Character Deletion and Backspace (and the resulting position of the carriage).

It appears that even the dominance of the one and the death of the others still hasn't quelled the curse. :wink:
Wikipedia wrote: The most commonly used character encoding on the World Wide Web was US-ASCII until December 2007, when it was surpassed by UTF-8
Robert2 wrote: Comparing text files byte by byte is completely beside the point.
Not really.

Considering all the above and that many of the non-printing codes in ASCII have been treated as obsolete (and therefore "open to interpretation") for years, it's no longer really a "standard" as such.  So if forced to "pick one and stick with it" - binary is the only logical choice.  HTML is sort of the anti-christ of the WYSIWYG generation anyway, isn't it?
Robert2
Gold Member
Posts: 673
Joined: 2004 Jun 17, 15:39

Post by Robert2 »

Christ or Antichrist, I would not know. I am not concerned with theology or esoterics here, only with very basic practical matters.

As I said, all the file/folder comparison utilities I have compare HTML files as “text” files, not as “binary” files. This approach might have become “non-standard” or “obsolete”, but from the end-user’s standpoint, it is the only approach that makes sense. I only want to determine one thing: has any change been made to the text in these HTML files? By “text” I mean the HTML tags and/or the associated text strings. All other concerns are completely irrelevant to my purpose.

I don’t see why this should not be offered at least as an option in xplorer² Pro. It seems all other file/folder comparison utilities do offer it, or automatically compare HTML files in that natural way.

Besides, if the “non-printing codes” in ASCII were such a problem, the Unix server would not “swallow” my “Windows style” HTML files with such ease… This is another reason why xplorer² should not bother with whatever “non-printing codes” Unix servers use to store HTML files.

Note that “non-printing codes” is a misnomer here: HTML tags are “non-printing” but they are an integral part of the “textual content” of the files…