Finding duplicate files based on content (2)
Hi,
I am running xplorer² Pro 1.8.0.12 [Unicode].
I have 2 folders with plenty of HTML files each. Most of the files have identical content from one folder to the other.
If I use the Synch Wizard to compare the 2 folders by “content”, only a few files are reported as identical even though most of them are.
Now if I invert the selection to highlight only the files which are supposed to be different, copy those files to an empty folder, and make a new comparison with the Synch Wizard, all the copied files are reported as identical by content with the original files. This is as expected. But why on earth should xplorer² report these same files as different by content when they were located in their original folders?
At some point in the past (in 2004, see posting.php?mode=quote&p=11699), a similar bug was found and fixed. It seems the bug is back...
Maybe I should have given more details.
The 2 folders (folder panes in xplorer²) contain files with identical names. Only the content of a few files might be different from one pane to the next because of modifications. To speed work up I need to know (at one glance if possible) which files were modified (if any).
Note that the file date/time stamps are irrelevant here because one folder contains purely local files, while the other folder contains files downloaded from a server. Unfortunately, uploading files to that server automatically changes their date/time stamp. So I immediately end up with files whose content is identical on the server and on my local hard drive, while their date/time stamps differ. After some time I might make changes to some of the local files. This is why I need file comparison based on content. And it is not currently working in xplorer² Pro. Note that these files do not differ by white space only. They are either completely identical or differ by some text strings (and/or HTML tags too).
I still have a fall-back solution: CSDiff makes folder comparison perfectly well, with the added bonus that it shows the textual differences between files with identical names.
But it would be more practical if I could use xplorer² Pro directly for folder comparison based on file content.
By the way, I did not see how I should go about enabling “loose name matching” in scrap windows. Should I send all files from both folders to a single Scrap Window? It seems Ctrl+F9 is not available then.
Now if I enable the Checksum column I can see that the files xplorer² reports as identical by content are those which have identical checksum numbers. Other identical files have different checksums and are reported as different in content, but they aren’t, not even by white space! I have 3 different file comparison utilities (ExamDiff, CSDiff, and an old editor with file comparison capability): all 3 give me identical content (white space included) for files that xplorer² reports as different in content…
If white space was the culprit, why would the content of files which are reported as different when placed in the original folders become identical simply when I copy these same files to a different (empty) folder? Why would they be different in one folder and not in the other if content is the basis for comparison?
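For anyone who wants to check by hand what makes two "identical" files differ, here is a minimal sketch that reports the first byte at which they diverge. This is only an illustration, not how xplorer² or any of the tools mentioned above work internally; the function name `first_difference` is made up for the example.

```python
from pathlib import Path


def first_difference(path_a, path_b):
    """Return (offset, byte_a, byte_b) for the first mismatch, or None if identical.

    A byte is reported as None when that file ends before the offset.
    (Hypothetical helper, for illustration only.)
    """
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()

    # Walk both files in parallel and stop at the first differing byte.
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset, x, y

    # No mismatch in the common part: the files differ only if one is longer.
    if len(a) != len(b):
        offset = min(len(a), len(b))
        byte_a = a[offset] if offset < len(a) else None
        byte_b = b[offset] if offset < len(b) else None
        return offset, byte_a, byte_b

    return None
```

Calling it on one file from each folder and printing the two byte values in hexadecimal would show whether the difference is in visible text or in a character that text comparison tools ignore.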
I have downloaded, installed and run “Compare Suite light” (http://www.comparesuite.com/csfree.exe) on the same 2 folders. The comparison based on “content” gave correct results: all files were reported as identical (which they were) except for the one file which I had purposefully modified for test purposes.
I wish xplorer² Pro would behave in a similar way. I have sent 2 of these “different” files to Nikos. Maybe he’ll be able to determine what makes them so “different” in xplorer²'s eyes…
Hi everybody,
Let me make one thing absolutely clear. Most of these files are identical. If they are reported as “different” by xplorer², it is only because of some change made during the upload process, or by the server as it stores them.
Nikos has had a look at my sample files. They are reported as different because they are “NOT binary identical”. The newline characters are coded differently from one file to the other: one uses the Unix style 0A and the other the Windows style 0D 0A. Nikos suggests either using software that does not alter the newline character or transferring files in binary mode.
I am using FileZilla, which serves my purposes very well. I have had a look at the FileZilla options. By default, FileZilla treats “.htm” and “.html” files as “ASCII” files, which makes sense (to me at least). This is the probable cause of the change in coding between my local Windows hard drive and the Unix server. So I have now removed “.htm” and “.html” files from the FileZilla list of “ASCII” files, and have set it up so that all uploads are made in “binary mode”. I assume this solves my problem.
This said, I do wish xplorer² Pro had an option to compare ASCII files by textual content, not byte by byte. All the file/folder comparison utilities I have do so. Comparing text files byte by byte is completely beside the point.
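To illustrate the kind of textual comparison I mean, here is a rough sketch that treats the Windows 0D 0A and Unix 0A line endings as equivalent before comparing. It is only an assumption about how such an option could work, not a description of what xplorer² or the other utilities actually do; the function names are invented for the example.

```python
from pathlib import Path


def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (0D 0A) and lone 0D line endings to Unix (0A)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def same_text(path_a, path_b) -> bool:
    """True when two files match once their line endings are normalized.

    Everything else, including spaces, tabs and HTML tags, must still
    match byte for byte. (Hypothetical helper, for illustration only.)
    """
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    return normalize_newlines(a) == normalize_newlines(b)
```

With this approach a local file and its uploaded copy would count as identical if the transfer changed only the line endings, while any edit to the tags or text strings would still be reported as a difference.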
A little history about carriage returns and line feeds.
I remember being flummoxed by this when learning C some quarter century ago. Pascal and Fortran didn't behave the same (at the time), nor did Ada (which no one knew anyway). Still, as terminals were then mostly of the DECWriter variety (basically printers-as-screens), such concepts were important in the labs, never mind the attendant confusion over Character Deletion and Backspace (and the resulting position of the carriage).
It appears that even the dominance of one and the death of the others still hasn't quelled the curse.
Wikipedia wrote: The most commonly used character encoding on the World Wide Web was US-ASCII until December 2007, when it was surpassed by UTF-8
Robert2 wrote: Comparing text files byte by byte is completely beside the point.
Not really.
Considering all the above and that many of the non-printing codes in ASCII have been treated as obsolete (and therefore "open to interpretation") for years, it's no longer really a "standard" as such. So if forced to "pick one and stick with it" - binary is the only logical choice. HTML is sort of the anti-christ of the WYSIWYG generation anyway, isn't it?
Christ or Antichrist, I would not know. I am not concerned with theology or esoterics here, only with very basic practical matters.
As I said, all the file/folder comparison utilities I have compare HTML files as “text” files, not as “binary” files. This approach might have become “non-standard” or “obsolete”, but from the end-user’s standpoint, it is the only approach that makes sense. I only want to determine one thing: has any change been made to the text in these HTML files? By “text” I mean the HTML tags and/or the associated text strings. All other concerns are completely irrelevant to my purpose.
I don’t see why this should not be offered at least as an option in xplorer² Pro. It seems all other file/folder comparison utilities do offer it, or automatically compare HTML files in that natural way.
Besides, if the “non-printing codes” in ASCII were such a problem, the Unix server would not “swallow” my “Windows style” HTML files with such ease… This is another reason why xplorer² should not bother with whatever “non-printing codes” Unix servers use to store HTML files.
Note that “non-printing codes” is a misnomer here: HTML tags are “non-printing” but they are an integral part of the “textual content” of the files…