Endian: A Plugin for Viewing ASCII/UTF-8/16/32 Encoding & Heuristic Detection of Charactersets

Moderators: fgagnon, nikos

Kilmatead
Platinum Member
Posts: 4569
Joined: 2008 Sep 30, 06:52
Location: Dublin

Endian: A Plugin for Viewing ASCII/UTF-8/16/32 Encoding & Heuristic Detection of Charactersets

Post by Kilmatead » 2016 Oct 15, 11:56

Gulliver's Travels Wiki wrote:Traditionally, Lilliputians broke boiled eggs on the larger end; a few generations ago, an Emperor of Lilliput, the Present Emperor's great-grandfather, had decreed that all eggs be broken on the smaller end after his son cut himself breaking the egg on the larger end. The differences between Big-Endians (those who broke their eggs at the larger end) and Little-Endians had given rise to "six rebellions... wherein one Emperor lost his life, and another his crown". The Lilliputian religion says an egg should be broken on the convenient end, which is now interpreted by the Lilliputians as the smaller end.
What it is:

If you work with international text, you'll often have files that use several character sets in differing encodings (ASCII, Unicode, etc). Most people tend to stick with one encoding and leave it at that, but often find that things they've downloaded, or had edited and returned by friends overseas, are no longer recognised by their associated programmes and environments. This can also be influenced by the End-of-Line conventions used in cross-platform environments, where Windows uses CRLF and Unix/Linux simply LF.

The traditional way to figure out what encoding a text file has is to just load it into a text-editor such as Notepad++ (or whatever) and check there. This is fine for one or two files, but what if you have hundreds of files across multiple projects where you wish you could "just see" (or search for) whatever encoding/character-set/EOLN-style each file actually contains when browsing?

That's what the Endian plugin does for x2 - it provides two custom columns which automatically display those per-file properties: one column for the "Encoding" itself, and a second as an EOLN (End-of-Line) style indicator. The file contents are not changed in any way; these details are heuristically sourced by reading a portion of the data in real-time.

How to use it:

This is a WDX "Content" Plugin, supported as of x2 Pro/Ult v3.1.0.0; all you need to do is download the x2 Plugin Manager (no installation required), and the plugin itself: Endian v1.0.0.5.

Extract the archives and run the plugin manager, then drag-&-drop either the 64-bit plugin (Endian.WDX64) or 32-bit (Endian.WDX) into the window and click "Apply". You will need to restart x2 - close it completely first, using <Alt+X> or File -> Exit.

Once x2 is running, use <Alt+K> to enter the column selection dialog, and scroll all the way to the bottom of the available columns list - there you'll find entries named "Encoding.Endian [X]" and "EOLN.Endian [X]" - just double-click one (or both), and you're done. :D

(After you've run the plugin for the first time, you can press <F5> or the Reset button in the plugin manager and you'll see the Configuration/Detection-String has been automatically populated for you... I chose a simple default set of common text-file extensions for which the plugin will apply itself, but you may edit them (just double-click the plugin) to remove/add-more at your leisure. Remember to restart x2 fully to apply any changes you may have made.)

How it works:

For simplicity's sake, any text file with a BOM (Byte-Order-Mark) is identified directly by it, so UTF-8/16/32 files carrying one are reliable enough to identify. Yes, I know BOM's are not 100% infallible, but they're a very strong indicator of content. That said, many UTF-8 files commonly don't have BOM's, so for those (and any other non-BOM files) the first 512 bytes of the file are read into a buffer to be scanned and (hopefully) identified. The plugin uses the uchardet library, which is derived from Mozilla's Universal Charset Detector code, commonly employed by many browsers and text editors (Notepad++, Firefox, etc).
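To illustrate the BOM step, here is a minimal Python sketch (the plugin itself is not written in Python, and this is not its actual source). The function name detect_bom and the returned labels are my own assumptions; the BOM byte values come from Python's standard codecs module.

```python
import codecs

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones:
# the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(head: bytes):
    """Return an encoding label if `head` starts with a known BOM, else None."""
    for bom, label in BOMS:
        if head.startswith(bom):
            return label
    return None  # no BOM: fall back to heuristic detection
```

Only the first four bytes of a file are needed for this check, which is why BOM-carrying files are cheap to identify compared with the heuristic scan.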

Generally, uchardet is quite reliable - though as heuristic identification goes, no system can be 100% accurate, so consider the results you get (for either the Encoding or the End-of-Line type) to be a "strong indicator" of content, rather than received wisdom from any given deity.

For example, BOM-less UTF-8 files may identify as ASCII if they don't actually include any non-ASCII characters in their content. This happens because ASCII is exactly the single-byte subset of UTF-8 (code points 0-127 are encoded identically in both), and there's no way to tell the difference from content alone (unless, as mentioned, characters from higher-up the set are actually used). Hence the reason many people prefer to use BOM's to ensure overall consistency in a filebase. :shrug:
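A small Python sketch shows why the two are indistinguishable. This is a deliberately crude stand-in and not what uchardet does internally; classify_without_bom is a hypothetical helper name.

```python
def classify_without_bom(data: bytes) -> str:
    """Crude illustration: pure 7-bit data is reported as ASCII,
    anything else is tested for UTF-8 validity."""
    if all(b < 0x80 for b in data):
        # Every byte is 7-bit, so the data is valid ASCII *and* valid UTF-8.
        return "ASCII"
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "unknown"  # a real detector would try other charsets here


plain = "plain text".encode("utf-8")   # saved as UTF-8, but only ASCII characters
accented = "naïve".encode("utf-8")     # contains one two-byte character
```

Here classify_without_bom(plain) gives "ASCII" even though the text was encoded as UTF-8, while classify_without_bom(accented) gives "UTF-8".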

If you encounter a lot of false-positives, the default buffer-size can be increased (see below) to almost any size (I chose 512 bytes as a default since that passes most of the given tests - but the more data supplied to uchardet, the more accurate it will be at the expense of speed). I don't recommend anything less than 512 bytes, but anything greater is fine. You can even set an absurdly high number just to make sure all files are read-in fully, but depending on the disc-drive type, these may take longer to process for the columns to populate. Keep in mind that buffersizes are in bytes, so if you're dealing with unicode files (where individual characters may take up to 4-bytes each), it's better to err on the side of caution and just over-provision to read as many characters as you deem practical.
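To make the bytes-versus-characters point concrete, this Python sketch counts how many whole characters fit in a fixed-size byte buffer. The helper name chars_in_prefix is made up for illustration; 512 is the default buffer size mentioned above.

```python
def chars_in_prefix(text: str, encoding: str, buffer_size: int = 512) -> int:
    """Count the complete characters contained in the first `buffer_size`
    bytes of `text` once it has been encoded."""
    prefix = text.encode(encoding)[:buffer_size]
    # errors="ignore" discards a trailing character that the cut split in half.
    return len(prefix.decode(encoding, errors="ignore"))
```

With a 512-byte buffer, plain ASCII text yields 512 characters, text made of two-byte UTF-8 characters yields 256, and three-byte characters (most CJK text in UTF-8) yield only 170, so the detector sees far less actual text in those files.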

Customisation:

The plugin automatically creates an eponymous configuration file ("Endian.ini") to contain settings. This file is created in the same folder as the WDX itself, so make sure you have write-access to that folder if you want to be able to customise the plugin behaviour.

In the interests of localisation, you may label the columns themselves anything you wish in your local language, if you're more comfortable with that. Just remember that you cannot use any unicode characters in the ColumnLabel=<name> itself (ironically). This is a limitation of plugins in general (they are an odd mix of ASCII and Unicode internal functions), so that's life. :wink:

The ShowEndianness key determines whether the column displays LE/BE (Little-Endian/Big-Endian) after the UTF code, or BOM for UTF-8 files.

IndicateUTF8SansBOM works the same as the column-label - just replace the text after the '=' with whatever you want it to say for UTF-8 files without BOM's (obviously). Again, I just chose a simple default for simplicity. If you don't want this at all and just find it distracting, remove any text and leave the area after the '=' sign blank. I included this option so that you can search for UTF-8 files which specifically have no BOM's - just use <Ctrl+F> (in x2) and set an advanced rule which checks the Encoding.Endian [X] property for whatever text you use. This search-ability applies to all other character-set encodings as well - any label or partial-label can be searched for, system-wide. :D (As plugins also work in Nikos' more advanced DeskRule search utility, once you've set-up the plugin in x2, it will automatically be available as a property to search in DeskRule too. Handy, that. :wink:)

The EOLN column has its own (separate) BufferSize setting, which may be set to 0 (the default) meaning it reads the entire file-contents, or (in case of slow-networks, etc) you may set a specific size (in bytes). While it is not usually necessary to read the whole file just to determine End-of-Line types, it does provide the extra information of how many lines the file has. Technically, the number shown is the number of EOLN's themselves, which is usually 1 less than the number of lines in any actual document (as viewed in an editor). If a file is empty or has no End-of-Line characters in it, <Sans-EOLN> is displayed.
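As a rough idea of what counting EOLN's involves in a single-byte encoding, here is a Python sketch. It is not the plugin's algorithm: the function names and the "label (count)" output format are my assumptions, and the inverted LFCR form is ignored. Only the three style labels and <Sans-EOLN> are taken from the plugin's defaults.

```python
def count_eols(data: bytes) -> dict:
    """Count line breaks by style in single-byte (ASCII-compatible) data."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # lone LF: not preceded by CR
    cr = data.count(b"\r") - crlf   # lone CR: not followed by LF
    return {"Win CRLF": crlf, "Unix LF": lf, "Mac CR": cr}


def eoln_label(data: bytes) -> str:
    """Name the dominant style and the total number of line breaks."""
    counts = count_eols(data)
    total = sum(counts.values())
    if total == 0:
        return "<Sans-EOLN>"
    style = max(counts, key=counts.get)
    return f"{style} ({total})"
```

A file whose last line has no trailing break shows a count one lower than the line number an editor reports, as described above: eoln_label(b"a\r\nb\r\nc") reports two CRLF breaks for three lines of text.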

The "Win CRLF", "Unix LF", or "Mac CR" labels may be customised as well, if desired.

Final Thoughts:

Special thanks to BYVoid for hosting, updating, and generally maintaining the uchardet library itself, as well as the multi-lingual Test Encoding files used for verification purposes. All credit goes to him (and, originally, Mozilla) - I just did the repackaging, they do the heavy-lifting.

And thanks to Nikos for pointing me in the right direction about C-compiler name-mangling - nothing is ever what it seems after it's been processed, be it McDonald's hamburgers or programme source-code. :shock:

Available Character-Set Label Results:
  • International (Unicode)
      · UTF-8
      · UTF-16BE / UTF-16LE
      · UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-34121 / X-ISO-10646-UCS-4-21431
  • Arabic
      · ISO-8859-6
      · WINDOWS-1256
  • Bulgarian
      · ISO-8859-5
      · WINDOWS-1251
  • Chinese
      · ISO-2022-CN
      · BIG5
      · EUC-TW
      · GB18030
      · HZ-GB-2312
  • Danish
      · ISO-8859-1
      · ISO-8859-15
      · WINDOWS-1252
  • English
      · ASCII
  • Esperanto
      · ISO-8859-3
  • French
      · ISO-8859-1
      · ISO-8859-15
      · WINDOWS-1252
  • German
      · ISO-8859-1
      · WINDOWS-1252
  • Greek
      · ISO-8859-7
      · WINDOWS-1253
  • Hebrew
      · ISO-8859-8
      · WINDOWS-1255
  • Hungarian
      · ISO-8859-2
      · WINDOWS-1250
  • Japanese
      · ISO-2022-JP
      · SHIFT_JIS
      · EUC-JP
  • Korean
      · ISO-2022-KR
      · EUC-KR
  • Russian
      · ISO-8859-5
      · KOI8-R
      · WINDOWS-1251
      · MAC-CYRILLIC
      · IBM866
      · IBM855
  • Spanish
      · ISO-8859-1
      · ISO-8859-15
      · WINDOWS-1252
  • Thai
      · TIS-620
      · ISO-8859-11
  • Turkish
      · ISO-8859-3
      · ISO-8859-9
  • Vietnamese
      · VISCII
      · WINDOWS-1258
  • Others
      · WINDOWS-1252
Last edited by Kilmatead on 2017 Jul 19, 09:45, edited 5 times in total.

Kilmatead
Platinum Member
Posts: 4569
Joined: 2008 Sep 30, 06:52
Location: Dublin

Re: Endian: A Plugin for ASCII/UTF-8/16/32 Encoding & Heuristic Detection of Charactersets

Post by Kilmatead » 2016 Oct 15, 12:03

It should be mentioned that Nikos didn't think this would be of much use to anyone other than "Linux Freaks". (It's just like him to ruin all the fun of my first plugin - I slave for hours and hours and he just bursts my bubble of pride in a shower of dismissal and indifference. Humbug.)

Personally, I think it's rather cool (and useful), so if you're on my side of the fence, shout-out and prove the old grumpy git wrong (and by Jesus was he ever grumpy this week - cheer up, old sod!).

Somebody's got to carry the flag for the international kids, and if it's up to us to shoulder the burden, we'll rise to the challenge while he ponders his navel and sips his Mai Tai's in the sun, thinking he can just take all the credit while ducking the really useful issues. :D

Tuxman
Platinum Member
Posts: 1483
Joined: 2009 Aug 19, 07:49

Re: Endian: A Plugin for Viewing ASCII/UTF-8/16/32 Encoding & Heuristic Detection of Charactersets

Post by Tuxman » 2016 Oct 16, 13:32

I find your latest time-waster at least recognisably useful.
Tux. ; tuxproject.de ; Windows 10 x64
registered xplorer² pro user since Oct 2009, ultimated in Mar 2012

Kilmatead
Platinum Member
Posts: 4569
Joined: 2008 Sep 30, 06:52
Location: Dublin

Re: Endian: A Plugin for Viewing ASCII/UTF-8/16/32 Encoding & Heuristic Detection of Charactersets

Post by Kilmatead » 2016 Oct 16, 13:57

Now if that ain't high-praise, I don't know what is!

<As he slides the sealed-envelope across the table...>
Last edited by Kilmatead on 2017 Jul 19, 09:46, edited 1 time in total.

Brig
Silver Member
Posts: 208
Joined: 2002 Aug 05, 16:01
Location: Michigan

Re: Endian: A Plugin for Viewing ASCII/UTF-8/16/32 Encoding & Heuristic Detection of Charactersets

Post by Brig » 2016 Oct 17, 15:11

K, this is fabulous! In 2012 I asked if something like this were possible: viewtopic.php?f=18&t=9622. Seeing this information at a glance is very useful to me--it prevents small headaches. Thank you! May you stay forever young.

One thing: Why does the column disappear when I enter and then return from a subfolder? Even after I "Save settings now."

Kilmatead
Platinum Member
Posts: 4569
Joined: 2008 Sep 30, 06:52
Location: Dublin

Re: Endian: A Plugin for Viewing ASCII/UTF-8/16/32 Encoding & Heuristic Detection of Charactersets

Post by Kilmatead » 2016 Oct 17, 15:41

Brig wrote:Why does the column disappear when I enter and then return from a subfolder? Even after I "Save settings now."
Columns are saved using Actions -> Folder Settings, they are only saved with "normal" settings if you're not already in a folder that has dedicated column selections - there they then become part of the default for folders otherwise bereft of a bespoke column set.

Also note that the addition/removal of other columns provided by 3rd-party shell-extensions, etc, can mess with the internal order of things, so you may find plugin columns arguing amongst themselves for who gets to ride in the front seat next to daddy and who doesn't. This is just the way custom columns work... if you don't add and remove shell extensions/plugins often, you should find it fairly consistent. Anything as simple as uninstalling a PDF reader you no longer use can wreak havoc with your happily organised universe. I'd love to petition Nikos for some better way of storing them (by name, perhaps, not by CLSID), but he'd just say he can only provide what the shell presents to him - and it's hard to argue with that. :shrug:

Incidentally, I actually spent a long time searching for that thread of yours just to show Nikos that I wasn't the only one who'd think this was the cat's-meow. It's somewhat ironic that I was the one providing all the excuses as to why it would be such a pain in the arse to implement! Always fun to prove one's self wrong in the end. :D

Brig
Silver Member
Posts: 208
Joined: 2002 Aug 05, 16:01
Location: Michigan

Re: Endian: A Plugin for Viewing ASCII/UTF-8/16/32 Encoding & Heuristic Detection of Charactersets

Post by Brig » 2016 Oct 17, 16:14

Kilmatead wrote:Columns are saved using Actions -> Folder Settings, they are only saved with "normal" settings if you're not already in a folder that has dedicated column selections - there they then become part of the default for folders otherwise bereft of a bespoke column set.
Of course. Thanks. It seems like every few years I ask this same question: Does this mean that every folder will now be littered with a desktop.ini file? I don't have that problem now, and I use some custom columns:

Name | Folder Size | Folder Children | File Children | Modified | Pages | Owner

Thanks.

Kilmatead
Platinum Member
Platinum Member
Posts: 4569
Joined: 2008 Sep 30, 06:52
Location: Dublin

Re: Endian: A Plugin for Viewing ASCII/UTF-8/16/32 Encoding & Heuristic Detection of Charactersets

Post by Kilmatead » 2016 Oct 17, 16:24

Yes, custom-folder settings are stored in the desktop.ini files. When in doubt, just delete the pesky things and only use bespoke settings for folders which really require them, like pictures (for thumbnail view), etc. The "type" of columns are not especially relevant when they are stored - [X] columns may be included as part of the default set just as easily as part of a specific one.

I tend to keep "regularly used" columns as part of the default set; this always struck me as the most practical method, with few desktop.ini's littering the place. The ini's are, however, convenient when you want to "copy" the settings from one folder to another one... just copy the .ini itself and forget it. :wink:

Also, don't forget that columns may also be placed at the bottom of the pane, where they don't take up all the width-space in folders when they're not oft used. The ones at the bottom only activate for the focused-item, which can also save on resources given sluggish PC's and drives.

This stuff could also be displayed in the <Ctrl+Z> details-pane... though that requires messing around with the HTML-goo inside, which, to be honest, never proved to be my forté, so others may be better positioned to provide help with that approach.

Brig
Silver Member
Posts: 208
Joined: 2002 Aug 05, 16:01
Location: Michigan

Re: Endian: A Plugin for Viewing ASCII/UTF-8/16/32 Encoding & Heuristic Detection of Charactersets

Post by Brig » 2016 Oct 18, 16:34

It turns out I really despise those desktop.ini files. Really can't stand them. Moreover, they're inconvenient (at best) to have in many of the folders I use, which are also the ones I want your Endian column to appear in. So I made Endian part of the default column set for several of the tabs that I use for text files. I happily learned--or relearned, I think--that the default column set is per tab. That's quite convenient in this situation.

Now then, would it be possible to have your column also report the line-ending style of the files? DOS/Unix/Mac?

Thanks again K. Great work. Detroit says Hi.

Kilmatead
Platinum Member
Posts: 4569
Joined: 2008 Sep 30, 06:52
Location: Dublin

Re: Endian: A Plugin for Viewing ASCII/UTF-8/16/32 Encoding & Heuristic Detection of Charactersets

Post by Kilmatead » 2016 Oct 18, 17:28

Brig wrote:Now then, would it be possible to have your column also report the line-ending style of the files? DOS/Unix/Mac?
In the sense that:

Unix / Linux / OS X uses LF (line feed, '\n', 0x0A)
Macs prior to OS X use CR (carriage return, '\r', 0x0D)
Windows / DOS uses CR+LF (carriage return followed by line feed, '\r\n', 0x0D 0x0A)

...then I don't see why not. Would Detroit like this as a secondary column, or as appended text?

I'll see what I can concoct and what kind of unanticipated madness it throws back at me (trust me, there's always some kind of pernicious, sneaky, and underhanded evil lurking within every seemingly innocent idea). It would be too easy to just identify a woman by whether or not she contains DNA from Adam's rib. :wink:

Brig
Silver Member
Posts: 208
Joined: 2002 Aug 05, 16:01
Location: Michigan

Re: Endian: A Plugin for Viewing ASCII/UTF-8/16/32 Encoding & Heuristic Detection of Charactersets

Post by Brig » 2016 Oct 18, 18:36

Kilmatead wrote:Would Detroit like this as a secondary column, or as appended text?
Detroit likes appended text and secondary columns about equally--I can imagine decent arguments for either. Maybe the one that's easier for you is our favorite. We're an accommodating city.

Best of luck with the madness and the evil. And with identifying women (find a better way than DNA analysis).

Kilmatead
Platinum Member
Posts: 4569
Joined: 2008 Sep 30, 06:52
Location: Dublin

Re: Endian: A Plugin for Viewing ASCII/UTF-8/16/32 Encoding & Heuristic Detection of Charactersets

Post by Kilmatead » 2016 Oct 22, 13:48

The original post/download (above) has been updated to version 1.0.0.5 of the plugin, adding a separate EOLN column, amongst other minor improvements/bugfixes.
Kilmatead wrote:there's always some kind of pernicious, sneaky, and underhanded evil lurking within every seemingly innocent idea
And, as it turns out, heuristically detecting End-of-Line styles within a Unicode environment (where the characters-vs-bytes ratio does not necessarily correlate evenly) is fraught with its own curiously anomalous pitfalls (who knew it was this weird?). One gains a fresh respect for the writers of cross platform text-editors and data-comparison programmes.

I won't go into excruciating details, but as with Encoding itself, multi-byte End-of-Line detection (especially in the ever-popular UTF-8) is its own special punishment from the Binary Gods. That said, I made up a simple enough algorithm which should provide at least 98% reliably predictive results (100% in single-byte environments).
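One such pitfall, sketched in Python and using UTF-16LE as the example (an assumption on my part; this is neither the plugin's algorithm nor necessarily the case that caused the most grief): counting raw 0x0A bytes over-counts, because that byte value can also be half of an unrelated character. Both function names are hypothetical.

```python
def count_lf_naive(data: bytes) -> int:
    """Count every 0x0A byte, ignoring the encoding."""
    return data.count(b"\n")


def count_lf_utf16le(data: bytes) -> int:
    """Count LF only where a whole 2-byte code unit equals U+000A (0A 00)."""
    return sum(
        1
        for i in range(0, len(data) - 1, 2)
        if data[i:i + 2] == b"\n\x00"
    )


# U+010A (capital C with dot above) encodes as 0A 01 in UTF-16LE,
# so its low byte looks like a line feed to a byte-wise scan.
sample = "\u010aa\n".encode("utf-16-le")
```

For this sample the naive scan finds two "line feeds" while the code-unit scan finds the single real one, which is the characters-vs-bytes mismatch in miniature.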

As "Mac" has now moved on to OSX, and thus uses the Unix standard of LF instead of CR, I don't imagine many people will find that one too useful, but it's there just in case (there are tales on the internet that Microsoft Office running on OSX creates CSV files using lone-CR breaks, so you never know). Windows/DOS format will identify both CRLF and (the inverted) LFCR forms; Unix/Linux is, as mentioned, just LF.

I have not included any detection of the more obscure break-types NEL, LS, or PS, though they could be added if any truly desperate souls were... truly... desperate... :wink:

The complete changelog is included in the source code, but it won't be of too much interest to most people (does anyone really get excited about FILE_FLAG_SEQUENTIAL_SCAN any more?).

If you have done any custom-modifications to the original Encoding column (via Endian.INI), be aware that it will be automatically overwritten with a new one (to adapt the new options/layout), so you will need to reconfigure it manually - except for the Detection-String extensions, those will remain intact as they are stored separately.

Enjoy.
