What it is:
Gulliver's Travels Wiki wrote:
Traditionally, Lilliputians broke boiled eggs on the larger end; a few generations ago, an Emperor of Lilliput, the Present Emperor's great-grandfather, had decreed that all eggs be broken on the smaller end after his son cut himself breaking the egg on the larger end. The differences between Big-Endians (those who broke their eggs at the larger end) and Little-Endians had given rise to "six rebellions... wherein one Emperor lost his life, and another his crown". The Lilliputian religion says an egg should be broken on the convenient end, which is now interpreted by the Lilliputians as the smaller end.
If you work with international text, you'll often have files that use several different character sets and encodings (ASCII, Unicode, etc). Most people stick with one encoding and leave it at that, but often find that things they've downloaded, or had edited and returned by friends overseas, are no longer recognised by their associated programmes and environments. End-of-Line conventions add to the problem in cross-platform environments, where Windows uses CRLF and Unix/Linux simply LF.
The traditional way to figure out what encoding a text file has is to just load it into a text-editor such as Notepad++ (or whatever) and check there. This is fine for one or two files, but what if you have hundreds of files across multiple projects where you wish you could "just see" (or search for) whatever encoding/character-set/EOLN-style each file actually contains when browsing?
That's what the Endian plugin does for x2 - it provides two custom columns which automatically display those per-file properties: one column for the "Encoding" itself, and a second as an EOLN (End-of-Line) style indicator. The file contents are not changed in any way; these details are sourced heuristically by reading a portion of the data in real-time.
How to use it:
This is a WDX "Content" Plugin, supported as of x2 Pro/Ult v3.1.0.0; all you need to do is download the x2 Plugin Manager (no installation required), and the plugin itself: Endian v1.0.0.5.
Extract the archives and run the plugin manager, then drag-&-drop either the 64-bit plugin (Endian.WDX64) or 32-bit (Endian.WDX) into the window and click "Apply". You will need to restart x2 - close it completely first, using <Alt+X> or File -> Exit.
Once x2 is running, use <Alt+K> to enter the column selection dialog, and scroll all the way to the bottom of the available columns list - there you'll find entries named "Encoding.Endian [X]" and "EOLN.Endian [X]" - just double-click one (or both), and you're done.
(After you've run the plugin for the first time, you can press <F5> or the Reset button in the plugin manager and you'll see the Configuration/Detection-String has been automatically populated for you... I chose a simple default set of common text-file extensions for which the plugin will apply itself, but you may edit them (just double-click the plugin) to remove or add more at your leisure. Remember to restart x2 fully to apply any changes you may have made.)
How it works:
For simplicity's sake, any text file with a BOM (Byte-Order-Mark) is identified directly by it, so UTF-8/16/32 files carrying one are reliable enough to identify this way. Yes, I know BOMs are not 100% infallible, but they're a very strong indicator of content. That said, many UTF-8 files don't have BOMs, so for those (and any other non-BOM files) the first 512 bytes of the file are read into a buffer to be scanned and (hopefully) identified. The plugin uses the uchardet library, which was originally ported from Mozilla's Universal Charset Detector code, commonly employed by many browsers and text editors (Notepad++, Firefox, etc).
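To make the BOM check concrete, here's a minimal Python sketch of the general idea. It is not the plugin's actual code (which isn't shown in this post); the names `BOM_TABLE`, `sniff_bom` and `read_head` are made up for illustration, and the labels simply mirror the ones listed further down.

```python
import codecs

# BOM signatures. UTF-32 must be tested before UTF-16, because the
# UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE).
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_bom(head: bytes):
    """Return an encoding label if `head` starts with a known BOM, else None."""
    for bom, label in BOM_TABLE:
        if head.startswith(bom):
            return label
    return None


def read_head(path: str, size: int = 512) -> bytes:
    """Read only the first `size` bytes of a file (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read(size)
```

Something like `sniff_bom(read_head("notes.txt"))` would give a label for BOM-marked files and `None` for everything else, which is the point where a heuristic detector would have to take over. It also shows one reason BOMs aren't infallible: a UTF-16LE file whose first character is NUL has the same leading bytes as a UTF-32LE BOM.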
Generally, uchardet is quite reliable - though as heuristic identification goes, no system can be 100% accurate, so consider the results you get (for either the Encoding or the End-of-Line type) to be a "strong indicator" of content, rather than received wisdom from any given deity.
For example, BOM-less UTF-8 files may identify as ASCII if their content doesn't actually include any non-ASCII characters. This happens because ASCII is the single-byte subset of UTF-8 (the first 128 code points encode to identical bytes in both), and there's no way to tell the difference from content alone (unless, as mentioned, characters from higher-up the set are actually used). Hence the reason many people prefer to use BOMs to ensure overall consistency in a filebase.
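The ASCII/UTF-8 overlap is easy to demonstrate. The sketch below is only an illustration in Python (the function name `looks_like` is made up, and this is far cruder than what uchardet does): it shows that ASCII-only text produces the same bytes under both encodings, so a content-based check can only say "UTF-8" once a multi-byte character shows up.

```python
def looks_like(data: bytes) -> str:
    """Crude content-only guess: 'ASCII', 'UTF-8', or 'unknown'."""
    if all(b < 0x80 for b in data):
        # Every byte is 7-bit, so this is valid ASCII *and* valid UTF-8;
        # nothing in the content can tell the two apart.
        return "ASCII"
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "unknown"


plain = "Hello, Lilliput!\r\n"

# ASCII-only text encodes to identical bytes in both encodings.
same_bytes = plain.encode("ascii") == plain.encode("utf-8")
```

Here `same_bytes` is `True`, and `looks_like` only reports UTF-8 for something like `"Café".encode("utf-8")`, where the accented character needs two bytes.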
If you encounter a lot of misidentified files, the default buffer-size can be increased (see below) to almost any size (I chose 512 bytes as a default since that passes most of the given tests - but the more data supplied to uchardet, the more accurate it will be, at the expense of speed). I don't recommend anything less than 512 bytes, but anything greater is fine. You can even set an absurdly high number just to make sure all files are read in fully, but depending on the disc-drive type, the columns may take longer to populate. Keep in mind that buffer sizes are in bytes, so if you're dealing with Unicode files (where individual characters may take up to 4 bytes each), it's better to err on the side of caution and over-provision to read as many characters as you deem practical.
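To put rough numbers on the bytes-versus-characters point, here's a small Python sketch (the helper name `chars_in_buffer` is made up for illustration, and it has nothing to do with how the plugin reads files). It counts how many whole characters survive when text in a given encoding is cut off at a fixed byte count.

```python
def chars_in_buffer(text: str, encoding: str, size: int = 512) -> int:
    """Count the whole characters of `text` that fit in the first `size` bytes."""
    data = text.encode(encoding)[:size]
    # errors="ignore" drops a character that was cut in half at the boundary.
    return len(data.decode(encoding, errors="ignore"))


latin = "a" * 1000   # 1 byte per character in UTF-8
kanji = "日" * 1000  # 3 bytes per character in UTF-8

ascii_fit = chars_in_buffer(latin, "utf-8")        # 512 characters
utf16_fit = chars_in_buffer(latin, "utf-16-le")    # 256 characters
utf32_fit = chars_in_buffer(latin, "utf-32-le")    # 128 characters
kanji_fit = chars_in_buffer(kanji, "utf-8")        # 170 characters
```

So the same 512-byte buffer can hold anywhere from 512 characters down to 128, which is why a larger buffer is the safer choice for UTF-16/32 or CJK-heavy files.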
Customisation:
The plugin automatically creates an eponymous configuration file ("Endian.ini") to contain settings. This file is created in the same folder as the WDX itself, so make sure you have write-access to that folder if you want to be able to customise the plugin behaviour.
In the interests of localisation, you may label the columns themselves anything you wish in your local language, if you're more comfortable with that. Just remember that you cannot use any unicode characters in the ColumnLabel=<name> itself (ironically). This is a limitation of plugins in general (they are an odd mix of ASCII and Unicode internal functions), so that's life.
The ShowEndianness key determines whether the column displays LE/BE (Little-Endian/Big-Endian) after the UTF-16/32 label, or BOM after the label for UTF-8 files.
IndicateUTF8SansBOM works the same as the column-label - just replace the text after the '=' with whatever you want it to say for UTF-8 files without BOMs (obviously). Again, I just chose a simple default for simplicity. If you don't want this at all and just find it distracting, remove any text and leave the area after the '=' sign blank. I included this option so that you can search for UTF-8 files which specifically have no BOMs - just use <Ctrl+F> (in x2) and set an advanced rule which checks the Encoding.Endian [X] property for whatever text you use. This search-ability applies to all other character-set encodings as well - any label or partial-label can be searched for, system-wide. (As plugins also work in Nikos' more advanced DeskRule search utility, once you've set up the plugin in x2, it will automatically be available as a property to search in DeskRule too. Handy, that.)
The EOLN column has its own (separate) BufferSize setting, which may be set to 0 (the default) meaning it reads the entire file-contents, or (in case of slow-networks, etc) you may set a specific size (in bytes). While it is not usually necessary to read the whole file just to determine End-of-Line types, it does provide the extra information of how many lines the file has. Technically, the number shown is the number of EOLN's themselves, which is usually 1 less than the number of lines in any actual document (as viewed in an editor). If a file is empty or has no End-of-Line characters in it, <Sans-EOLN> is displayed.
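The End-of-Line counting described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the plugin's code: the function name `eoln_summary` and the "label (count)" output format are made up, and only the three labels and the <Sans-EOLN> text come from this post.

```python
def eoln_summary(data: bytes) -> str:
    """Report the dominant End-of-Line style in `data` and how often it occurs.

    Works on single-byte encodings and UTF-8 only; UTF-16/32 data would
    need decoding first, because CR and LF are padded with NUL bytes there.
    """
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LF not preceded by CR
    cr = data.count(b"\r") - crlf  # CR not followed by LF
    counts = {"Win CRLF": crlf, "Unix LF": lf, "Mac CR": cr}
    label, n = max(counts.items(), key=lambda kv: kv[1])
    if n == 0:
        return "<Sans-EOLN>"
    return f"{label} ({n})"
```

Note that `eoln_summary(b"one\r\ntwo\r\nthree")` counts two EOLNs for a three-line file, which is the "1 less than the number of lines" effect mentioned above.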
The "Win CRLF", "Unix LF", or "Mac CR" labels may be customised as well, if desired.
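If you'd like to check what your own copy of Endian.ini contains before editing it, a short Python sketch like the one below can list every section and key. The exact layout of the file isn't documented in this post, so the section name in the sample text is purely hypothetical, and `list_settings` is a made-up helper; only the key names ColumnLabel, IndicateUTF8SansBOM and BufferSize come from the descriptions above.

```python
import configparser


def list_settings(ini_text: str) -> dict:
    """Parse INI text and return {section: {key: value}} without changing it."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written (no lower-casing)
    parser.read_string(ini_text)
    return {section: dict(parser[section]) for section in parser.sections()}


# Hypothetical sample; the real Endian.ini may use different sections and values.
sample = """\
[Example]
ColumnLabel=Encoding
IndicateUTF8SansBOM=
BufferSize=512
"""

settings = list_settings(sample)
```

A blank value (as with IndicateUTF8SansBOM in the sample) comes back as an empty string. To try it on the real file, read its text first and pass that in; if the file turns out to have keys outside any section, `configparser` will raise an error rather than guess.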
Final Thoughts:
Special thanks to BYVoid for hosting, updating, and generally maintaining the uchardet library itself, as well as the multi-lingual Test Encoding files used for verification purposes. All credit goes to him (and, originally, Mozilla) - I just did the repackaging, they do the heavy-lifting.
And thanks to Nikos for pointing me in the right direction about C-compiler name-mangling - nothing is ever what it seems after it's been processed, be it McDonald's hamburgers or programme source-code.
Available Character-Set Label Results:
- International (Unicode)
- · UTF-8
· UTF-16BE / UTF-16LE
· UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-34121 / X-ISO-10646-UCS-4-21431
- · ISO-8859-6
· WINDOWS-1256
- · ISO-8859-5
· WINDOWS-1251
- · ISO-2022-CN
· BIG5
· EUC-TW
· GB18030
· HZ-GB-2312
- · ISO-8859-1
· ISO-8859-15
· WINDOWS-1252
- · ASCII
- · ISO-8859-3
- · ISO-8859-1
· ISO-8859-15
· WINDOWS-1252
- · ISO-8859-1
· WINDOWS-1252
- · ISO-8859-7
· WINDOWS-1253
- · ISO-8859-8
· WINDOWS-1255
- · ISO-8859-2
· WINDOWS-1250
- · ISO-2022-JP
· SHIFT_JIS
· EUC-JP
- · ISO-2022-KR
· EUC-KR
- · ISO-8859-5
· KOI8-R
· WINDOWS-1251
· MAC-CYRILLIC
· IBM866
· IBM855
- · ISO-8859-1
· ISO-8859-15
· WINDOWS-1252
- · TIS-620
· ISO-8859-11
- · ISO-8859-3
· ISO-8859-9
- · VISCII
· Windows-1258
- · WINDOWS-1252
- · UTF-8