blog: putting advanced text matching to the test

Discussion & Support for xplorer² professional


nikos
Site Admin
Posts: 15804
Joined: 2002 Feb 07, 15:57
Location: UK

blog: putting advanced text matching to the test

Post by nikos »

here's the comment area for today's blog post found at
http://zabkat.com/blog/boyer-moore=sear ... rmance.htm
Kilmatead
Platinum Member
Posts: 4578
Joined: 2008 Sep 30, 06:52
Location: Dublin

Re: blog: putting advanced text matching to the test

Post by Kilmatead »

nikos wrote:not sure how it would work bytewise on unicode or UTF8 encodings
It's solved (at least in one method) by separately hashing the shift-table rows limited to characters in the pattern itself...
It is easy to build a shift table with the number of rows equal to the number of charset characters if the charset is single-byte. But for Unicode charset, such a table would be quite large and inefficient. It is easy to solve this problem if we notice that table rows differ from each other only for the characters that actually appear in the pattern. So we might store table rows corresponding to characters found in the pattern in some fast-access structure, say a hash-table, and keep a separate set of shift values for all other characters that might appear in the string.
As expected, of course, this adds to the pre-processing time (and in some cases might actually work against the BM gains for longer byte-sequences, as the hashes multiply), but it's not so unwieldy as to completely derail the train. It could probably be made more efficient by determining the byte-encoding characteristics ahead of time (if a fixed-width encoding were in use, all the better), but we all know how sticky a wicket that endeavour plays out to be. :D
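To put some flesh on it, here's a rough C++ sketch of that hashed shift-table idea (assuming the text has already been decoded to UTF-32 code points; horspool_u32 and every other name here is my own illustration, not anything lifted from the blog's code):

// Horspool-style search over 32-bit code points. The shift table is a hash
// map keyed only by characters that actually occur in the pattern; every
// other character falls through to the default shift of m.
#include <cstddef>
#include <string>
#include <unordered_map>

std::size_t horspool_u32(const std::u32string& text, const std::u32string& pat)
{
    const std::size_t n = text.size(), m = pat.size();
    if (m == 0 || n < m) return std::u32string::npos;

    // Only characters present in the pattern need an explicit entry.
    std::unordered_map<char32_t, std::size_t> shift;
    for (std::size_t i = 0; i + 1 < m; ++i)
        shift[pat[i]] = m - 1 - i;

    std::size_t pos = 0;
    while (pos + m <= n) {
        std::size_t j = m;
        while (j > 0 && text[pos + j - 1] == pat[j - 1]) --j;
        if (j == 0) return pos;                       // full match found

        char32_t last = text[pos + m - 1];
        auto it = shift.find(last);
        pos += (it != shift.end()) ? it->second : m;  // default shift for "any other" character
    }
    return std::u32string::npos;
}

Characters that never appear in the pattern all take the default shift of m, so the table stays pattern-sized no matter how fat the charset is.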
nikos wrote:Even with SSD media, reading a file from disk is much slower than working with bytes in RAM, so are the gains of Boyer-Moore lost in a sea of disc access overhead?
If you would learn to stop looking at SSD media with the same linear/sequential-based mindset which spinning-rust drives put into your head 20 years ago, you might begin to see the broader picture. Sure, testing a single file at a time is a "kind" of benchmark, but as with any empirical study, a subject of 1 does not a test-case make. Take multiple cores, combine the (essentially) non-linear access of solid-state drives so multiple files can be parsed in parallel with little of the old-style overhead you're used to, and the slurry pit doesn't smell so bad after a while. :shrug:

We will note that this last part is also related to your poor-man's pseudo-multi-threading curio from a few days ago... it pays to remember that Concurrency is not Parallelism (thank you Tony Hoare), so sometimes efficiency is best judged in the eye of the beholder, not strictly subject to the endless catalogues of tabulations economists (and, er, the nightmarish Principia Mathematica itself) are so proud of. :wink:
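And just so the fan-out doesn't sound like vapour, a bare-bones sketch (C++17, one std::async task per file; search_file here is only a stand-in for whatever matcher, Boyer-Moore or otherwise, does the real work):

// Minimal sketch: fan per-file searches out across cores with std::async.
#include <filesystem>
#include <fstream>
#include <future>
#include <iostream>
#include <string>
#include <vector>

static bool search_file(const std::filesystem::path& p, const std::string& pattern)
{
    // Slurp the file and run a placeholder matcher; swap in the real one here.
    std::ifstream in(p, std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    return data.find(pattern) != std::string::npos;
}

int main(int argc, char* argv[])
{
    if (argc < 3) { std::cerr << "usage: psearch PATTERN FILE...\n"; return 1; }
    std::string pattern = argv[1];

    // One asynchronous task per file; the runtime spreads them over the cores.
    std::vector<std::future<bool>> jobs;
    for (int i = 2; i < argc; ++i)
        jobs.push_back(std::async(std::launch::async, search_file,
                                  std::filesystem::path(argv[i]), pattern));

    for (int i = 2; i < argc; ++i)
        std::cout << argv[i] << ": " << (jobs[i - 2].get() ? "match" : "no match") << '\n';
}

Run it as psearch PATTERN file1 file2 ... and the per-file work lands on whichever cores are idle; the SSD really doesn't mind being asked for several files at once.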
nikos
Site Admin
Posts: 15804
Joined: 2002 Feb 07, 15:57
Location: UK

Re: blog: putting advanced text matching to the test

Post by nikos »

that sounds like quantum computing, intriguing but at the end of the day... wtf!?

as long as we stick to von Neumann architecture and xplorer2-type tools, we must stay down to earth and consider lowly things like disc access and what have you

as for Unicode, I suppose you could stick to a 256-entry table and compare bytewise. If the double bytes are identical, the component bytes should also match, no? But anyway, from what I experienced I don't think I'll be bothering with this algorithm any longer
Kilmatead
Platinum Member
Posts: 4578
Joined: 2008 Sep 30, 06:52
Location: Dublin

Re: blog: putting advanced text matching to the test

Post by Kilmatead »

nikos wrote:that sounds like quantum computing, intriguing but in the end of the day... wtf!?
Your man was using an example based on gophers burning old computer manuals in the most efficient way possible with an idea from the guy who created q-sort... how much more dumbed-down can it get than that? Gophers & q-sort! This is not quantum computing. :biggrin:

Besides, I liked him because he said the world is not object-oriented, which makes him an ok dude in my book; all you Stroustrupian gophers are diggin' up the wrong garden. :wink:

Anyway, the point I was trying to stress is that you're still conceiving of SSDs as being "the same as regular drives, just faster", which is not true (and a horrible simplification of how common technology has changed the paradigm), and so very few people build that scalability into their products.

Just the other day I was messing around with x2's Checksum [S] column, and was desperately coughing up blood trying to breathe while x2 chugged away on a group of 20-30GB files. I could have sworn I saw a little nikos running around inside the GUI waving his arms frantically trying to explain how that column was never intended for such large files. It did not surprise me that that little nikos was also holding up a tiny placard poo-pooing the multi-core world we live in. :D

To be fair, I did spend the better part of that same afternoon looking into multi-threaded algorithms for 512-bit hashing and associated collisions, and my brain just kind of melted a bit... so I don't wholly blame you for that. I am, after all, just the gardener around here. You're the professional gopher. :wink:
nikos wrote:I suppose that you could stick to a 256 entry table and compare bytewise. If double bytes are identical, the component bytes should also match, no?
Yeah, the component bytes would match in a presumed byte-consistent universe, but the underlying size would still need to be reconsidered separately for quad-byte UTF32 encoding and variable-byte (anywhere from 1-4) UTF8. So multiple tables would still need to be built and maintained anyway. Messy. wchar_t is not the stabilising ecumenical influence MS keeps trying to hoodwink devs into thinking it is.
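That said, a plain 256-entry table run over the raw UTF-8 bytes does behave itself, provided both text and pattern are valid UTF-8 in the same normalisation; roughly like this (my own toy code, nothing from x2):

// nikos's 256-entry idea: run a Horspool-style scan over the raw UTF-8 bytes.
// Component bytes of identical code points do match, so the hits are the same
// as a code-point search would find, as long as text and pattern share one encoding.
#include <cstddef>
#include <cstring>
#include <string>

std::size_t horspool_bytes(const std::string& text, const std::string& pat)
{
    const std::size_t n = text.size(), m = pat.size();
    if (m == 0 || n < m) return std::string::npos;

    // Fixed 256-entry shift table: default is the whole pattern length.
    std::size_t shift[256];
    for (std::size_t c = 0; c < 256; ++c) shift[c] = m;
    for (std::size_t i = 0; i + 1 < m; ++i)
        shift[static_cast<unsigned char>(pat[i])] = m - 1 - i;

    std::size_t pos = 0;
    while (pos + m <= n) {
        // Compare the whole window front-to-back (memcmp), then shift on the last byte.
        if (std::memcmp(text.data() + pos, pat.data(), m) == 0) return pos;
        pos += shift[static_cast<unsigned char>(text[pos + m - 1])];
    }
    return std::string::npos;
}

The shifts are per byte though, so for Unicode-heavy patterns the effective skips are smaller than the table makes them look, and UTF32 or wchar_t text would need its own table built separately, which was rather my point.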