Sorting and Comparing the Excel Way: Code Pages and Collating Sequences – Revised

Comparing 2 values should be easy, but …

Comparing values to see which is greater should be straightforward: you just use the less-than comparison operator: IF(A1<A2,TRUE,FALSE).
So 1 is less than 2 and A is less than B.
And when things are sorted ascending each value is less than the following value.

This all seems obvious and easy, but wait:

Is “AA” less than “AAA”?
Is TRUE less than A?
Is 99 less than B?
Is “11″ less than “2″?
What about special characters (!ӣ$%^&*()@#~ etc) ?
And is “a” less than “A”?
And then there are all those accented characters, how should they be sorted?
And what about Chinese, Japanese, Cyrillic, Arabic, Hebrew … ?
When sorting street names should St. James St be next to Saint James Street?

Code Pages enable computers to represent characters

Inside the computer everything works in binary so the computer designers have to make a choice about how to represent characters in binary, and preferably get everyone to agree about the choices so that one computer can understand another. These choices are known as Code Pages. A single-byte code page has 256 characters corresponding to the 256 binary numbers that can be held in a Byte, and in the early days each country/language tended to have their own code page to represent the character set that they used. Of course that does not work very well when you want to send an email from one country to another, so we started to invent multilingual code pages (500, 850, ISO 8859 etc). And Unicode was developed to handle languages such as Japanese that need more than 256 characters anyway.

Today Windows uses multilingual Code Page 1252 for Latin-1 based (Western European) languages. This is similar to ISO 8859 (but not identical!).

An Excel Formula does Binary Comparison of characters

OK now we have a code page where each character is represented by a binary number. So we can use the binary numbers to compare the characters. This is the binary comparison that an Excel formula like IF(A1<A2,TRUE,FALSE) does. You can test this for yourself:

In column A put numbers 1 to 255 in succeeding rows. Then in column B enter =CHAR(A1) and fill down. This will give you in column B the character equivalent of the number in column A. Then in column C enter =IF(A1<A2, TRUE,FALSE) and fill down. The result in C should always be TRUE, and it is, but ihem points out that more complicated rules seem to apply for some strings of more than one character containing accented characters.

Excel SORT does NOT use Binary Comparison

Now copy the characters in column B and use paste special values to put them in column E. Add the numbers 0 to 9 and TRUE and FALSE at the bottom. Then use Excel’s SORT to sort all the values in column E:

The result of the Excel sort is a very different sequence:

Numbers<Special characters<Numbers as Text<lowercase a followed by upper case A followed by accented a followed by b then B<followed by False followed by True.

Note that the German ß is treated as an accented S.
Blank/Empty cells are always sorted last in both ascending and descending sort.
Also note that Excel SORT ignores apostrophes and hyphens, except that a string that is identical to another string except that it contains hyphens will be sorted after its hyphen-less twin.

This kind of comparison sequence is known as a Collating Sequence. Each combination of language and country (a locale such as EN-US) can have a different collating sequence even when using the same code page. Some collating sequences have to treat 2 characters as if they were one (digraphs – for example Traditional Spanish ch, ll, rr).

Excel’s SORT uses a “stable” sorting algorithm. This means that if there multiple identical items to be sorted then Excel’s SORT will preserve the original order of the sorted items.

Duplicating Excel’s collating sequence in your own QuickSort routine.

If you need to use your own sort routine, for instance to sort arrays using QuickSort, it is difficult to use the same collating sequence as Excel’s SORT.

Excel VBA has 2 different comparison methods (Binary and Text) available through StrComp() or Option Compare. But using VBA Text compare just does a case-insensitive collating compare, not a collating compare. In fact to do a case-sensitive collating comparison using VBA you need to build your own collating sequence table and do a byte-by-byte comparison using lookups in your collating sequence table. Or you could probably use the Windows API function CompareStringW. Or you could dump the array to a worksheet, use Excel Sort and then read the result back into an array.

C++ has string comparison methods (wcscoll(0 and _wcsicoll() for Unicode strings) that will use the collating sequence of the current Locale. I used these to build the COMPARE() function and all the LOOKUP and SORT functions in the FastExcel V3 function library.

Note that the usual QuckSort algorithm is NOT a “stable” sort: identical items will often be rearranged from their original sequence.

When do you need Collating Quicksort?

Most of the time its not a problem to use a binary compare quicksort (and its a lot faster). Some exceptions that won’t work well are:

  • Comparing your own sorted array with an Excel sorted range.
  • Using Excel’s MATCH and LOOKUP functions with approximate match on your own sorted array.
  • Case-sensitive sorting.

So have you ever needed to use a collating Quicksort?

About these ads
This entry was posted in Lookups, UDF, VBA, XLL. Bookmark the permalink.

4 Responses to Sorting and Comparing the Excel Way: Code Pages and Collating Sequences – Revised

  1. lhem says:

    Nice summary. Text ordering can be a tricky and I must admit I don’t fully understand it myself. However I think Excel is more sophisticated than one might expect and is not just doing binary compares in formulas. For example accented characters are sorted in compination with other characters eg “é”>”e” and “éa”<"eb" both return TRUE. VBA with Option Compare Text handles these cases in the same way although there are minor differences for special characters eg superscript numbers.

    Unicode wide format has been the underlying text format for many years in Office and on my system at least follows closely the ISO standard: http://en.wikipedia.org/wiki/European_ordering_rules (Levels 1 and 2 in formulas and levels 3 and 4 in sorting) with minor differences for special characters. I don't know how this is affected by different locales though.

    • fastexcel says:

      @lhem – Thanks: it is strange that Excel formulas use binary comparison for single characters but not for some cases of multiple characters – I don’t understand the logic behind that. And you are right that StrComp Text Compare does (more or less) upper case collate comparisons.

  2. gabiravhon says:

    How do rank() works?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s