Wednesday, November 12, 2008

Windows: String Comparison and Sorting

The most common sorting style is culture-insensitive code point sorting.  This type of sorting ignores the ordering rules of any particular culture, but it is the fastest sort order.

For example:

Character 'E' has code point 0x45 and character 'a' has code point 0x61. If we compare or sort these characters by code point, 'E' comes before 'a'.  This contradicts the usual expectation that 'a' should sort before 'E'.
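As a minimal sketch of the point above, the following console program compares the two characters ordinally and then ignoring case. It assumes a Delphi or Free Pascal console project and uses only SysUtils routines (CompareStr, CompareText); the program name is made up for illustration.

```pascal
program CodePointOrderSketch; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

uses
  SysUtils;

begin
  // Code points: 'E' = $45, 'a' = $61.
  Writeln('Ord(E) = $', IntToHex(Ord('E'), 2));
  Writeln('Ord(a) = $', IntToHex(Ord('a'), 2));

  // CompareStr is an ordinal, case-sensitive comparison.
  // A negative result means the first string sorts first.
  if CompareStr('E', 'a') < 0 then
    Writeln('Ordinal: E sorts before a');

  // CompareText ignores case, so 'a' sorts before 'E'.
  if CompareText('a', 'E') < 0 then
    Writeln('Case-insensitive: a sorts before E');
end.
```

Ignoring case is only one part of culture-aware ordering; it fixes this ASCII example but does nothing for the Chinese case discussed next.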

Another example is Chinese, where characters are conventionally ordered by pronunciation (phonetics) or by number of pen strokes.  Sorting Chinese characters by code point doesn't make much sense.

The following chart shows some Chinese characters sorted by Unicode code point, which is culture insensitive:

Ideograph  Hanyu Pinyin (phonetic)  Stroke count  Unicode code point
一         yi                       1             0x4E00
丁         ding                     2             0x4E01
上         shang                    3             0x4E0A
且         qie                      5             0x4E14
人         ren                      2             0x4EBA

We can use the Windows API function CompareString to perform culture-sensitive comparisons for sorting.

var L: LCID;
    R: Integer;
    Str1, Str2: string;
begin
  ...
  // CompareString returns 1 (Str1 < Str2), 2 (equal) or 3 (Str1 > Str2).
  // A result of 0 means the call failed.

  // For Stroke Count Order
  L := MAKELCID(MAKELANGID(LANG_CHINESE, SUBLANG_CHINESE_SIMPLIFIED), SORT_CHINESE_PRC);
  R := CompareString(L, 0, PChar(Str1), Length(Str1), PChar(Str2), Length(Str2));

  // For Phonetic Order
  L := MAKELCID(MAKELANGID(LANG_CHINESE, SUBLANG_CHINESE_SIMPLIFIED), SORT_CHINESE_PRCP);
  R := CompareString(L, 0, PChar(Str1), Length(Str1), PChar(Str2), Length(Str2));
  ...

  // For Ordinal Comparison (Code point comparison, culture insensitive)
  // StrComp returns a negative, zero or positive value instead of 1, 2 or 3.
  R := StrComp(PChar(Str1), PChar(Str2));
end;
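To use CompareString as a sort callback, its 1/2/3 result has to be mapped to the negative/zero/positive convention that Delphi's sort routines expect. The sketch below is one possible way to do that, assuming a Unicode-enabled Delphi (2009 or later, where string is UnicodeString) running on Windows. The names SortLocale, CompareWithLocale and LocaleSortCompare are hypothetical helpers, not part of any library.

```pascal
program LocaleSortSketch; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils, Classes;

const
  // Documented CompareString result for "equal"; redeclared locally so the
  // sketch does not depend on the constant being present in the Windows unit.
  CSTR_EQUAL = 2;

var
  // Hypothetical global: TStringList.CustomSort callbacks take no user data.
  SortLocale: LCID;

// Maps CompareString's 1/2/3 result to -1/0/1.
function CompareWithLocale(const A, B: string): Integer;
var
  R: Integer;
begin
  R := CompareString(SortLocale, 0, PChar(A), Length(A), PChar(B), Length(B));
  if R = 0 then
    RaiseLastOSError; // 0 signals failure, e.g. an invalid locale
  Result := R - CSTR_EQUAL;
end;

function LocaleSortCompare(List: TStringList; Index1, Index2: Integer): Integer;
begin
  Result := CompareWithLocale(List[Index1], List[Index2]);
end;

var
  Items: TStringList;
  I: Integer;
begin
  // Stroke count order for Simplified Chinese, as in the snippet above.
  SortLocale := MAKELCID(MAKELANGID(LANG_CHINESE, SUBLANG_CHINESE_SIMPLIFIED),
    SORT_CHINESE_PRC);
  Items := TStringList.Create;
  try
    // The five sample characters, written as code points to avoid
    // depending on the encoding of the source file.
    Items.Add(#$4E14);
    Items.Add(#$4EBA);
    Items.Add(#$4E00);
    Items.Add(#$4E0A);
    Items.Add(#$4E01);
    Items.CustomSort(LocaleSortCompare);
    // Print code points rather than glyphs; consoles may not show CJK text.
    for I := 0 to Items.Count - 1 do
      Writeln('0x', IntToHex(Ord(Items[I][1]), 4));
  finally
    Items.Free;
  end;
end.
```

Switching SORT_CHINESE_PRC to SORT_CHINESE_PRCP in the MAKELCID call selects the phonetic order instead. The global locale variable is a shortcut for brevity; a real application would probably pass the locale some other way.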

Stroke Count Order:

Ideograph  Hanyu Pinyin (phonetic)  Stroke count  Unicode code point
一         yi                       1             0x4E00
丁         ding                     2             0x4E01
人         ren                      2             0x4EBA
上         shang                    3             0x4E0A
且         qie                      5             0x4E14

Phonetic Order:

Ideograph  Hanyu Pinyin (phonetic)  Stroke count  Unicode code point
丁         ding                     2             0x4E01
且         qie                      5             0x4E14
人         ren                      2             0x4EBA
上         shang                    3             0x4E0A
一         yi                       1             0x4E00

References:

  1. Sort Order Identifiers
  2. Globalization Step-by-Step
  3. Where is the locale? "Its Invariant." In where?
  4. Comparison confusion: INVARIANT vs. ORDINAL
