Natural sort order

(1) By anonymous on 2020-03-27 14:34:40 [link] [source]

Is there any way to order by "natural sort order" in sqlite?

i.e. z2, z11 instead of the lexicographical z11, z2.

https://en.wikipedia.org/wiki/Natural_sort_order

Thanks, Hamish

(2) By Richard Hipp (drh) on 2020-03-27 15:25:49 in reply to 1 updated by 2.1 [link] [source]

How do PostgreSQL, MySQL, SQL Server, and Oracle do this?

(2.1) By Richard Hipp (drh) on 2020-03-27 15:27:50 edited from 2.0 in reply to 1 [link] [source]

How do PostgreSQL, MySQL, SQL Server, and Oracle do this?

Edit → I mean: What syntax do they use, not how they implement it.

(3) By Stephan Beal (stephan) on 2020-03-27 15:38:00 in reply to 2.1 [link] [source]

Apparently PG doesn't do this natively, but there are a several solutions floating around:

https://github.com/Zeleo/pg_natural_sort_order is a PG extension
https://stackoverflow.com/a/48809832 uses a PGSQL function

Those both use a function, so that the caller does ORDER BY naturalsort(...).

This post:

https://stackoverflow.com/questions/12965463/humanized-or-natural-number-sorting-of-mixed-word-and-number-strings

Has several PG-only takes on the problem.

(4) By Richard Hipp (drh) on 2020-03-27 16:03:50 in reply to 3 updated by 4.1 [link] [source]

Two questions:

First, the PG solutions all seem to define a new SQL function that returns
a sort key.  This is fine, but it seems more "natural" to me to create a
new collating sequence.  So the PG solution is:

~~~
    SELECT * FROM user ORDER BY naturalsort(name);
~~~

But wouldn't it be nicer to do:

~~~
    SELECT * FROM user ORDER BY name COLLATE natural;
~~~

Second thing:
I prefer the idea of "dictionary" sort order.
Dictionary order does the same magic with integers, but
also mixes case differences so that "a" comes after "A" but
before "B" rather than after "Z".  It seems like this is
even more natural that just putting numeric parts in numeric
order, does it not?

================================
| BINARY | &nbsp; | DICTIONARY |
| B41    | &nbsp; | a5         |
| C419   | &nbsp; | B41        |
| C421   | &nbsp; | C419       |
| a5     | &nbsp; | c420       |
| c420   | &nbsp; | C421       |
| d1234  | &nbsp; | d234       |
| d34    | &nbsp; | d1234      |
================================

If we add something like this to SQLite, I would want it to be
full "dictionary" order not "natural" order.  In other words, I
would want it to take case into account as a tie-breaker.

(5) By ddevienne on 2020-03-27 16:40:32 in reply to 4.0 updated by 5.1 [link] [source]

FWIW, we've had a `collate NATURAL_ORDER` in our app for years.  
Windows Explorer uses *Natural Ordering*, and many users as used  
to it, and request it of software vendors.

We support both case-sensitive and case-insensitive versions,  
but not your *Dictionary Order*, which is a mix of both. Interesting.  
I never considered that particular ordering.

(5.1) By ddevienne on 2020-03-27 16:42:14 edited from 5.0 in reply to 4.0 [link] [source]

FWIW, we've had a collate NATURAL_ORDER in our app for years.
Windows Explorer uses Natural Ordering, and many users as used
to it, and request it of software vendors.

We support both case-sensitive and case-insensitive versions,
but not your Dictionary Order, which is a mix of both. Interesting.
I never considered that particular ordering.

PS: naive implementations convert to integers, and risk overflow,
while it's relatively easy to support arbitrarily long integers.

(6) By Hamish Allan (hatfinch) on 2020-03-27 16:45:07 in reply to 4.0 [link] [source]

Both of these seem like a good idea to me.

On the naming of "dictionary sort order", though, I couldn't find any definition by that name in a brief online search, and the Wikipedia entry for lexicographical ordering lists it as a synonym, which might be confusing.

(4.1) By Richard Hipp (drh) on 2020-03-27 16:51:16 edited from 4.0 in reply to 3 [link] [source]

Two questions:

First, the PG solutions all seem to define a new SQL function that returns a sort key. This is fine, but it seems more "natural" to me to create a new collating sequence. So the PG solution is:

    SELECT * FROM user ORDER BY naturalsort(name);

But wouldn't it be nicer to do:

    SELECT * FROM user ORDER BY name COLLATE natural;

Second thing: I prefer the idea of "dictionary" sort order. Dictionary order does the same magic with integers, but also mixes case differences so that "a" comes after "A" but before "B" rather than after "Z". It seems like this is even more natural that just putting numeric parts in numeric order, does it not?

BINARY		DICTIONARY
B41		a5
C419		B41
C421		C419
a5		c420
c420		C421
d1234		d34
d234		d234
d34		d1234

If we add something like this to SQLite, I would want it to be full "dictionary" order not "natural" order. In other words, I would want it to take case into account as a tie-breaker.

(7) By Gerry Snyder (GSnyder) on 2020-03-27 17:39:35 in reply to 4.1 [link] [source]

I don't suppose that "dictionary" order could include accented letters too, so that é would sort next to e?

(8) By Simon Slavin (slavin) on 2020-03-27 18:12:13 in reply to 1 [link] [source]

You can define your own collation function which does whatever you like.

https://sqlite.org/c3ref/create_collation.html

Call it, presumably, 'NORMAL'.

If you do this, you might want to make sure it can cope with triplets like this:

book11page8
book3page4
book11page17

floor2room16b
floor1room7
floor2room16

16.52.124.48
82.42.52.145
127.0.0.1

and not just digits at the end of the string.

(9) By Andreas Kupries (andreas-kupries) on 2020-03-27 18:25:06 in reply to 5.1 [link] [source]

The Tcl implementation of dictionary sorting does not convert to integers. The link references the DictionaryCompare function called by the higher framework to compare two strings.

(10) By Richard Hipp (drh) on 2020-03-27 18:50:10 in reply to 8 [link] [source]

Why the name "NORMAL"?

Note that "NATURAL" is also not a good name since "NATURAL" is a keyword in SQL - specifically as part of the "NATURAL JOIN" syntax. Suggestions for alternative names?

(11) By Richard Hipp (drh) on 2020-03-27 19:04:16 in reply to 1 [link] [source]

The code below implements a run-time loadable extension for SQLite that implements a "MIXED" collating sequence that does what you want, I think. I hereby toss the code over the wall to the community for testing, criticism, and discussion. I'm particularly interested in suggests for a name that is better than "MIXED".

To compile following the instructions on compiling loadable extensions found in the documentation.

To load the extension from the [command-line shell] put the shared library that you compiled in the working directory and enter:

.load ./mixedcoll

To use it, just add "COLLATE mixed" after the ORDER BY clause terms that you want sorted in "natural" order. Or add COLLATE mixed to terms on your indexes, or on the columns of your table definitions.

The Code:

/*
** 2020-03-27
**
** The author disclaims copyright to this source code.  In place of
** a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
******************************************************************************
**
** Implement a collating sequence that sorts (unsigned) integers embedded 
** in the middle of text in numeric order.
*/
#include "sqlite3ext.h"
SQLITE_EXTENSION_INIT1
#include <ctype.h>

/*
** Collating function that compares text byte-by-byte but compares
** digits in numeric order.
*/
static int mixedCollFunc(
  void *notUsed,
  int nKey1, const void *pKey1,
  int nKey2, const void *pKey2
){
  const unsigned char *zA = (const unsigned char*)pKey1;
  const unsigned char *zB = (const unsigned char*)pKey2;
  int i, x;
  for(i=0; i<nKey1 && i<nKey2; i++){
    x = zA[i] - zB[i];
    if( x!=0 ){
      int j;
      for(j=i; j<nKey1 && j<nKey2 && isdigit(zA[j]) && isdigit(zB[j]); j++){}
      if( j<nKey1 && isdigit(zA[j]) ){
        return +1;
      }else if( j<nKey2 && isdigit(zB[j]) ){
        return -1;
      }else{
        return x;
      }
    }
  }
  return nKey1 - nKey2;
}


#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_mixedcoll_init(
  sqlite3 *db, 
  char **pzErrMsg, 
  const sqlite3_api_routines *pApi
){
  int rc = SQLITE_OK;
  SQLITE_EXTENSION_INIT2(pApi);
  (void)pzErrMsg;  /* Unused parameter */
  rc = sqlite3_create_collation(db, "mixed", SQLITE_UTF8, 0, mixedCollFunc);
  return rc;
}

(12) By Richard Hipp (drh) on 2020-03-29 00:55:26 in reply to 1 [link] [source]

Here is a revised "natural sort" collating function. The name is now changed to "NATSORT", because I found a PHP and a Python module that also use that name.

The previous implementation sorted embedded unsigned integers in numeric order and last. This implementation sorts embedded unsigned integers in numeric order, but otherwise in ASCII order.

Please test and report back if this collating sequence is useful. If it is, perhaps it might land in the next release of SQLite.

The Code

/*
** 2020-03-27
**
** The author disclaims copyright to this source code.  In place of
** a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
******************************************************************************
**
** Implement a collating sequence that sorts embedded unsigned integers
** in numeric order.
*/
#include "sqlite3ext.h"
SQLITE_EXTENSION_INIT1
#include <ctype.h>

/*
** Collating function that compares text byte-by-byte but compares
** digits in numeric order.
*/
static int natSortCollFunc(
  void *notUsed,
  int nKey1, const void *pKey1,
  int nKey2, const void *pKey2
){
  const unsigned char *zA = (const unsigned char*)pKey1;
  const unsigned char *zB = (const unsigned char*)pKey2;
  int i, x;
  for(i=0; i<nKey1 && i<nKey2; i++){
    x = zA[i] - zB[i];
    if( x!=0 ){
      int j;
      for(j=i; j<nKey1 && j<nKey2 && isdigit(zA[j]) && isdigit(zB[j]); j++){}
      if( i==j && (i==0 || !isdigit(zA[i-1])) ){
        return x;
      }else if( j<nKey1 && isdigit(zA[j]) ){
        return +1;
      }else if( j<nKey2 && isdigit(zB[j]) ){
        return -1;
      }else{
        return x;
      }
    }
  }
  return nKey1 - nKey2;
}


#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_natsort_init(
  sqlite3 *db, 
  char **pzErrMsg, 
  const sqlite3_api_routines *pApi
){
  int rc = SQLITE_OK;
  SQLITE_EXTENSION_INIT2(pApi);
  (void)pzErrMsg;  /* Unused parameter */
  rc = sqlite3_create_collation(db, "natsort", SQLITE_UTF8, 0, natSortCollFunc);
  return rc;
}

(13) By Richard Hipp (drh) on 2020-03-29 01:03:55 in reply to 12 [link] [source]

The two functions above might return surprising results for integers with leading zeros.

    SELECT 'x123y' < 'x00005y' COLLATE natsort;

The above returns true. Should it? Or should the answer be false? What should this return:

    SELECT 'x123y' == 'x0123y' COLLATE natsort;

(14) By Clemens Ladisch (cladisch) on 2020-03-29 07:53:31 in reply to 13 [link] [source]

It is definitely more "natural" to ignore leading zeros.

https://natsort.readthedocs.io/en/master/howitworks.html talks about the design decisions made in the Python module. While I doubt that you'd want to support all those features, at least ignoring case would be useful.

(15) By anonymous on 2020-03-29 09:35:18 in reply to 14 [link] [source]

"It is definitely more "natural" to ignore leading zeros."

Unless you are sorting US ZIP codes, for example.

(16) By Richard Hipp (drh) on 2020-03-29 20:44:31 in reply to 1 [link] [source]

The lastest version of NATSORT (included below) sorts unsigned numbers in numeric order, correctly ignoring leading zeros.

Please try out the new code. Report any issues you find, and also report whether or not you are likely to find this collating sequence useful.

The code:

/*
** 2020-03-27
**
** The author disclaims copyright to this source code.  In place of
** a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
******************************************************************************
**
** Implement a collating sequence that sorts embedded unsigned integers
** in numeric order.
*/
#include "sqlite3ext.h"
SQLITE_EXTENSION_INIT1
#include <ctype.h>
#include <string.h>

/*
** Collating function that compares text byte-by-byte but compares
** digits in numeric order.
*/
static int natSortCollFunc(
  void *notUsed,
  int nKey1, const void *pKey1,
  int nKey2, const void *pKey2
){
  const unsigned char *zA = (const unsigned char*)pKey1;
  const unsigned char *zB = (const unsigned char*)pKey2;
  int i=0, j=0, x;
  while( i<nKey1 && j<nKey2 ){
    x = zA[i] - zB[j];
    if( isdigit(zA[i]) ){
      int k;
      if( !isdigit(zB[j]) ) return x;
      while( zA[i]=='0' && i<nKey1 ){ i++; }
      while( zB[j]=='0' && j<nKey2 ){ j++; }
      k = 0;
      while( i+k<nKey1 && isdigit(zA[i+k]) && j+k<nKey2 && isdigit(zB[j+k]) ){
        k++;
      }
      if( i+k<nKey1 && isdigit(zA[i+k]) ){
        return +1;
      }else if( j+k<nKey2 && isdigit(zB[j+k]) ){
        return -1;
      }else{
        x = memcmp(zA+i, zB+j, k);
        if( x ) return x;
        i += k;
        j += k;
      }
    }else if( x ){
      return x;
    }else{
      i++;
      j++;
    }
  }
  return (nKey1 - i) - (nKey2 - j);
}


#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_natsort_init(
  sqlite3 *db, 
  char **pzErrMsg, 
  const sqlite3_api_routines *pApi
){
  int rc = SQLITE_OK;
  SQLITE_EXTENSION_INIT2(pApi);
  (void)pzErrMsg;  /* Unused parameter */
  rc = sqlite3_create_collation(db, "natsort", SQLITE_UTF8, 0, natSortCollFunc);
  return rc;
}

(17) By Mark Lawrence (mark) on 2020-03-29 21:45:18 in reply to 16 [link] [source]

I haven't tried out the code (yet) but I would definately use this collation as a replacement where I am currently relying on a user-defined function.

(18) By ddevienne on 2020-03-30 08:20:11 in reply to 16 [link] [source]

FWIW, my impl uses the trick below, to avoid any isdigit() in many cases. --DD

const int diff = l_char - r_char;
if (-9 > diff || diff > 9) {
  // *both* chars cannot be digits, so normal lexicographical compare,
  // even for the mixed char-digit or digit-char cases.
  return diff;
}

Our impls are essentially the same (including skipping leading zeros),
I just didn't think of using memcmp() myself, I reloop on the chars.

(19) By Neal (_Neal_) on 2020-04-02 21:09:02 in reply to 10 [link] [source]

Suggestions for alternative names?

GNU coreutils (sort and ls) have a similar -v option called "Version Sort"(https://www.gnu.org/software/coreutils/manual/html_node/Version-sort-overview.html)

(20) By Peter da Silva (resuna) on 2020-04-03 14:14:19 in reply to 7 [link] [source]

That opens up Unicode collation and localization and that's a whole new can of worms.

(21) By Peter da Silva (resuna) on 2020-04-03 14:16:06 in reply to 15 [source]

You wouldn't use this sort on zip codes, would you?

(22) By Hamish Allan (hatfinch) on 2020-04-03 14:35:33 in reply to 21 [link] [source]

Wouldn't it work anyway, because you'd only be comparing numbers of the same fixed length? (xxxxx or xxxxx-yyyy)

(23) By Dave B (DaveBlake) on 2020-05-06 17:33:57 in reply to 16 [link] [source]

Richard having natural sorting of this kind available as a standard collation in SQLite would be useful.

Actually I am working on a bespoke collation that handles number naturally like this, is case insensitive and does some simple accent folding so 'a', 'á' and 'À' are with 'A'. However the result is slower than I would like, and I think that is partly because of the way I am converting from the const void data of the collation callback function into wstring before processing compare even though most of the data is simple ascii.Need to give that more thought.

(24) By TripeHound on 2020-05-06 21:20:42 in reply to 23 [link] [source]

simple accent folding so 'a', 'á' and 'À' are with 'A'

Just be aware that in some languages it is not (always) correct to simply "fold" (i.e. ignore) accents when sorting (for example, in Swedish – apparently – the letters 'å', 'ä', and 'ö' are considered distinct letters, not 'a' and 'o' with diacritics). See this answer on StackExchange or Diacritic on Wikipedia.

(25) By Dave B (DaveBlake) on 2020-05-06 21:47:57 in reply to 24 [link] [source]

Thanks TripeHound, I am aware that "accent folding" is not the same as full ICU locale specific collation. However the need is to reproduce the same collation using SQLlite as utf8_general_ci in MySQL (now called utf8mb3_general_ci in MariaDB). That is just simple accent folding.

(26) By anonymous on 2021-10-07 16:44:11 in reply to 16 [link] [source]

Hi Richard, I'm using this as a loadable extension and it's working great. Is there any update on whether this will ship in SQLite without having to use a loadable module any time soon?

(27) By ddevienne on 2022-02-07 13:02:36 in reply to 16 [link] [source]

Resurrecting this old thread, as I've found an interesting tidbit on NatSort.

Richard's code considers a010 and a0010 equal, which is logical, but
which implies there is no relative order between names that differ only in the number of leading zeros.
Which makes the order thus non-deterministic for those specifically.

Completely ignoring leading-zeros, when the trailing number differ, is OK.
But when they are equal, and there's no remaining chars (or they are equal)
I've discovered today our impl orders based on the leading-zero prefix.

Whether or not logically equal names that differ only in leading zeros should be equal or not is debatable.
I just wanted to point out this particular behavior, in case it matters to someone.

NatSort's less-than semantic is obvious and intuitive.
But it's equals semantic, not so much...

Therefore I'm interested in hearing POVs on this forum.

(28) By Larry Brasfield (larrybr) on 2022-02-07 13:22:19 in reply to 27 [link] [source]

It is better for logically equivalent things which are not identical to sort on the attributes that do not bear on logical equivalence. Often, it does not matter. But sometimes it does, and then there is no good substitute beyond re-implementing. Because the test for not-logical difference is at the end of the if-then-elseif chain, having the extra comparison(s) should take little time.

(29) By anonymous on 2022-02-07 15:20:56 in reply to 27 [link] [source]

Treat the number of leading zeroes of a group of digits the same way as the case of an alphabetic character: a tiebreaker to be considered only when the entire strings compare equal otherwise.

(30) By Bill Wade (billwade) on 2022-02-08 03:40:34 in reply to 27 [link] [source]

I've found it convenient (relatively easy to explain) to use something very close to memcmp as a final, deterministic, tie-breaker.

Deciding what to pass to memcmp() might still be up to debate (perhaps use unicode normalization first, and pass the normalized strings to memcmp, and if that results in a tie, you pass the raw inputs to memcmp as a final test).

I would be inclined to say that for strings, leading zero's break equality, even when upper/lower don't. Windows does natsort, but two strings with different numbers of zero's refer to different files, not aliases for the same file.

(31) By MBL (UserMBL) on 2022-11-05 11:00:18 in reply to 16 [link] [source]

I would like to pick up this thread to further discuss sorting behavior.

I could not find any restriction to combine COLLATE phrases and my test shows also no error in repeating the COLLATE phrase; however, it looks like that only the last in sequence wins and the others are ignored.

My intention with the following test was to have a natsorted order which is independent on the case.

Consider the following outcome:

with user( name ) as (
     values('BINARY'),('DICTIONARY'),('B412'),('a5'),('C4190'),('B41'),('C421'),('C419'),('1'),('2'),('10')
          ,('a5'),('c420'),('c420'),('C421'),('d1234'),('d345'),('d234'),('d234'),('d34'),('D1234')
), natsorted as ( SELECT 'natsort' as sorted, name FROM user ORDER BY name COLLATE natsort
), nocased as ( SELECT 'nocase' as sorted, name FROM user ORDER BY name COLLATE nocase
), natsort_nocase as ( SELECT 'natsort nocase' as sorted, name FROM user ORDER BY name COLLATE natsort COLLATE nocase
), nocase_natsort as ( SELECT 'nocase natsort' as sorted, name FROM user ORDER BY name COLLATE nocase COLLATE natsort
)
select 'unordered' as sorted, name from user
union all
select * from natsorted
union all
select * from nocased
union all
select * from natsort_nocase
union all
select * from nocase_natsort;

For easier comparison I used a .mode of columned view, which the CLI does not support (yet?) with a block editor:

SQLite3x\Win32\Debug>sqlite3.exe ..\..\sandbox.SQB
SQLite version 3.39.4 2022-09-29 15:55:41
Enter ".help" for usage hints.
sqlite> .load sqlite3.dll
sqlite> select version();
basix v3.39.4x.5362.100+expat_2.1.0 @ Nov  4 2022-15:09:44
sqlite> .mode box
sqlite> select * from Collation_Test;
┌────────────────┬────────────┐   ┌────────────────┬────────────┐   ┌────────────────┬────────────┐   ┌────────────────┬────────────┐   ┌────────────────┬────────────┐
│     sorted     │    name    │   │     sorted     │    name    │   │     sorted     │    name    │   │     sorted     │    name    │   │     sorted     │    name    │
├────────────────┼────────────┤   ├────────────────┼────────────┤   ├────────────────┼────────────┤   ├────────────────┼────────────┤   ├────────────────┼────────────┤
│ unordered      │ BINARY     │   │ natsort        │ 1          │   │ nocase         │ 1          │   │ natsort nocase │ 1          │   │ nocase natsort │ 1          │
│ unordered      │ DICTIONARY │   │ natsort        │ 2          │   │ nocase         │ 10         │   │ natsort nocase │ 10         │   │ nocase natsort │ 2          │
│ unordered      │ B412       │   │ natsort        │ 10         │   │ nocase         │ 2          │   │ natsort nocase │ 2          │   │ nocase natsort │ 10         │
│ unordered      │ a5         │   │ natsort        │ B41        │   │ nocase         │ a5         │   │ natsort nocase │ a5         │   │ nocase natsort │ B41        │
│ unordered      │ C4190      │   │ natsort        │ B412       │   │ nocase         │ a5         │   │ natsort nocase │ a5         │   │ nocase natsort │ B412       │
│ unordered      │ B41        │   │ natsort        │ BINARY     │   │ nocase         │ B41        │   │ natsort nocase │ B41        │   │ nocase natsort │ BINARY     │
│ unordered      │ C421       │   │ natsort        │ C419       │   │ nocase         │ B412       │   │ natsort nocase │ B412       │   │ nocase natsort │ C419       │
│ unordered      │ C419       │   │ natsort        │ C421       │   │ nocase         │ BINARY     │   │ natsort nocase │ BINARY     │   │ nocase natsort │ C421       │
│ unordered      │ 1          │   │ natsort        │ C421       │   │ nocase         │ C419       │   │ natsort nocase │ C419       │   │ nocase natsort │ C421       │
│ unordered      │ 2          │   │ natsort        │ C4190      │   │ nocase         │ C4190      │   │ natsort nocase │ C4190      │   │ nocase natsort │ C4190      │
│ unordered      │ 10         │   │ natsort        │ D1234      │   │ nocase         │ c420       │   │ natsort nocase │ c420       │   │ nocase natsort │ D1234      │
│ unordered      │ a5         │   │ natsort        │ DICTIONARY │   │ nocase         │ c420       │   │ natsort nocase │ c420       │   │ nocase natsort │ DICTIONARY │
│ unordered      │ c420       │   │ natsort        │ a5         │   │ nocase         │ C421       │   │ natsort nocase │ C421       │   │ nocase natsort │ a5         │
│ unordered      │ c420       │   │ natsort        │ a5         │   │ nocase         │ C421       │   │ natsort nocase │ C421       │   │ nocase natsort │ a5         │
│ unordered      │ C421       │   │ natsort        │ c420       │   │ nocase         │ d1234      │   │ natsort nocase │ d1234      │   │ nocase natsort │ c420       │
│ unordered      │ d1234      │   │ natsort        │ c420       │   │ nocase         │ D1234      │   │ natsort nocase │ D1234      │   │ nocase natsort │ c420       │
│ unordered      │ d345       │   │ natsort        │ d34        │   │ nocase         │ d234       │   │ natsort nocase │ d234       │   │ nocase natsort │ d34        │
│ unordered      │ d234       │   │ natsort        │ d234       │   │ nocase         │ d234       │   │ natsort nocase │ d234       │   │ nocase natsort │ d234       │
│ unordered      │ d234       │   │ natsort        │ d234       │   │ nocase         │ d34        │   │ natsort nocase │ d34        │   │ nocase natsort │ d234       │
│ unordered      │ d34        │   │ natsort        │ d345       │   │ nocase         │ d345       │   │ natsort nocase │ d345       │   │ nocase natsort │ d345       │
│ unordered      │ D1234      │   │ natsort        │ d1234      │   │ nocase         │ DICTIONARY │   │ natsort nocase │ DICTIONARY │   │ nocase natsort │ d1234      │
                                                                                                                                        └────────────────┴────────────┘
sqlite>

The joined collations ended up with just the last in sequence was used. Is this intended or should it be improved? Expected output would have been:

 ┌────────────────┬────────────┐
 │     sorted     │    name    │
 ├────────────────┼────────────┤
 │ nocase natsort │ 1          │
 │ nocase natsort │ 2          │
 │ nocase natsort │ 10         │
 │ nocase natsort │ a5         │
 │ nocase natsort │ a5         │
 │ nocase natsort │ B41        │
 │ nocase natsort │ B412       │
 │ nocase natsort │ BINARY     │
 │ nocase natsort │ C419       │
 │ nocase natsort │ c420       │
 │ nocase natsort │ c420       │
 │ nocase natsort │ C421       │
 │ nocase natsort │ C421       │
 │ nocase natsort │ C4190      │
 │ nocase natsort │ d34        │
 │ nocase natsort │ d234       │
 │ nocase natsort │ d234       │
 │ nocase natsort │ d345       │
 │ nocase natsort │ D1234      │
 │ nocase natsort │ d1234      │
 │ nocase natsort │ DICTIONARY │
 └────────────────┴────────────┘

Does the syntax diagram for the ordering-term of the select statement contain a bug or the parser, which allows repetition where the syntax would not allow it?

How to obtain a combined sorting without having to use a CTE to upper(name) that should be natsort'ed ? (case insensitive natsorting)

(32) By Aask (AAsk1902) on 2022-11-05 14:20:08 in reply to 31 updated by 32.1 [link] [source]

If

>My intention with the following test was to have a natsorted order which is independent on the case.

and your 

```
Expected output would have been:

 ┌────────────────┬────────────┐
 │     sorted     │    name    │
 ├────────────────┼────────────┤
 │ nocase natsort │ 1          │
 │ nocase natsort │ 2          │
 │ nocase natsort │ 10         │
 │ nocase natsort │ a5         │
 │ nocase natsort │ a5         │
 │ nocase natsort │ B41        │
 │ nocase natsort │ B412       │
 │ nocase natsort │ BINARY     │
 │ nocase natsort │ C419       │
 │ nocase natsort │ c420       │
 │ nocase natsort │ c420       │
 │ nocase natsort │ C421       │
 │ nocase natsort │ C421       │
 │ nocase natsort │ C4190      │
 │ nocase natsort │ d34        │
 │ nocase natsort │ d234       │
 │ nocase natsort │ d234       │
 │ nocase natsort │ d345       │
 │ nocase natsort │ D1234      │
 │ nocase natsort │ d1234      │
 │ nocase natsort │ DICTIONARY │
 └────────────────┴────────────┘
``` 

Why no simply sort by lower/upper case NAME?

```
sqlite> WITH tblUSER (NAME)
   ...> AS (
   ...>         VALUES ('BINARY'),
   ...>                 ('DICTIONARY'),
   ...>                 ('B412'),
   ...>                 ('a5'),
   ...>                 ('C4190'),
   ...>                 ('B41'),
   ...>                 ('C421'),
   ...>                 ('C419'),
   ...>                 ('1'),
   ...>                 ('2'),
   ...>                 ('10'),
   ...>                 ('a5'),
   ...>                 ('c420'),
   ...>                 ('c420'),
   ...>                 ('C421'),
   ...>                 ('d1234'),
   ...>                 ('d345'),
   ...>                 ('d234'),
   ...>                 ('d234'),
   ...>                 ('d34'),
   ...>                 ('D1234')
   ...>         )
   ...> SELECT [NAME]
   ...> FROM tblUSER
   ...> ORDER BY LOWER([NAME]);
NAME
----------
1
10
2
a5
a5
B41
B412
BINARY
C419
C4190
c420
c420
C421
C421
d1234
D1234
d234
d234
d34
d345
DICTIONARY
sqlite>
```

(32.1) By Aask (AAsk1902) on 2022-11-05 14:21:24 edited from 32.0 in reply to 31 updated by 32.2 [link] [source]

Deleted

(32.2) By Aask (AAsk1902) on 2022-11-05 15:43:02 edited from 32.1 in reply to 31 [link] [source]

Deleted

(33) By Keith Medcalf (kmedcalf) on 2022-11-05 17:10:17 in reply to 31 [link] [source]

A collation is used for comparison of two items to determine whether one is bigger than the other, or they are equal. The underlying table data is not modified during this process.

Therefore, you cannot "stack" collations. If you specified something like:

select 'jimmy' collate nocase collate natsort collate unicode;

then even if all the collations where processed in order specified (left to right), then only the last collation would have any affect because each of the collations acts on the same underlying data -- the data is not modified by the nocase collation then the nocased data passed to natsort which does its mutilations and passes its mutilated data to the unicode collation.

The collate keyword attaches a collation sequence (how to compare) to a "value". If there are multiple collation sequences attached to "an expression" then there are documented rules about which one is used. Ony one collation is ever used at a time. Each "attachment" of a collating sequence to a value (via the collate keyword) "detaches and discards from the value" any previously attached collation.

That is:

select a, b from t1 where a collate nocase collate natsort == b collate natsort collate nocase;

will use one and only one collation when doing the comparison a == b, and the values a and b will not be modified. This will use the natsort collation because each collate clause replaces the previously specified collation, and preference is given to the collation attached to the left-most argument.

See https://sqlite.org/datatype3.html for documentation detail.

(34) By Keith Medcalf (kmedcalf) on 2022-11-05 17:17:04 in reply to 31 [link] [source]

And also, everywhere than a collate clause may be attached, it may be attached multiple times, with the last one specified having effect. The diagrams do not indicate this -- they only indicate that one collate clause can be used, it doe snot appear to be documented that you may repeat the collate clause with only the last one having effect.

That is create table x(x text collate nocase collate natsort collate unicode); will attach the collation called unicode to the column x of table x.

(35) By Larry Brasfield (larrybr) on 2022-11-05 17:23:33 in reply to 33 [link] [source]

Reinforcing the result of "stacking" collations as Keith explains: The operator precedence table shows which of multiple collation specs finally takes effect. Note that "COLLATE <collation name>" acts as a postfix operator syntactically and, per the text below the table, the last one wins semantically.

(36) By Larry Brasfield (larrybr) on 2022-11-05 17:30:00 in reply to 34 [link] [source]

The diagrams do not indicate this -- they only indicate that one collate clause can be used ...

Not so. The expression syntax shows that <expr> COLLATE (collation-name) yields another <expr>. That <expr> can, of course, participate in another expression appending a collate clause, ad infinitum.

(37) By Keith Medcalf (kmedcalf) on 2022-11-05 17:32:46 in reply to 36 [link] [source]

Yes, you are, of course, correct!

(38) By cj (sqlitening) on 2022-11-07 04:26:25 in reply to 1 [link] [source]

Just realized this is an old thread with a solution.
This would have been the way I would have done it in an accounting program.

create table if not exists t1(category text collate nocase, code integer)
// create index i on t1(category,code)
insert into t1 values('Z',2)
insert into t1 values('z',1)
insert into t1 values('Z',11)
select category||code from t1 order by category,code
z1
Z2
Z11

(39) By Hamish Allan (hatfinch) on 2023-11-17 11:23:36 in reply to 26 [link] [source]

Bumping this question!