SQLite Forum

Natural sort order
Login

Natural sort order

(1) By anonymous on 2020-03-27 14:34:40 [link] [source]

Is there any way to order by "natural sort order" in sqlite?

i.e. z2, z11 instead of the lexicographical z11, z2.

https://en.wikipedia.org/wiki/Natural_sort_order

Thanks, Hamish

(2.1) By Richard Hipp (drh) on 2020-03-27 15:27:50 edited from 2.0 in reply to 1 [link] [source]

How do PostgreSQL, MySQL, SQL Server, and Oracle do this?

Edit → I mean: What syntax do they use, not how they implement it.

(3) By Stephan Beal (stephan) on 2020-03-27 15:38:00 in reply to 2.1 [link] [source]

Apparently PG doesn't do this natively, but there are a several solutions floating around:

Those both use a function, so that the caller does ORDER BY naturalsort(...).

This post:

Has several PG-only takes on the problem.

(4.1) By Richard Hipp (drh) on 2020-03-27 16:51:16 edited from 4.0 in reply to 3 [link] [source]

Two questions:

First, the PG solutions all seem to define a new SQL function that returns a sort key. This is fine, but it seems more "natural" to me to create a new collating sequence. So the PG solution is:

    SELECT * FROM user ORDER BY naturalsort(name);

But wouldn't it be nicer to do:

    SELECT * FROM user ORDER BY name COLLATE natural;

Second thing: I prefer the idea of "dictionary" sort order. Dictionary order does the same magic with integers, but also mixes case differences so that "a" comes after "A" but before "B" rather than after "Z". It seems like this is even more natural that just putting numeric parts in numeric order, does it not?

BINARY   DICTIONARY
B41   a5
C419   B41
C421   C419
a5   c420
c420   C421
d1234   d34
d234   d234
d34   d1234

If we add something like this to SQLite, I would want it to be full "dictionary" order not "natural" order. In other words, I would want it to take case into account as a tie-breaker.

(5.1) By ddevienne on 2020-03-27 16:42:14 edited from 5.0 in reply to 4.0 [link] [source]

FWIW, we've had a collate NATURAL_ORDER in our app for years.
Windows Explorer uses Natural Ordering, and many users as used
to it, and request it of software vendors.

We support both case-sensitive and case-insensitive versions,
but not your Dictionary Order, which is a mix of both. Interesting.
I never considered that particular ordering.

PS: naive implementations convert to integers, and risk overflow,
while it's relatively easy to support arbitrarily long integers.

(6) By Hamish Allan (hatfinch) on 2020-03-27 16:45:07 in reply to 4.0 [link] [source]

Both of these seem like a good idea to me.

On the naming of "dictionary sort order", though, I couldn't find any definition by that name in a brief online search, and the Wikipedia entry for lexicographical ordering lists it as a synonym, which might be confusing.

(7) By Gerry Snyder (GSnyder) on 2020-03-27 17:39:35 in reply to 4.1 [link] [source]

I don't suppose that "dictionary" order could include accented letters too, so that é would sort next to e?

(8) By Simon Slavin (slavin) on 2020-03-27 18:12:13 in reply to 1 [link] [source]

You can define your own collation function which does whatever you like.

https://sqlite.org/c3ref/create_collation.html

Call it, presumably, 'NORMAL'.

If you do this, you might want to make sure it can cope with triplets like this:

book11page8
book3page4
book11page17

floor2room16b
floor1room7
floor2room16

16.52.124.48
82.42.52.145
127.0.0.1

and not just digits at the end of the string.

(9) By Andreas Kupries (andreas-kupries) on 2020-03-27 18:25:06 in reply to 5.1 [link] [source]

The Tcl implementation of dictionary sorting does not convert to integers. The link references the DictionaryCompare function called by the higher framework to compare two strings.

(10) By Richard Hipp (drh) on 2020-03-27 18:50:10 in reply to 8 [link] [source]

Why the name "NORMAL"?

Note that "NATURAL" is also not a good name since "NATURAL" is a keyword in SQL - specifically as part of the "NATURAL JOIN" syntax. Suggestions for alternative names?

(11) By Richard Hipp (drh) on 2020-03-27 19:04:16 in reply to 1 [link] [source]

The code below implements a run-time loadable extension for SQLite that implements a "MIXED" collating sequence that does what you want, I think. I hereby toss the code over the wall to the community for testing, criticism, and discussion. I'm particularly interested in suggests for a name that is better than "MIXED".

To compile following the instructions on compiling loadable extensions found in the documentation.

To load the extension from the [command-line shell] put the shared library that you compiled in the working directory and enter:

.load ./mixedcoll

To use it, just add "COLLATE mixed" after the ORDER BY clause terms that you want sorted in "natural" order. Or add COLLATE mixed to terms on your indexes, or on the columns of your table definitions.

The Code:

/*
** 2020-03-27
**
** The author disclaims copyright to this source code.  In place of
** a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
******************************************************************************
**
** Implement a collating sequence that sorts (unsigned) integers embedded 
** in the middle of text in numeric order.
*/
#include "sqlite3ext.h"
SQLITE_EXTENSION_INIT1
#include <ctype.h>

/*
** Collating function that compares text byte-by-byte but compares
** digits in numeric order.
*/
static int mixedCollFunc(
  void *notUsed,
  int nKey1, const void *pKey1,
  int nKey2, const void *pKey2
){
  const unsigned char *zA = (const unsigned char*)pKey1;
  const unsigned char *zB = (const unsigned char*)pKey2;
  int i, x;
  for(i=0; i<nKey1 && i<nKey2; i++){
    x = zA[i] - zB[i];
    if( x!=0 ){
      int j;
      for(j=i; j<nKey1 && j<nKey2 && isdigit(zA[j]) && isdigit(zB[j]); j++){}
      if( j<nKey1 && isdigit(zA[j]) ){
        return +1;
      }else if( j<nKey2 && isdigit(zB[j]) ){
        return -1;
      }else{
        return x;
      }
    }
  }
  return nKey1 - nKey2;
}


#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_mixedcoll_init(
  sqlite3 *db, 
  char **pzErrMsg, 
  const sqlite3_api_routines *pApi
){
  int rc = SQLITE_OK;
  SQLITE_EXTENSION_INIT2(pApi);
  (void)pzErrMsg;  /* Unused parameter */
  rc = sqlite3_create_collation(db, "mixed", SQLITE_UTF8, 0, mixedCollFunc);
  return rc;
}

(12) By Richard Hipp (drh) on 2020-03-29 00:55:26 in reply to 1 [link] [source]

Here is a revised "natural sort" collating function. The name is now changed to "NATSORT", because I found a PHP and a Python module that also use that name.

The previous implementation sorted embedded unsigned integers in numeric order and last. This implementation sorts embedded unsigned integers in numeric order, but otherwise in ASCII order.

Please test and report back if this collating sequence is useful. If it is, perhaps it might land in the next release of SQLite.

The Code


/*
** 2020-03-27
**
** The author disclaims copyright to this source code.  In place of
** a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
******************************************************************************
**
** Implement a collating sequence that sorts embedded unsigned integers
** in numeric order.
*/
#include "sqlite3ext.h"
SQLITE_EXTENSION_INIT1
#include <ctype.h>

/*
** Collating function that compares text byte-by-byte but compares
** digits in numeric order.
*/
static int natSortCollFunc(
  void *notUsed,
  int nKey1, const void *pKey1,
  int nKey2, const void *pKey2
){
  const unsigned char *zA = (const unsigned char*)pKey1;
  const unsigned char *zB = (const unsigned char*)pKey2;
  int i, x;
  for(i=0; i<nKey1 && i<nKey2; i++){
    x = zA[i] - zB[i];
    if( x!=0 ){
      int j;
      for(j=i; j<nKey1 && j<nKey2 && isdigit(zA[j]) && isdigit(zB[j]); j++){}
      if( i==j && (i==0 || !isdigit(zA[i-1])) ){
        return x;
      }else if( j<nKey1 && isdigit(zA[j]) ){
        return +1;
      }else if( j<nKey2 && isdigit(zB[j]) ){
        return -1;
      }else{
        return x;
      }
    }
  }
  return nKey1 - nKey2;
}


#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_natsort_init(
  sqlite3 *db, 
  char **pzErrMsg, 
  const sqlite3_api_routines *pApi
){
  int rc = SQLITE_OK;
  SQLITE_EXTENSION_INIT2(pApi);
  (void)pzErrMsg;  /* Unused parameter */
  rc = sqlite3_create_collation(db, "natsort", SQLITE_UTF8, 0, natSortCollFunc);
  return rc;
}

(13) By Richard Hipp (drh) on 2020-03-29 01:03:55 in reply to 12 [link] [source]

The two functions above might return surprising results for integers with leading zeros.

    SELECT 'x123y' < 'x00005y' COLLATE natsort;

The above returns true. Should it? Or should the answer be false? What should this return:

    SELECT 'x123y' == 'x0123y' COLLATE natsort;

(14) By Clemens Ladisch (cladisch) on 2020-03-29 07:53:31 in reply to 13 [link] [source]

It is definitely more "natural" to ignore leading zeros.

https://natsort.readthedocs.io/en/master/howitworks.html talks about the design decisions made in the Python module. While I doubt that you'd want to support all those features, at least ignoring case would be useful.

(15) By anonymous on 2020-03-29 09:35:18 in reply to 14 [link] [source]

"It is definitely more "natural" to ignore leading zeros."

Unless you are sorting US ZIP codes, for example.

(16) By Richard Hipp (drh) on 2020-03-29 20:44:31 in reply to 1 [link] [source]

The lastest version of NATSORT (included below) sorts unsigned numbers in numeric order, correctly ignoring leading zeros.

Please try out the new code. Report any issues you find, and also report whether or not you are likely to find this collating sequence useful.

The code:


/*
** 2020-03-27
**
** The author disclaims copyright to this source code.  In place of
** a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
******************************************************************************
**
** Implement a collating sequence that sorts embedded unsigned integers
** in numeric order.
*/
#include "sqlite3ext.h"
SQLITE_EXTENSION_INIT1
#include <ctype.h>
#include <string.h>

/*
** Collating function that compares text byte-by-byte but compares
** digits in numeric order.
*/
static int natSortCollFunc(
  void *notUsed,
  int nKey1, const void *pKey1,
  int nKey2, const void *pKey2
){
  const unsigned char *zA = (const unsigned char*)pKey1;
  const unsigned char *zB = (const unsigned char*)pKey2;
  int i=0, j=0, x;
  while( i<nKey1 && j<nKey2 ){
    x = zA[i] - zB[j];
    if( isdigit(zA[i]) ){
      int k;
      if( !isdigit(zB[j]) ) return x;
      while( zA[i]=='0' && i<nKey1 ){ i++; }
      while( zB[j]=='0' && j<nKey2 ){ j++; }
      k = 0;
      while( i+k<nKey1 && isdigit(zA[i+k]) && j+k<nKey2 && isdigit(zB[j+k]) ){
        k++;
      }
      if( i+k<nKey1 && isdigit(zA[i+k]) ){
        return +1;
      }else if( j+k<nKey2 && isdigit(zB[j+k]) ){
        return -1;
      }else{
        x = memcmp(zA+i, zB+j, k);
        if( x ) return x;
        i += k;
        j += k;
      }
    }else if( x ){
      return x;
    }else{
      i++;
      j++;
    }
  }
  return (nKey1 - i) - (nKey2 - j);
}


#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_natsort_init(
  sqlite3 *db, 
  char **pzErrMsg, 
  const sqlite3_api_routines *pApi
){
  int rc = SQLITE_OK;
  SQLITE_EXTENSION_INIT2(pApi);
  (void)pzErrMsg;  /* Unused parameter */
  rc = sqlite3_create_collation(db, "natsort", SQLITE_UTF8, 0, natSortCollFunc);
  return rc;
}

(17) By Mark Lawrence (mark) on 2020-03-29 21:45:18 in reply to 16 [link] [source]

I haven't tried out the code (yet) but I would definately use this collation as a replacement where I am currently relying on a user-defined function.

(18) By ddevienne on 2020-03-30 08:20:11 in reply to 16 [link] [source]

FWIW, my impl uses the trick below, to avoid any isdigit() in many cases. --DD

const int diff = l_char - r_char;
if (-9 > diff || diff > 9) {
  // *both* chars cannot be digits, so normal lexicographical compare,
  // even for the mixed char-digit or digit-char cases.
  return diff;
}

Our impls are essentially the same (including skipping leading zeros),
I just didn't think of using memcmp() myself, I reloop on the chars.

(19) By Neal (_Neal_) on 2020-04-02 21:09:02 in reply to 10 [link] [source]

Suggestions for alternative names?

GNU coreutils (sort and ls) have a similar -v option called "Version Sort"(https://www.gnu.org/software/coreutils/manual/html_node/Version-sort-overview.html)

(20) By Peter da Silva (resuna) on 2020-04-03 14:14:19 in reply to 7 [link] [source]

That opens up Unicode collation and localization and that's a whole new can of worms.

(21) By Peter da Silva (resuna) on 2020-04-03 14:16:06 in reply to 15 [link] [source]

You wouldn't use this sort on zip codes, would you?

(22) By Hamish Allan (hatfinch) on 2020-04-03 14:35:33 in reply to 21 [link] [source]

Wouldn't it work anyway, because you'd only be comparing numbers of the same fixed length? (xxxxx or xxxxx-yyyy)

(23) By Dave B (DaveBlake) on 2020-05-06 17:33:57 in reply to 16 [link] [source]

Richard having natural sorting of this kind available as a standard collation in SQLite would be useful.

Actually I am working on a bespoke collation that handles number naturally like this, is case insensitive and does some simple accent folding so 'a', 'á' and 'À' are with 'A'. However the result is slower than I would like, and I think that is partly because of the way I am converting from the const void data of the collation callback function into wstring before processing compare even though most of the data is simple ascii.Need to give that more thought.

(24) By TripeHound on 2020-05-06 21:20:42 in reply to 23 [link] [source]

simple accent folding so 'a', 'á' and 'À' are with 'A'

Just be aware that in some languages it is not (always) correct to simply "fold" (i.e. ignore) accents when sorting (for example, in Swedish – apparently – the letters 'å', 'ä', and 'ö' are considered distinct letters, not 'a' and 'o' with diacritics). See this answer on StackExchange or Diacritic on Wikipedia.

(25) By Dave B (DaveBlake) on 2020-05-06 21:47:57 in reply to 24 [link] [source]

Thanks TripeHound, I am aware that "accent folding" is not the same as full ICU locale specific collation. However the need is to reproduce the same collation using SQLlite as utf8_general_ci in MySQL (now called utf8mb3_general_ci in MariaDB). That is just simple accent folding.

(26) By anonymous on 2021-10-07 16:44:11 in reply to 16 [source]

Hi Richard, I'm using this as a loadable extension and it's working great. Is there any update on whether this will ship in SQLite without having to use a loadable module any time soon?