SQLite User Forum

Lemon: msort loses data (under far-fetched conditions)

(1) By Sam Atman (mnemnion) on 2025-06-05 15:59:39

This is a fun one!

Realistic inputs will never trigger the bug, and given the program's other bounds, such as its use of int, it may or may not even be technically possible to trigger. In any case the fix is easy, and should have no other implications.

Let's take a look. Here's msort:

#define LISTSIZE 30
static char *msort(
  char *list,
  char **next,
  int (*cmp)(const char*,const char*)
){
  unsigned long offset;
  char *ep;
  char *set[LISTSIZE];
  int i;
  offset = (unsigned long)((char*)next - (char*)list);
  for(i=0; i<LISTSIZE; i++) set[i] = 0;
  while( list ){
    ep = list;
    list = NEXT(list);
    NEXT(ep) = 0;
    /* The problem is here: */
    for(i=0; i<LISTSIZE-1 && set[i]!=0; i++){
      ep = merge(ep,set[i],cmp,offset);
      set[i] = 0;
    }
    set[i] = ep;
  }
  ep = 0;
  for(i=0; i<LISTSIZE; i++) if( set[i] ) ep = merge(set[i],ep,cmp,offset);
  return ep;
}

The problem occurs if i == LISTSIZE - 1 a second time. If we think of each cell as a bit, we're counting up a little-endian binary number, where 0 is a null pointer and 1 is any valid pointer. So this happens at exactly the 2^30th node: precisely 1,073,741,824 nodes' worth of data. At that moment our 'counter' reads 2^30 - 1: all ones, from bit 0 to bit 29. But the loop body only runs for i up to 28 before the bound forces it out.

Every cell in set has a list pointer, including the last one, which the loop doesn't visit or check. Result: set[i] = ep; occurs without a null check taking place, but this time it's not null.

If that did happen, it's bad: the one 'bit' we clobber holds the first 2^29 nodes of the input: half the data! Because most counters use int, whose maximum is 2^31 - 1 on modern systems, I conjecture that this condition can in fact be triggered before other things like overflow start making life wonky. I haven't attempted to trigger it, though: this is a logic bug report, not a fuzz report.

I know that, unlike Pikchr, SQLite, or parsers generated by Lemon, Lemon itself is allowed to go off the rails if given dumb or malicious data. However, the fix is simple, and should be free. We change the loop condition to look like this:

    for(i=0; i<LISTSIZE && set[i]!=0; i++){

and add a postcondition:

    if( i==LISTSIZE ) i--;
    set[i] = ep;

I state this will be 'free', not 'cheap', because the compiler can optimize it by merging the loop's exit check with the if that follows: on the final pass it can simply decline to increment and still leave the loop. We're left with just one comparison against LISTSIZE, not two, and the generated code never actually decrements.

That's just hard to write in C, but the assembly is perfectly straightforward, and I am confident to near the point of certainty that this specific type of rearrangement is known to compiler optimization developers already.

When our far-fetched number of nodes counts up to this position, we'll grab that final 'bit' and merge it, then put it back right where we found it. That would be a problem if we were counting, but we're collating, so it's just what we want.

This bug is in very good company: see "Nearly All Binary Searches and Mergesorts Are Broken" for another standard algorithm which was taught and implemented a certain way for many years while being, the whole time, invalid for extremely large inputs. Despite the title mentioning mergesorts, that one is a different bug, involving signed overflow.

In the above case, I struggle to conceive of any circumstance, short of a deliberate fuzzer, under which the behavior would ever trigger. I haven't finished looking at everything which generates linked lists in Lemon; my impression is that some of those conditions are superlinear in the number of rules or symbols, but none of them look actually quadratic. But say one of them is, for the sake of argument: you'd need 2^15 (32,768) symbols or rules to get into trouble. Such conditions are wholly artificial, and that's a hand-waving lower bound which makes pessimistic assumptions.

I want to end this just by saying how much fun I'm having with this port. I'm learning a lot about how LALR actually works, spotting some techniques which will really come in handy in the main stream of my research, and discovering various quirks and hidden features of the program (I'm keeping detailed notes and will share them when the project is complete). Thank you, Richard, for dedicating your code to the public domain, thereby making this possible.

And for the avoidance of doubt: I am of the opinion that a correction to a merge sort is not eligible for copyright, as it lacks in any creative expression and concerns itself purely with known facts. Yet, out of an abundance of caution, I hereby disclaim copyright to the entire text of this post, be it code or otherwise.

(2) By Sam Atman (mnemnion) on 2025-06-06 21:46:21 in reply to 1

I guess I wanted to add that I see one reason to consider applying the change, which is that it canonicalizes the code.

Lemon's behavior will never change under any plausible circumstances due to this latent condition. But if someone were to encounter the merge routine, and think "oh, this is very clever, I bet I can use this", copy it elsewhere, and then either employ it on a very long list, or lower LISTSIZE, that could trigger the incorrect behavior. And it is quite clever: making the routine type-generic by passing the head pointer, and a pointer to the next field, so that the byte offset can be calculated accordingly. That kind of know-how is not as widespread as once it was.

This is something like why I noticed, actually: I plan to add merge sort to a Zig library which provides list operations on an arbitrary intrusive linked list, so I was thinking about whether the routine is valid everywhere, a different question from whether it's valid where I found it (which it is).

I do imagine that the era of CVE hunters has cooled your interest in reports which amount to "condition impossible by construction is not checked". I am not so motivated. But once I saw it, it seemed right to bring it to your attention.