fts3expr4-1.8 test failure on x86

(1) By Sam James (thesamesam) on 2022-11-21 08:02:05 [source]

Congratulations on another release and thanks. Noticed the following when packaging SQLite 3.40.0 for Gentoo:

```
Time: zerodamage.test 8 ms
Time: zipfile.test 180 ms
Time: zipfile2.test 10 ms
SQLite 2022-11-16 12:10:08 89c459e766ea7e9165d0beeb124708b955a4950d0f4792f457465d71b158alt1
1 errors out of 259912 tests on localhost Linux 32-bit little-endian
!Failures on these tests: fts3expr4-1.8
All memory allocations freed - no leaks
Maximum memory usage: 9194172 bytes
Current memory usage: 0 bytes
Number of malloc() : -1 calls
make: *** [Makefile:1302: tcltest] Error 1

ERROR: dev-db/sqlite-3.40.0::gentoo failed (test phase):
emake failed
```

This is a multilib build on an amd64/64-bit system.

Note that this is with atof tests skipped because of https://sqlite.org/forum/forumpost/d97caf168f, https://sqlite.org/forum/forumpost/50f136d91d.

(2) By Dan Kennedy (dan) on 2022-11-21 14:15:25 in reply to 1 [link] [source]

Can't reproduce this here. What is the output of running:

./testfixture test/fts3expr4.test

by itself from the root of the source tree?

Dan.

(3) By Sam James (thesamesam) on 2022-11-22 05:14:51 in reply to 2 [link] [source]

Hi Dan, thanks for the reply.

/var/tmp/portage/dev-db/sqlite-3.40.0/work/sqlite-src-3400000-abi_x86_32.x86 # ./testfixture test/fts3expr4.test
fts3expr4-1.1... Ok
fts3expr4-1.2... Ok
fts3expr4-1.3... Ok
fts3expr4-1.4... Ok
fts3expr4-1.5... Ok
fts3expr4-1.6... Ok
fts3expr4-1.7... Ok
fts3expr4-1.8...
! fts3expr4-1.8 expected: [PHRASE 3 0 d:word]
! fts3expr4-1.8 got:      [AND {AND {PHRASE 3 0 d} {PHRASE 3 0 :}} {PHRASE 3 0 word}]
fts3expr4-2.1... Ok
fts3expr4-3.1... Ok
fts3expr4-3.2... Ok
fts3expr4-3.3... Ok
fts3expr4-3.4... Ok
fts3expr4-3.5... Ok
fts3expr4-3.6... Ok
fts3expr4-3.7... Ok
fts3expr4-3.8... Ok
fts3expr4-3.8... Ok
fts3expr4-3.9... Ok
fts3expr4-3.10... Ok
SQLite 2022-11-16 12:10:08 89c459e766ea7e9165d0beeb124708b955a4950d0f4792f457465d71b158alt1
1 errors out of 21 tests on mop Linux 32-bit little-endian
!Failures on these tests: fts3expr4-1.8
All memory allocations freed - no leaks
Memory used:          now          0  max      63352  max-size      48000
Allocation count:     now          0  max        242
Page-cache used:      now          0  max          0  max-size       1032
Page-cache overflow:  now          0  max       1036
Maximum memory usage: 63352 bytes
Current memory usage: 0 bytes
Number of malloc()  : -1 calls

(4) By Dan Kennedy (dan) on 2022-11-22 11:01:04 in reply to 3 [link] [source]

So, it looks like the test is failing because the ICU library is tokenizing (API ubrk_next() and friends) "d:word" into three tokens, instead of leaving it as a single token.

What is the libicu version on the machine? (`icuinfo | grep version`)

Do you have any environment variables or such that change your locale to something we haven't previously tested with?

Dan.

(5) By Sam James (thesamesam) on 2022-11-22 21:32:53 in reply to 4 [link] [source]

It's the pretty new ICU 72.1 release:

$ icuinfo | grep version
Plugins are disabled.
    <param name="version">72.1</param>
    <param name="version.unicode">15.0</param>
    <param name="cldr.version">42.0</param>
    <param name="tz.version">2022e</param>

I could try downgrade it but it's a bit tricky because of its unstable ABI. If you run out of ideas, I can do it though.

I don't really have any interesting locale or env vars, just:

$ env | grep -i en_GB
LANG=en_GB.UTF-8
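
For anyone checking their own setup: a grep for en_GB only shows variables whose value already contains en_GB, so an LC_ALL or LC_* variable set to some other value would not appear in that output. The sketch below is a minimal illustration of the usual POSIX precedence (LC_ALL, then the category variable, then LANG). The function name `pick_locale` is hypothetical, and whether ICU resolves its default locale in exactly this order is an assumption, not something established in this thread.

```c
#include <stddef.h>

/*
** Hypothetical helper (not part of SQLite or ICU). Given the values of
** LC_ALL, one LC_<category> variable and LANG (any may be NULL or ""),
** return the one that takes effect under POSIX precedence, or "C" if
** none of them is set.
*/
const char *pick_locale(const char *zLcAll, const char *zLcCat, const char *zLang){
  const char *az[3];
  size_t i;
  az[0] = zLcAll;   /* highest precedence */
  az[1] = zLcCat;   /* e.g. the value of LC_CTYPE */
  az[2] = zLang;    /* lowest precedence */
  for(i=0; i<3; i++){
    /* An empty string is treated the same as an unset variable. */
    if( az[i]!=0 && az[i][0]!='\0' ) return az[i];
  }
  return "C";
}
```

Called as `pick_locale(getenv("LC_ALL"), getenv("LC_CTYPE"), getenv("LANG"))` with `<stdlib.h>` included, this would return C.UTF-8 if LC_ALL=C.UTF-8 were set, even with LANG=en_GB.UTF-8.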

(6) By Dan Kennedy (dan) on 2022-11-23 11:10:30 in reply to 5 [link] [source]

> I could try downgrade it but it's a bit tricky because of its unstable ABI. If you run out of ideas, I can do it though.

Sounds a bit drastic...

Can you try building the program below (`gcc file.c -licuuc`) and then running it as follows to see if you get the same output as me?

    $ ./a.out hello:world
    word: (0) "hello:world"
    done: (11) ""
    $ ./a.out hello,world
    word: (0) "hello"
    word: (5) ","
    word: (6) "world"
    done: (11) ""

Thanks,

Dan.

/**********************************************************************/
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>

#include <unicode/ubrk.h>

int main(int argc, char **argv){
  UBreakIterator *pIter = 0;
  UErrorCode status = U_ZERO_ERROR;
  char *zIn = 0;
  int nIn = 0;

  UChar *aChar = 0;
  int nChar = 0;
  int32_t i2 = 0;

  if( argc!=2 ){
    fprintf(stderr, "Usage: %s <text>\n", argv[0]);
    return 1;
  }
  zIn = argv[1];
  nIn = strlen(zIn);

  aChar = malloc(sizeof(UChar) * (nIn+1));
  u_strFromUTF8(aChar, nIn+1, &nChar, zIn, -1, &status);
  assert( U_SUCCESS(status) );

  pIter = ubrk_open(UBRK_WORD, 0, aChar, nChar, &status);  /* locale 0: use ICU's default locale */
  assert( U_SUCCESS(status) );

  ubrk_first(pIter);
  do {  /* i1/i2 are UTF-16 offsets; they match byte offsets in zIn only for ASCII input */
    int32_t i1 = ubrk_current(pIter);
    i2 = ubrk_next(pIter);
    if( i2!=UBRK_DONE ){
      fprintf(stdout, "word: (%d) \"%.*s\"\n", i1, i2-i1, &zIn[i1]);
    }else{
      fprintf(stdout, "done: (%d) \"%s\"\n", i1, &zIn[i1]);
    }
  }while( i2!=UBRK_DONE );

  ubrk_close(pIter);
  free(aChar);
  return 0;
}

(7.3) By Sam James (thesamesam) on 2022-11-24 03:35:44 edited from 7.2 in reply to 6 [link] [source]

(it is a bit, but I can definitely do it in a chroot if you need.)

Note that I needed to add:

#include <unicode/ustring.h>

to avoid

/tmp/icu.c: In function ‘main’:
/tmp/icu.c:27:3: warning: implicit declaration of function ‘u_strFromUTF8’ [-Wimplicit-function-declaration]
   27 |   u_strFromUTF8(aChar, nIn+1, &nChar, zIn, -1, &status);
      |   ^~~~~~~~~~~~~

Output just running it natively (so normal amd64, 64-bit):

$ /tmp/icu hello:world # w/ icu 72
word: (0) "hello"
word: (5) ":"
word: (6) "world"
done: (11) ""

$ /tmp/icu hello,world # w/ icu 72
word: (0) "hello"
word: (5) ","
word: (6) "world"
done: (11) ""

(8.2) By Sam James (thesamesam) on 2022-11-24 04:30:14 edited from 8.1 in reply to 6 [link] [source]

Deleted

(9) By Sam James (thesamesam) on 2022-11-24 04:36:14 in reply to 7.3 [link] [source]

Ionen Wolkens figured it out: C.utf8 splits it, but en_US.utf8 keeps it as hello:world (with 71.1), with 72.1 they act the same.

# ICU 71 + en_US.UTF8
$ export LC_ALL=en_US.UTF8
$ gcc -licuuc /tmp/icu.c -o /tmp/icu && /tmp/icu hello:world && echo -e "\n" && /tmp/icu hello,world
word: (0) "hello:world"
done: (11) ""

word: (0) "hello"
word: (5) ","
word: (6) "world"
done: (11) ""

$ export LC_ALL=C.UTF-8
$ gcc -licuuc /tmp/icu.c -o /tmp/icu && /tmp/icu hello:world && echo -e "\n" && /tmp/icu hello,world
word: (0) "hello"
word: (5) ":"
word: (6) "world"
done: (11) ""

word: (0) "hello"
word: (5) ","
word: (6) "world"
done: (11) ""

(10) By Dan Kennedy (dan) on 2022-11-24 18:09:12 in reply to 9 [link] [source]

> Ionen Wolkens figured it out: C.utf8 splits it, but en_US.utf8 keeps it as hello:world (with 71.1), with 72.1 they act the same.

Huh. Fair enough. Thanks for working this out.

We'll just change the test case to accept either answer. The point of the test is that the ":" gets passed through to the tokenizer, not what it does with it, anyway.

https://sqlite.org/src/info/a2b6883ac2ef878f
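
To illustrate the idea of accepting either answer, here is a minimal C sketch. It is not the actual change in the check-in linked above (the real test is a script, test/fts3expr4.test), and the function name `fts3expr_1_8_ok` is made up for illustration. The two accepted strings are the "expected" and "got" results from the failing run earlier in this thread.

```c
#include <stddef.h>
#include <string.h>

/*
** Hypothetical helper, not part of SQLite. Return 1 if zGot is one of
** the two parse trees reported in this thread for the query "d:word",
** or 0 otherwise.
*/
int fts3expr_1_8_ok(const char *zGot){
  static const char *const azOk[] = {
    /* The tokenizer keeps "d:word" as a single token */
    "PHRASE 3 0 d:word",
    /* The tokenizer splits it into "d", ":" and "word" */
    "AND {AND {PHRASE 3 0 d} {PHRASE 3 0 :}} {PHRASE 3 0 word}"
  };
  size_t i;
  if( zGot==0 ) return 0;
  for(i=0; i<sizeof(azOk)/sizeof(azOk[0]); i++){
    if( strcmp(zGot, azOk[i])==0 ) return 1;
  }
  return 0;
}
```

Either way the ":" has reached the tokenizer, which is the property the test cares about; any other result is still reported as a failure.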

Dan.

(11) By Sam James (thesamesam) on 2022-12-06 11:54:54 in reply to 10 [link] [source]

Thanks a bunch - that works great!