Overriding collation of IN expressions

(1.1) Originally by Mark Lawrence (mark) with edits by Richard Hipp (drh) on 2020-05-20 15:59:55 from 1.0 [link]

The documentation at [\[1\]][1] on collation says:

     The rules for determining which collating function to use for a
     binary comparison operator (=, <, >, <=, >=, !=, IS, and IS NOT)
     are as follows:

        1. If either operand has an explicit collating function
        assignment using the postfix COLLATE operator, then the
        explicit collating function is used for comparison, with
        precedence to the collating function of the left operand.

It also mentions the following:

    The expression "x IN (SELECT y ...)" is handled in the same way as
    the expression "x = y" for the purposes of determining the
    collating sequence. The collating sequence used for expressions of
    the form "x IN (y, z, ...)" is the collating sequence of x.

Given the above I would have expected that SELECT with an IN expression
to behave the same as a SELECT with an '=' expression when COLLATE
NOCASE is added to either one. But that does not seem to be the case:

    -- 3.32.0 2020-04-18 14:12:00 d5b0def96ba6d90f47bc96fab1ccf9c501d84885d086744035b16fd96f3e248c

    CREATE TABLE a(b TEXT);
    INSERT INTO a
    VALUES
        ('one'),
        ('ONE');

    SELECT *
    FROM a
    WHERE b = 'one' COLLATE nocase;

    -- b
    -- ----------
    -- one
    -- ONE

    SELECT *
    FROM a
    WHERE b IN ('one') COLLATE nocase;

    -- b
    -- ----------
    -- one

This appear to be an oversight in (hopefully) the implementation or the
documentation.

[1]: https://www.sqlite.org/datatype3.html#assigning_collating_sequences_from_sql

(2) By Richard Hipp (drh) on 2020-05-20 16:22:38 in reply to 1.1

In the [syntax diagram for expressions][1], we see that the COLLATE operator
is applied to an "expr".  The COLLATE operator has high precedence, so it binds
more tightly that most everything else.  For the phrase:

~~~~
    b IN ('one') COLLATE nocase
~~~~

The "`('one')`" part is *not* an expression. That subphrase is part of the
IN expression.  Hence, the expression must be parsed thusly:

~~~~
    (b IN ('one')) COLLATE nocase
~~~~

So, in other words, the COLLATE operator is being applied to the result of
the IN operator, not to the operands of the IN operator.

Contrast the above with the equality constraint:

~~~~
    b == 'one' COLLATE nocase
~~~~

Here, the "`'one'`" part is an expression, and so the COLLATE operator
is able to bind to it.  And since COLLATE has higher precedence than ==,
we get the following parse:

~~~~
    b == ('one' COLLATE nocase);
~~~~

So here, the COLLATE operator applies to the right operand of the == operator
and thus affects the operation of the == operator.

If you want to apply a collation to an IN operator, you need to put the
COLLATE operator on the left operand, which is an expression.  Hence, you
probably want this:

~~~~
    b COLLATE nocase IN ('one')
~~~~

Which parses out as:

~~~~
    (b COLLATE nocase) IN ('one')
~~~~

[1]: https://www.sqlite.org/lang_expr.html

(3) By Mark Lawrence (mark) on 2020-05-20 16:49:16 in reply to 2 [link]

Thanks Richard. That makes it clear why it works (or doesn't) the way it does. 

New syntax I learned today:

        sqlite> select 'b' collate nocase = 'B';
        'b' collate nocase = 'B'
        ------------------------
        1

This positioning of collate is new to me... Can I suggest an additional sentence in datatype3.html:

        The collating sequence used for expressions of the form
        "x IN (y, z, ...)" is the collating sequence of x. If an explicit
        collating sequence is required it should be applied to the left hand
        side of the expression: "x COLLATE nocase IN (y,z, ...)"