Overriding collation of IN expressions

(1.1) Originally by Mark Lawrence (mark) with edits by Richard Hipp (drh) on 2020-05-20 15:59:55 from 1.0 [link] [source]

The documentation at [1] on collation says:

 The rules for determining which collating function to use for a
 binary comparison operator (=, <, >, <=, >=, !=, IS, and IS NOT)
 are as follows:

    1. If either operand has an explicit collating function
    assignment using the postfix COLLATE operator, then the
    explicit collating function is used for comparison, with
    precedence to the collating function of the left operand.

It also mentions the following:

The expression "x IN (SELECT y ...)" is handled in the same way as
the expression "x = y" for the purposes of determining the
collating sequence. The collating sequence used for expressions of
the form "x IN (y, z, ...)" is the collating sequence of x.

Given the above I would have expected that SELECT with an IN expression to behave the same as a SELECT with an '=' expression when COLLATE NOCASE is added to either one. But that does not seem to be the case:

-- 3.32.0 2020-04-18 14:12:00 d5b0def96ba6d90f47bc96fab1ccf9c501d84885d086744035b16fd96f3e248c

CREATE TABLE a(b TEXT);
INSERT INTO a
VALUES
    ('one'),
    ('ONE');

SELECT *
FROM a
WHERE b = 'one' COLLATE nocase;

-- b
-- ----------
-- one
-- ONE

SELECT *
FROM a
WHERE b IN ('one') COLLATE nocase;

-- b
-- ----------
-- one

This appear to be an oversight in (hopefully) the implementation or the documentation.

(2) By Richard Hipp (drh) on 2020-05-20 16:22:38 in reply to 1.1 [link] [source]

In the syntax diagram for expressions, we see that the COLLATE operator is applied to an "expr". The COLLATE operator has high precedence, so it binds more tightly that most everything else. For the phrase:

    b IN ('one') COLLATE nocase

The "('one')" part is not an expression. That subphrase is part of the IN expression. Hence, the expression must be parsed thusly:

    (b IN ('one')) COLLATE nocase

So, in other words, the COLLATE operator is being applied to the result of the IN operator, not to the operands of the IN operator.

Contrast the above with the equality constraint:

    b == 'one' COLLATE nocase

Here, the "'one'" part is an expression, and so the COLLATE operator is able to bind to it. And since COLLATE has higher precedence than ==, we get the following parse:

    b == ('one' COLLATE nocase);

So here, the COLLATE operator applies to the right operand of the == operator and thus affects the operation of the == operator.

If you want to apply a collation to an IN operator, you need to put the COLLATE operator on the left operand, which is an expression. Hence, you probably want this:

    b COLLATE nocase IN ('one')

Which parses out as:

    (b COLLATE nocase) IN ('one')

(3) By Mark Lawrence (mark) on 2020-05-20 16:49:16 in reply to 2 [source]

Thanks Richard. That makes it clear why it works (or doesn't) the way it does.

New syntax I learned today:

    sqlite> select 'b' collate nocase = 'B';
    'b' collate nocase = 'B'
    ------------------------
    1

This positioning of collate is new to me... Can I suggest an additional sentence in datatype3.html:

    The collating sequence used for expressions of the form
    "x IN (y, z, ...)" is the collating sequence of x. If an explicit
    collating sequence is required it should be applied to the left hand
    side of the expression: "x COLLATE nocase IN (y,z, ...)"