SQLite Forum

Possible regression with JSON handling
Login

Possible regression with JSON handling

(1.2) By beetlejuice (coleifer) on 2022-07-05 16:26:44 edited from 1.1 [link] [source]

I'm the maintainer of a python library and noticed a failure in a testcase when updating to 3.39 around JSON handling. I have an extension module that leverages the JSON type to implement a simple change-tracking using triggers. The symptom I'm noticing is that it looks like the JSON is being double-encoded (when previously, in 3.38.5, all is working correctly).

Repro:

-- table we are going to modify
CREATE TABLE IF NOT EXISTS "ct2" (
    "id" INTEGER NOT NULL PRIMARY KEY, "data" JSON NOT NULL);
-- changelog table
CREATE TABLE IF NOT EXISTS "changelog" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "timestamp" DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    "action" TEXT NOT NULL,
    "table" TEXT NOT NULL,
    "primary_key" INTEGER NOT NULL, 
    "changes" JSON NOT NULL);

-- post-insert trigger
CREATE TRIGGER IF NOT EXISTS ct2_changes_INSERT
    AFTER INSERT ON ct2
    BEGIN
        INSERT INTO changelog
            ("action", "table", "primary_key", "changes")
        SELECT
            'INSERT', 'ct2', NEW."id", "changes"
        FROM (
            SELECT json_group_object(
                col, json_array("oldval", "newval")) AS "changes"
            FROM (
                SELECT json_extract(value, '$[0]') as "col",
                       json_extract(value, '$[1]') as "oldval",
                       json_extract(value, '$[2]') as "newval"
                FROM json_each(json_array(json_array('data', NULL, json(NEW."data"))))
                WHERE "oldval" IS NOT "newval"
            )
        );
    END;

-- Insert a new row into the table being tracked.
INSERT INTO "ct2" ("data") VALUES (json('{"k1": "v1"}'));

-- See what is stored in the changelog
SELECT * FROM "changelog";

On 3.38.5, the final SELECT returns the following:

1|2022-07-05 15:28:29|INSERT|ct2|1|{"data":[null,{"k1":"v1"}]}

On 3.39.0, the final SELECT returns the following:

1|2022-07-05 15:26:59|INSERT|ct2|1|{"data":[null,"{\"k1\":\"v1\"}"]}

Here is another example -- note the handling of a JSON string literal is not affected(!)

3.38.5:

-- insert a quoted json string
INSERT INTO ct2 (data) VALUES ('"hello"');

SELECT * FROM changelog;
2|2022-07-05 16:22:17|INSERT|ct2|2|{"data":[null,"hello"]}

3.39.0:

2|2022-07-05 16:25:09|INSERT|ct2|2|{"data":[null,"hello"]}

(2) By Dan Kennedy (dan) on 2022-07-05 18:04:43 in reply to 1.2 [link] [source]

Looks like you were affected by this change:

https://sqlite.org/src/info/bbaf1f2eb

See also this thread:

https://sqlite.org/forum/forumpost/3d9caa45cbe38c78

Dan.

(3.2) By beetlejuice (coleifer) on 2022-07-05 18:27:31 edited from 3.1 in reply to 2 [link] [source]

Yes, I thought that looked like the commit as well. Perhaps drh can chime in when he has a moment to take a look, whether he feels that some fix is warranted?

The difference in handling of the json value versus the quoted JSON string ('"hello"'), to me, indicates some internal inconsistency.

Alternatively, it might make sense to allow the json() function to accept a string and return it quoted, e.g. json('hello') -> '"hello"' (since it already handles NULL and numeric values). Then I could theoretically place my call to the json() higher-up in the query since the subtype is not being propagated.

(4.1) By beetlejuice (coleifer) on 2022-07-06 13:30:30 edited from 4.0 in reply to 2 [link] [source]

Dan / DRH, I've come up with a slightly smaller reproduction that (I feel) demonstrates the issue. I know your team is small and very busy, so I appreciate you taking the time on this post.

Here I'm using "json" stuff from top-to-bottom:

select json_array(d) from (
    select json_extract(value, '$') as d
    from json_each(json_array('{"k1": "v1"}'))
);

-- 3.39.0 gives:
["{\"k1\":\"v1\"}"]


-- 3.38.5 gives:
[{"k1":"v1"}]

Wrapping the json_extract() with json(json_extract(...)) doesn't have any effect, of course.

Based on the forum post and patch linked, it seems the suggested workflow is to wrap with json() at the outermost level, but this doesn't sit quite right to me:

select json_array(json(d)) from (
    select json_extract(value, '$') as d
    from json_each(json_array('{"k1": "v1"}'))
);

What if, at the outermost layer, I might not necessarily have JSON but might have a string literal?

This is valid:

sqlite> select json_array('foo');
["foo"]

This is not:

sqlite> select json_array(json('foo'));
Runtime error: malformed JSON

Simplifying even further,

-- 3.39.0
sqlite> select json_array(d) from (select json('{"k1":"v1"}') as d);
["{\"k1\":\"v1\"}"]

-- 3.38.5
sqlite> select json_array(d) from (select json('{"k1":"v1"}') as d);
[{"k1":"v1"}]

(5) By Richard Hipp (drh) on 2022-07-07 15:28:32 in reply to 3.2 [source]

I consider the change at https://sqlite.org/src/info/bbaf1f2eb to be a bug fix. The current behavior is correct. The 3.38 behavior was incorrect.

The issue is that content coming out of a subquery or view can only be one of TEXT, INTEGER, REAL, or NULL due to limitations of the SQLite file format. We can (and do) add new data types like JSON for intermediate results of sub-expressions within a single query. But, unfortunately, the file format does not permit saving those datatypes and so JSON values must be converted into TEXT for storage.

Subqueries and views are not (necessarily) written to storage using the SQLite file format. But, the query should behave the same using a subquery or view as it does for a ordinary on-disk table. That is the bug that was fixed. Prior to the bug fix, subqueries and views outputs might preserve the special JSON meaning. Or they might not, depending on what query algorithm the query planner selected. In other words, you might get inconsistent answers depending on what indexes are defined or what optimizations are enabled or disabled.

This is the example SQLite statement supplied by the OP:

select json_array(d) from (
    select json_extract(value, '$') as d
    from json_each(json_array('{"k1": "v1"}'))
);

In the outer query, the value of column "d" should be TEXT, not JSON. Hence the correct answer is:

'["{\"k1\":\"v1\"}"]'

If you want the value "d" to be interpreted as JSON, then make the query this:

select json_array(json(d)) from (
    select json_extract(value, '$') as d
    from json_each(json_array('{"k1": "v1"}'))
);

Then the result will be:

'[{"k1":"v1"}]'

If you want the json_array() in the outer query to interpret its input as JSON if it looks like JSON, and TEXT if it does not look like JSON, then you can code it this way:

select json_array(CASE WHEN json_valid(d) THEN json(d) ELSE d END) from (
    select json_extract(value, '$') as d
    from json_each(json_array('{"k1": "v1"}'))
    union all
    SELECT 'xyzzy'
);

(6) By beetlejuice (coleifer) on 2022-07-07 19:30:05 in reply to 5 [link] [source]

Thank you for the background explanation of the underlying issue that was fixed, as well as providing a solution to my more immediate problem. Combining CASE with json_valid had not occurred to me, and is working well. Again, thank you for your time and effort.