GLOB index acceleration for special characters like star and square bracket?

(1.1) By leijurv on 2023-02-06 07:41:37 edited from 1.0 [source]

sqlite> create table files(path string);
sqlite> create index files_by_path on files(path);
sqlite> explain query plan select * from files where path glob 'a*'; -- anything beginning with a: accelerated
QUERY PLAN
`--SEARCH files USING COVERING INDEX files_by_path (path>? AND path<?)
sqlite> explain query plan select * from files where path glob '[a]*'; -- anything beginning with a: not accelerated
QUERY PLAN
`--SCAN files
sqlite> explain query plan select * from files where path glob '[*]*'; -- anything beginning with *: impossible to accelerate?
QUERY PLAN
`--SCAN files
sqlite> explain query plan select * from files where path glob '[[]*'; -- anything beginning with [: impossible to accelerate?
QUERY PLAN
`--SCAN files
sqlite> explain query plan select * from files where path glob '[]]*'; -- anything beginning with ]: impossible to accelerate?
QUERY PLAN
`--SCAN files
sqlite>

I use SQLite as a backup system. One use case requires querying the contents of a directory, which I implement as WHERE path GLOB '/path/to/dir/*'. But, when the path contains a character that GLOB treats as special, like a star or a square bracket, I need to enclose it in more square brackets as shown above. While this allows me to get a correct query result, sadly, this also loses the ability for the query to be index accelerated (also as shown above).

Is there a better way to do this? How can I perform this query, accelerated by the index? Perhaps SQLite could recognize that GLOB '[x]...' is the same as GLOB 'x...' and accelerate this in a future version?

(the actual use case is here. and in reality it's more like WHERE path GLOB '/some/path/folderwith[[]brackets[]]/more/folders/*' and the problem is that the index only helps up to the first square bracket while really there are tons of files in those deeper subfolders that need to be indexed for the query to run quickly)

(2) By Gunter Hick (gunter_hick) on 2023-02-06 08:03:15 in reply to 1.1 [link] [source]

Have you checked https://sqlite.org/optoverview.html#the_like_optimization ?

Only your first example fits the conditions for this optimization.

To be considered for use, an index has to exactly match a subset of the contraints or satisfy the complete ORDER BY clause.

Each of your other examples would require an index on (path GLOB <pattern>), which is probably more indices than you are prepared to maintain.

(3) By ddevienne on 2023-02-06 08:39:51 in reply to 2 [link] [source]

Only your first example fits the conditions for this optimization.

Isn't that missing the OP's point though?

I.e. that those extra brackets are an artifact of the GLOB's syntax escaping rules,
but the optimizer currently does not know that, and thus there is a missed opportunity on that optimization.

In other words, the queries with brackets logically should be able to take advantage of the like optimization,
but they fail to do so with the current implementation. And the OP is hoping that DRH can remedy that.
(probably with like 2-10 lines of code, as he often amazingly can do!)

(4) By Gunter Hick (gunter_hick) on 2023-02-06 09:39:29 in reply to 3 [link] [source]

That depends on how many different expressions the OP needs and how quickly.

There is no check for the special "set of one" construct, nor is there code to replace the "set-of-one" with the one character it stands for. Both need to consider multibyte UTF codepoints to determine the correct distance between '[' and ']' characters.

(5) By leijurv on 2023-02-06 09:40:47 in reply to 3 [link] [source]

Yes thank you. A patch to allow a single character in square brackets to be index accelerated in a GLOB just as well as that character alone would be appreciated.

In short, I'd love WHERE path GLOB '[a]*' to work just like WHERE path GLOB 'a*', even though it doesn't today, so as to allow convenient searching for special characters like asterisk and square brackets.

Alternatively, after reading the above link, I suppose I could do a lexicographic select manually then filter it. For example WHERE path GLOB 'b*foo' becomes WHERE path > 'a' AND path < 'c' AND path GLOB 'b*foo' perhaps? But that's worse and less readable and such.

(6) By leijurv on 2023-02-06 09:43:39 in reply to 4 [link] [source]

There could be such an expression for any of the millions of directories I have backed up, sadly, so I cannot make an index for each. The use case is https://github.com/leijurv/gb/commit/e2b543ce380e865d900072cb0d6932be28040ef9

I'd love such a check :-) or an alternative (supported in existing sqlite)?

(7.1) By Gunter Hick (gunter_hick) on 2023-02-06 09:56:55 edited from 7.0 in reply to 5 [link] [source]

Actually you would want

path > 'b*' AND path < 'b+' AND path GLOB 'b[*]foo'

(8) By Gunter Hick (gunter_hick) on 2023-02-06 10:00:17 in reply to 5 [link] [source]

WHERE path    > '/some/path/folderwith[brackets]/more/folders/*'
  AND path    < '/some/path/folderwith[brackets]/more/folders/+'
  AND path GLOB '/some/path/folderwith[[]brackets[]]/more/folders/*'

(9) By ddevienne on 2023-02-06 10:10:56 in reply to 8 [link] [source]

Perhaps I'm missing something, but why would the GLOB be necessary?

You are manually doing the like-optimization for GLOB basically. So the GLOB is consumed and no longer necessary IMHO.

And the * is for GLOB, so shouldn't be in the RHS of >.
And 0 is the next char after / then, not +.

The OP's hope is for that optimization to be EQP-driven, not manual.

WHERE path > '/some/path/folderwith[brackets]/more/folders/'
  AND path < '/some/path/folderwith[brackets]/more/folders0'

(10) By Keith Medcalf (kmedcalf) on 2023-02-06 18:35:07 in reply to 9 [link] [source]

Why not just:

where path between '/some/pathfolderwith[' and '/some/pathfolderwith[' || char(1e7)

This will select all paths that startwith the sequence '/some/pathfolderwith['. char(1e7) returns the largest possible unicode codepoint as utf-8. If you really want to ensure that there is a ']/' after that then you can cull the index located candidates using a regex or glob if you like (whichever floats your boat and is available to you) since the path between will be used to constrain the candidates via an index.

where path between '/some/pathfolderwith[' and '/some/pathfolderwith[' || char(1e7)
  and path regex '.+/\[[0-9A-Za-z]*\]/.+'

(11) By anonymous on 2023-02-06 19:16:52 in reply to 10 [link] [source]

char(1e7) returns the largest possible unicode codepoint as utf-8.

Nope. The char function returns U+FFFD REPLACEMENT CHARACTER for out-of-range inputs. char(0x10FFFF) would work.

(12) By leijurv on 2023-02-06 20:13:30 in reply to 7.1 [link] [source]

I see the confusion, but no. I would want path > 'b*foo' || NULL AND path < 'b*foo' || 0xff (or something vaguely like that, where NULL means a 0 character and 0xff means the highest possible character, I think).

The whole point is that I am able to query the entire path using the index and NOT stopping at the asterisk. The current GLOB approach only uses the index "up to" the asterisk, I want to fix that.

(13) By leijurv on 2023-02-06 20:18:22 in reply to 10 [link] [source]

Why would I have to stop at the first square bracket? Perhaps couldn't I solve the problem with nothing more than WHERE path BETWEEN '/some/path/folderwith[brackets]/more/folders.' AND '/some/path/folderwith[brackets]/more/folders0'? I think this would work because . is the ASCII character right below / and 0 is the ASCII character right above it? Or am I missing something?

(14.2) By leijurv on 2023-02-08 02:18:43 edited from 14.1 in reply to 13 [link] [source]

Actually, as per https://sqlite.org/forum/forumpost/2dd98dab4bfb52d1 I have just realized that BETWEEN is inclusive not exclusive. Therefore I just need something like: WHERE path BETWEEN '/some/path/folderwith[brackets]/more/folders/' AND '/some/path/folderwith[brackets]/more/folders/' || char(0x10FFFF). It seems like this works correctly:

sqlite> create table files(path string);
sqlite> create index files_by_path on files(path);
sqlite> explain query plan SELECT * FROM files WHERE path BETWEEN 'abc/' and 'abc/' || char(0x10FFFF);
QUERY PLAN
`--SEARCH files USING COVERING INDEX files_by_path (path>? AND path<?)
sqlite> explain query plan SELECT * FROM files WHERE path BETWEEN 'abc/with[brackets]/' and 'abc/with[brackets]/' || char(0x10FFFF);
QUERY PLAN
`--SEARCH files USING COVERING INDEX files_by_path (path>? AND path<?)
sqlite> SELECT 'abc/with[brackets]/foo' BETWEEN 'abc/with[brackets]/' and 'abc/with[brackets]/' || char(0x10FFFF);
1
sqlite>

Thank you all, I think this solves it!

Edit: committed

(15) By Keith Medcalf (kmedcalf) on 2023-02-06 21:15:13 in reply to 11 [link] [source]

You are correct. However, I suspect that you only need to append x'ff'. That works with all the collations I have tried, including unicode (icu) collations (at least on Windows using the built-in Windows ICU).

(16.1) By ddevienne on 2023-02-07 08:49:25 edited from 16.0 in reply to 14.1 [link] [source]

Thank you all, I think this solves it!

WHERE path
  BETWEEN '/some/path/folderwith[brackets]/more/folders/'
      AND '/some/path/folderwith[brackets]/more/folders/ || char(0x10FFFF)'

I don't see how that's better than https://sqlite.org/forum/forumpost/773c6df11c, to be honest.

Strictly less than the next char that replaces the tail char (i.e. exclusive), versus Equal-or-less than the highest-possible char appended to the tail (i.e. inclusive), are functionally equivalent. (you just have to check one more char!)

It's also more verbose, and assume prior knowledge of inclusivity of between and of unicode limits, while op> and op< are mostly obvious to all.

Also the SQL is closer to the plan, with the ordering operators. But to each his/her own I guess :)

PS: On 2nd thought, the one advantage of the above between where-clause is that you can bind the same value, using between ?1 and ?1 || char(0x10FFFF).

(17.1) By Keith Medcalf (kmedcalf) on 2023-02-07 17:30:06 edited from 17.0 in reply to 16.1 [link] [source]

As an interesting note, you should probably append x'ff' (or x'ffff' or even x'f48fbfbf') rather than char(0x10ffff).

Although char(0x10ffff) should be constant (the function declaration declares it as constant) the construct:

where x between 'something' and 'something' || char(0x10ffff)

currently executes the 'something' || char(0x10ffff) for each row (though it would only do the calculation once if it were being used to constrain an index).

where x between 'something' and 'something' || x'ff'

computes 'something' || x'ff' as a constant in the statement prolog.

Correction -- it does put a ONCE construct around the computation so ignore this

(18) By leijurv on 2023-02-09 03:47:29 in reply to 16.1 [link] [source]

My thought is that to do it with op< and op> I would need to do something like: WHERE path > ?1 AND path < ?2, and pass in two parameters, one being the path and the other being the path with the last slash replaced with a period. That feels awkward to me, as I'd have to write extra code in my Go program to handle this in every case where I want to query the contents of a directory. WHERE path BETWEEN ?1 AND ?1 || CHAR(0x10FFFF) allows me to pass in the directory unmodified.

(19) By ddevienne on 2023-02-09 08:44:06 in reply to 18 [link] [source]

Yes, that's my PS edit was about (which you may have missed).

But if your path always ends with a slash, you can also trim it in code and do:

WHERE path BETWEEN ?1 || '/' AND ?1 || '0'

PS: Not a period, but a 0. The next char after / is 0, not ,.