Two flaws in the design of SQLite, which should be easily to be fixed

(1) By anonymous on 2022-06-26 21:58:11 [source]

The inability to do time zone conversions in the VFS. (It does not make sense that the current time is a part of the VFS but time zones isn't.)
The inability to properly deal with non-Unicode text (it can be done, but whichever way you do it, some feature will fail to work well).

Although you can store non-Unicode text, there are problems with doing so; depending on which way you do it, you will have different problems.

If you use blobs instead of texts:

You cannot write string literals except as hex or using CAST, both of which are messy.
Collations will not work, since collation is not applicable to blobs.
You will have to add functions to concatenate blobs and make single byte blobs, since they are not included.
LIKE will not work properly.
There might not be a null-terminator. (This is potentially significant even if the data is not any kind of text; it is often useful to have a null-terminator in other cases too.)

If you do use texts, which store non-Unicode text, and you will have to rewrite the built-in text-dealing functions:

This adds a lot of messy code to the program, to do something which ought to be simple, becomes very complicated.
The LENGTH and LIKE optimizations will not work.
Changing the case_sensitive_like option will cause it to incorrectly assume Unicode text.
Some functions will fail to work properly if loaded by a program which does not expect this (especially if such functions are used in the definition of a index).

It is also possible to use "false UTF-8", but this has its own problems; it also makes the program very messy. Another alternative is to convert to/from Unicode, but this is not always possible to do losslessly (or at all, sometimes), and also makes the program much more messy. Either of these will also result in the text taking up twice as much disk space and RAM as it ought to do.

Another problem is that the SQLite command shell does not properly check the locale. The locale will not necessarily be Unicode (even on Linux; I use GNU/Linux but have a non-Unicode locale set, and I will not change it).

My proposal for non-Unicode text is:

Value 0 in offset 56 means ISO-8859-1 (although the actual text might not be ISO-8859-1, or might not even be possible to convert to/from Unicode at all), which activates non-Unicode mode.
Most things will work the same with ISO-8859-1 mode or UTF-8 mode, including collations, C API functions, etc.
In non-Unicode mode, many of the built-in text-dealing SQL functions (including LIKE and GLOB) will work with bytes instead of code points (and the LENGTH optimization will work for TEXT values).
JSON functions always work with Unicode, regardless of the database text encoding (since JSON data is always Unicode anyways; this is one of the flaws of JSON (although JSON has several other flaws too)). The %!s format also would be always with Unicode.
Add some C functions to easily check for database text encoding. One possibility is sqlite3_db_text_encoding. You might also have a variant of sqlite3_value_type (I am not sure what to call it) which may return SQLITE_NONUNICODE_TEXT, which might simplify the implementation of some built-in functions.

Even if you do not use non-Unicode text, functions to concatenate blobs and make single byte blobs would still be useful, anyways.

(2) By anonymous on 2022-06-27 00:50:19 in reply to 1 [link] [source]

Why do you always refuse to fix these flaws?

(It does seems to assume that each character is one byte on EBCDIC-based systems (at least for LIKE; possibly not for other functions). A compile-time option to do this even on non-EBCDIC systems would be better.)

(Will it really be necessary to fork SQLite or to write an extension which is then used in all programs, just to avoid the use of Unicode (and still allow collation, etc to work)?)

And yet, the problem with time zones cannot easily be fixed by an extension (due to the special handling of date/time functions), regardless of whether or not it is using Unicode.

(My above mentions of working on non-Unicode mode might not work; it might need to be a compile time option, to make it simpler to implement. However, the working with time zones cannot be a compile time option; it will need to be a VFS method.)

(4) By KIT.james (kjames3411) on 2022-06-27 12:56:55 in reply to 2 [link] [source]

you would need to reimplement ICU (and audit it with the same quality) or trust ICU completely.

Or maybe support it as an extension? But unicode is really messy and complicated.

Also, you would have to import all the tables.

Same thing for timezones but with less data and a bit less complexity.

(6) By anonymous on 2022-06-27 20:24:51 in reply to 4 [link] [source]

you would need to reimplement ICU (and audit it with the same quality) or trust ICU completely.

Or, not use ICU. (Many kinds of programs do not need the capabilities of ICU.)

(8) By KIT.james (kjames3411) on 2022-06-28 12:44:36 in reply to 6 [link] [source]

But if you do not use ICU, then you can't have a good/complete Unicode implementation, and then you have an endless debate/struggle about "where to draw the line between what is implemented and what is not".

And that would be way worse I think. Unless everyone here is interested in making some sort of "simplified Unicode" spec where 90%+ people would be satisfied.

While still allowing for extensions for the remaining 10%- who are not. (phew!)

(10) By anonymous on 2022-06-28 20:51:44 in reply to 8 [link] [source]

I am not suggesting that no programs need ICU. Some programs might need ICU. Some programs might not need ICU; they might need something else, or they might not need something else. Some might not deal with the things that ICU deals with, so does not need it. Some programs might use non-Unicode character encodings (and converting them to/from Unicode may be inappropriate, or in some cases impossible). Some programs might not (or ought not) care about character encoding at all. (And even programs that do use Unicode, might need to use private characters, requiring overriding some properties.)

(13) By KIT.james (kjames3411) on 2022-06-29 14:58:47 in reply to 10 [link] [source]

I am not saying that either.

But, if you don't implement all of Unicode, then you only have "something that you created that looks like Unicode in some cases" and that definition depends on you.

Who draws the line between what should be included and what should not?

(3.1) By Donal Fellows (dkfellows) on 2022-06-27 10:45:37 edited from 3.0 in reply to 1 [link] [source]

The inability to do time zone conversions in the VFS. (It does not make sense that the current time is a part of the VFS but time zones isn't.)

While I'd like better handling of DATETIME columns (including an understanding at the DB level that that's what they really are) it's hardly catastrophic. Being able to have the current time as a timestamp is highly useful. Sane code will store the timestamps as zulu-based text timestamps, julian days or unix epoch seconds. None of those has the current timezone that the DB engine is running in (and you don't generally want that, as is makes handling timestamps in the region of a DST change very difficult, let alone communicating with another timezone); if you care about a particular timezone, say which one yourself.

Be aware that timezones are much more complex than they appear to be.

The inability to properly deal with non-Unicode text

Speaking of "much more complex than they appear to be", that applies also to encodings. For the sanity of all, it's best to keep a particular DB instance using a single encoding and knowing what that encoding is. It probably also ought to be UTF-8, which is compact in the subset of Unicode that corresponds to ASCII. If you transfer the DB to another system, that will enable it to continue to work with minimum fuss. Attempting to use a non-Unicode-based encoding will result in characters in code being impossible to express in the DB.

A consequence of this is that your code will need to perform the to/from Unicode transformation to use text with the DB at all. If you don't do that, the results will look... kinda random. (They'll be deterministic, but not what you expect, and the explanations will be hard to understand too.)

I'm not aware of any character in a non-Unicode encoding that cannot be expressed in Unicode (except for some stuff to do with mode switching in a couple of complex encodings, which is an entirely unnecessary concept in Unicode). One of the purposes of Unicode is to make it possible to express all characters in use in any other standardised encoding in one single encoding.

Speaking of which, you are aware that some encodings are really insanely complicated? Shift-JIS still gives me the heebie-jeebies. The EBCDIC coding tables for translating to/from Unicode are trivial, but the system you need to do the same for Shift-JIS is just nuts.

In both of the cases that you complain about, the correct fix lies in your code and not in SQLite precisely because they're very difficult to solve entirely generically, but much easier to solve within the context of your code.

And you really don't want to do BLOB concatenation in SQLite at the moment; yes, it works, but it's dreadfully inefficient.

(5) By anonymous on 2022-06-27 20:23:25 in reply to 3.1 [link] [source]

While I'd like better handling of DATETIME columns (including an understanding at the DB level that that's what they really are) it's hardly catastrophic.

Yes, it isn't catastrophic. Nevertheless, it does seem wrong to me that SQLite handles time zones in the way that it does even though other things in SQLite do not work that way.

Sane code will store the timestamps as zulu-based text timestamps, julian days or unix epoch seconds. None of those has the current timezone that the DB engine is running in (and you don't generally want that, as is makes handling timestamps in the region of a DST change very difficult, let alone communicating with another timezone)

I know that, and I do not store date/time with timezones in the database, if I can avoid it (and, usually I can avoid it, but not always). And usually my programs do not deal with time zones at all (or will do so outside of SQL codes).

However, if you want to use the date/time functions to display or convert times in the local time zone, then you might want to change the time zone in the VFS, especially if you are using a different method for determining the current date/time, too.

Furthermore, you cannot effectively override these functions due to the special handling of date/time functions in SQLite. (Although, one possible way to work around might be to use a separate function to convert time zones; another possibility would be adding something like sqlite3_context_time function. Neither seems satisfactory, though.)

Be aware that timezones are much more complex than they appear to be.

I am aware. Anyways, even if it is implemented in the VFS, for backward compatibility that function can be null to do what it currently does.

What it currently does still depends on the operating system (and this is documented); allowing it in the VFS allows to customize that just as much as other things that deal with the operating system, including the file locking, current date/time, source of random data, extension loading, etc.

But it just seems strange to me that the current time is from the VFS but the time zone conversion isn't from the VFS; that does not make sense to me.

For the sanity of all, it's best to keep a particular DB instance using a single encoding and knowing what that encoding is.

Sometimes; not always. Sometimes it is more useful to store multiple encodings, and the database engine shouldn't care; it is up to the application to deal with that and the database engine should not interfere with the application is trying to do so.

It probably also ought to be UTF-8, which is compact in the subset of Unicode that corresponds to ASCII.

Using UTF-8 is OK if you are only storing ASCII text, and often I do. (UTF-8 does have that advantage, but that does not make it perfect for everything else too.) But sometimes it is necessary to use other encodings, and the SQLite core shouldn't need to care.

If you transfer the DB to another system, that will enable it to continue to work with minimum fuss. Attempting to use a non-Unicode-based encoding will result in characters in code being impossible to express in the DB.

If you transfer it to another system, and are reading it using the same application software then presumably it should know what encoding it is, so that will not be a problem, regardless of the encoding.

You can express non-Unicode encodings in a database (and I have successfully done so), although the current implementation will have problems with that, depending which way you do it, a different set of problems will apply:

You will have to write an extension and load it with every database connection (to do something that is actually simpler than what SQLite core does!!!), and the LIKE and LENGTH optimizations will not work.
You will have to use casts everywhere to convert to/from blobs.
You will have to use hex blob literals to store text rather than using string literals.
You will have to avoid the use of SQL functions to deal with the text in the database.
You will have to convert it to Unicode even though your program does not use it, possibly wasting disk space and CPU time, and possibly resulting in inaccurate conversions.
You will have to convert it to "false UTF-8", also wasting disk space and CPU time.
You will have to make a fork of SQLite to support this feature properly. ("Not-forking" also might need maintenance due to possible interference with other changes made in the code in future.)

None of these are really satisfactory.

A consequence of this is that your code will need to perform the to/from Unicode transformation to use text with the DB at all. If you don't do that, the results will look... kinda random.

This transformation is not always really necessary, shouldn't be necessary always, and will not always work.

Even if you are storing Unicode text in the database, SQL functions such as LENGTH count code points, which is not necessarily useful (unless you are perhaps converting it to UTF-32 or to a list of code points, but you will rarely do that within SQL, except possibly with a recursive CTE). It does not count bytes, composed characters, etc. Counting bytes is often more useful; currently you will have to use CAST to do that.

Also the display is not necessarily Unicode either (and as I have explained before, in my case it often isn't, and I have good reasons for this and will not change it).

I'm not aware of any character in a non-Unicode encoding that cannot be expressed in Unicode

I am aware of many, because I have worked with character encodings. Some cannot be converted at all, some can be converted but inexact, or ambiguous, or with wrong character properties (e.g. width (which isn't purely a property of the font, despite what some people think), text direction (e.g. Hebrew and Arabic), line drawing not lining up properly, Han unification, etc). And you might have to convert them back to work with them, depending what you are doing. If the display is not Unicode, then attempting to display Unicode text on it will be wrong, too.

One of the purposes of Unicode is to make it possible to express all characters in use in any other standardised encoding in one single encoding.

Yes, but it does not work, will not work, and cannot work, despite what they say will work. No single character set will do everything; it does not matter if it is Unicode or something else; it will not work.

I know about working of Unicode and of other character encodings, so I can conclude that Unicode is no good. If I want to write the code to deal with other character encodings, you shouldn't try to stop me.

Speaking of which, you are aware that some encodings are really insanely complicated? Shift-JIS still gives me the heebie-jeebies.

Yes, I am aware. UTF-8 is a simple encoding (and as an encoding, has some advantages (which are not necessarily the things that are important for some applications, though)), but the character set is complicated, and regardless of that is not always appropriate anyways.

Shift-JIS is complicated, and I think that EUC-JP is better (if you want to write in Japanese). Still, some applications will use Shift-JIS.

However, Shift-JIS does not use the ASCII codes for apostrophe, asterisk, and line breaks, for other purposes than those ASCII characters, therefore it should be acceptable for string literals and comments in a SQL code to be encoded as Shift-JIS if that is what the author of that program desires. (This will not work for identifiers though, but that is OK; you will have to use only ASCII identifiers, and/or katakana.) The SQLite core should not have to know about Shift-JIS for this to work; it doesn't need to know what encoding it is, at all. (Shift-JIS is not a suitable encoding for C codes, although a C program can still work with external Shift-JIS data. EUC-JP is much better for C programming, and EUC-JP will work in SQL, too.)

In both of the cases that you complain about, the correct fix lies in your code and not in SQLite precisely because they're very difficult to solve entirely generically, but much easier to solve within the context of your code.

Yes, but the SQLite will need to provide the functions for the application code to be able to solve it properly. (I have explained why the existing capabilities are insufficient.)

You added math functions, JSON functions, and window functions, so why not byte-string-dealing functions? (If necessary, a compile flag can be used, and these functions can be disabled for the WASM-based implementation. They can also be disabled when the database text encoding is UTF-16, and it is acceptable for the JSON functions to return errors when trying to deal with non-Unicode text (or blobs, which is also not possible in JSON).)

And you really don't want to do BLOB concatenation in SQLite at the moment; yes, it works, but it's dreadfully inefficient.

That is too bad. (It is often useful though, whether or not you are using Unicode; you might have blobs which are not text at all.)

(7) By Donal Fellows (dkfellows) on 2022-06-28 08:51:48 in reply to 5 [link] [source]

Right now, the proposed "fix" is that you ensure that the data in the DB is effectively timezone-free and encoded in one of the supported Unicode encodings (most likely UTF-8). Like that, you'll be working with the SQLite setup that is very well supported and where a lot of people have used it successfully. If a built-in function does not do The Right Thing™ given that baseline, it makes for a reasonable thing to have a discussion about what sort of bug might be present.

But the full complexity of timezones and encodings is best kept out of the database, and belongs in your application code. You might grumble that this adds bloat to your code, but putting that support in the database engine is not getting rid of the bloat; it'd just be moving it around. If your application only actually deals with a small number of contemporaneous timezones (e.g., just the current one where the application code is running) and a small number of encodings, it can safely omit a lot of the complexity, but knowing that it is safe to do that is a very application-specific thing. Without that, you need the timezone and encoding tables and a lot of trickiness; many many programming languages have libraries available for that. The operating system might supply some of the information you'd need (or might be Windows and not hand that out; thanks, Microsoft!).

If you wish to continue the discussion, establish why the database ought to solve your application problems for you.

(9) By anonymous on 2022-06-28 20:29:36 in reply to 7 [link] [source]

ensure that the data in the DB is effectively timezone-free

This is (usually) possible, and I do this. (And some databases do not deal with date/time at all.)

encoded in one of the supported Unicode encodings

There are sometimes good reasons to not do this. What I am saying is that the SQLite core shouldn't care about the encoding; merely that it is a null-terminated sequence of bytes. It is possible to do this, but involves doing one of:

Delete many lines (and make a few minor changes to some other lines) from the SQLite amalgamation and recompile. This would mean that it must be repeated when a new version of SQLite is installed, and must be compiled separately for programs that need backward compatibility with the existing implementation (fortunately, I do not have any such programs, but this will not be true for everyone).
Copy most of func.c and delete all of the stuff dealing with Unicode, and all of the stuff that is only for internal use. This results in code duplication and prevents some internal optimizations from working. Alternatively, rewrite these functions; but that is just extra work.
Use casts everywhere in the SQL code. This will make it messy.
Store all texts as blobs instead. This also requires the use of casts everywhere and also makes it messy.
If your program has no need to manipulate the text within SQL codes, then you do not need to do anything about it, and it will work OK (as long as you do not try to open a database with UTF-16 text, but this can be avoided easily enough).

Currently, I just usually ignore it (item 5 in the list above), but in the cases where it is necessary to do something, will be something more like item 2, but this causes some optimizations to not work.

If a built-in function does not do The Right Thing™ given that baseline...

What is "The Right Thing" can depend on the program. The simplest thing is to count the number of bytes. Anyways in Unicode, "code points" and "characters" is not same thing (and the use of private characters makes this more complicated), and measurement of bytes may be useful anyways in many cases (currently a cast is required), such as to limit the amount of disk space or memory being used. If it is not Unicode, measurement in bytes is also more commonly useful, even more so than it is with Unicode.

But the full complexity of timezones and encodings is best kept out of the database, and belongs in your application code.

Agreed, but then why do the built-in date/time functions deal with time zones (in a way which can be ignored, but cannot be overridden)?

You might grumble that this adds bloat to your code, but putting that support in the database engine is not getting rid of the bloat; it'd just be moving it around.

I am not suggesting putting that support in the database engine; I am suggesting removing it from the database engine, and putting it in the VFS instead. (Unfortunately, the existing implementation already has the time zone dealing in the database engine, which it shouldn't have.)

If you wish to continue the discussion, establish why the database ought to solve your application problems for you.

I am not trying to say that the database ought to solve the application problems. I am trying to say that it shouldn't try to solve them.

But the implementation does try to solve them, and therefore I have to write a lot of extra code (much of which is duplication of work) just to avoid these (mis)features.

There are several options to enable certain extensions when compiling the amalgamation. But, is there an option to include your own extension? (I cannot find any such thing, either in the documentation or in the amalgamation code. It would seem to me that this might be useful for many people (who might want to compile SQLite once and use it for many programs, with the extension in everything that links to it including the command shell), and could be implemented in possibly six lines of code (I have an idea of how to modify it to do this, but I have not actually tried, yet), and a small amount of added documentation.)

(11) By Keith Medcalf (kmedcalf) on 2022-06-28 22:49:32 in reply to 9 [link] [source]

There are several options to enable certain extensions when compiling the amalgamation. But, is there an option to include your own extension? (I cannot find any such thing, either in the documentation or in the amalgamation code. It would seem to me that this might be useful for many people (who might want to compile SQLite once and use it for many programs, with the extension in everything that links to it including the command shell), and could be implemented in possibly six lines of code (I have an idea of how to modify it to do this, but I have not actually tried, yet), and a small amount of added documentation.)

Adding an extension so that it is "built in" (as in included in the database engine code) and then gets "activated" (auto_extension) for each connection is rather simple.

You will note in main.c that there are two optional defines which are called as part of the initialize and shutdown procedures (one define for each) so that you can specify a function to be called to "initialize" your stuff AFTER the internal initialization is complete, and a function to clean your stuff BEFORE the internal shutdown.

SQLITE_EXTRA_INIT set to function name. Function takes one integer parameter
SQLITE_EXTRA_SHUTDOWN set to function name. Function takes no parameter

I have several hundred extensions (including lots of builtin function overrides) that register thousands of UDFs, Aggregates, and Window Functions.

See http://www.dessus.com/files/sqlite3extensions.zip for examples.

(12) By anonymous on 2022-06-29 03:35:42 in reply to 11 [link] [source]

This is good to know. SQLITE_EXTRA_INIT does not seem to be documented, as far as I can tell. However, I can find where it is used in the source code. SQLITE_EXTRA_INIT takes one argument of type const char* and it is always 0. Its return value is of type int and is returned from sqlite3_initialize (and probably should be zero). This should probably be documented (possibly with warnings if necessary).

Also I had found that there is a way to override the computation of local time, although it uses the unstable sqlite3_test_control interface. It sets the private sqlite3GlobalConfig.xAltLocaltime variable, and does not work if SQLITE_UNTESTABLE is defined (although that variable still exists in such a case but will be ignored).