Database file name encoding -- is it really UTF-8?

(1) By anonymous on 2021-12-12 10:05:06

I have a question about encoding file names of the database that are passed to sqlite3_open_v2.

The documentation says it should be UTF-8. However, looking inside the code, I seem to see that on Unix the file name passed to the function is used verbatim in an OS open-file API. (On MS-Windows, I do see the file name converted to wchar_t, i.e. to UTF-16 encoding used by NTFS.)

So what should a calling program do on Unix systems in locales that use non-UTF-8 encoding for file names? Should it re-encode in UTF-8, or should it pass the file names in their locale's codeset encoding?

Thanks.

(2) By Warren Young (wyoung) on 2021-12-12 20:46:45 in reply to 1

What proof do you have that the locale affects the file names stored on disk?

Try this for me:

$ touch hello
$ ls hello | od -c

Then do the same in your native language. (e.g. "hola" for Spanish.)

Please also post the output of the locale command on your system.

(8) By anonymous on 2021-12-13 12:49:59 in reply to 2

> What proof do you have that the locale affects the file names stored on disk?

It doesn't affect the stored bytes, it affects their interpretation when displayed to users.

(3) By anonymous on 2021-12-12 22:40:40 in reply to 1

I think passing file names verbatim is more useful than trying to convert to/from UTF-8. (In the case of a file system with UTF-16 file names, converting UTF-8 to UTF-16 is useful, but otherwise, it is better to just pass them without trying to convert them.)

(4) By Richard Damon (RichardDamon) on 2021-12-13 01:29:35 in reply to 1

I think the key thing is that Unix treats a 'file name' as just an array of bytes with minimal interpretation by the kernel. I believe the only bytes that have meaning are the "NUL" byte (value 0), which terminates the name, and the value for ASCII '/', which is a path separator.
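
As a rough illustration of the point above, the sketch below checks a byte string the way that paragraph describes the kernel's view: only the NUL byte and the '/' byte are treated as special, and no character encoding is consulted. The helper name is_plain_component is hypothetical, and the sketch deliberately ignores other real-world rules such as name length limits and the special names "." and "..".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the n bytes at `name` could serve as a single
 * path component, judged only by the two bytes that have special meaning
 * (NUL terminates a name, '/' separates components). No decoding is done. */
bool is_plain_component(const unsigned char *name, size_t n)
{
    assert(name != NULL);
    if (n == 0) {
        return false;                 /* an empty component is not a name */
    }
    for (size_t i = 0; i < n; i++) {
        if (name[i] == 0x00 || name[i] == 0x2F) {
            return false;             /* NUL or '/' inside the component */
        }
    }
    return true;
}
```

Under this view the Latin-1 bytes of "blå.db" (62 6C E5 2E 64 62) are as acceptable as its UTF-8 bytes, and even the malformed sequence C0 AF passes, since neither byte is 0x00 or 0x2F.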

SHELLS (and other programs), in displaying file names, may interpret them as text in some character encoding, but that is NOT important to the kernel.

Most modern *nix systems default to using UTF-8, so that is probably why it is mentioned, but if the system tends to use some other locale, then just provide the name in that locale, and it gets passed unchanged.

Microsoft's file system generally wants file names specified as UTF-16, which is normally uncommon in platform-independent programming, so SQLite expects UTF-8 and converts to UTF-16 for you.

(5) By Harald Hanche-Olsen (hanche) on 2021-12-13 06:28:39 in reply to 4

For completeness' sake, on macOS the file system enforces UTF-8 file names. On older filesystems (HFS), it uses a decomposed normal form (so, e.g., ‘å’ becomes ‘a’ followed by a ring accent), albeit a somewhat macOS-specific version. (I am rather fuzzy on the details. The macOS native iconv program knows this encoding as UTF-8-MAC.)

On newer filesystems (APFS), UTF-8 is still enforced, but decomposed normal form is not. If you create a file using a GUI application using standard macOS libraries, the filename is created in decomposed form. But from the command line you can easily create files with names in composed form, and they will remain so on the filesystem. Further, GUI applications can open these too. The filesystem will enforce uniqueness of filenames modulo normal form, though. Say I have two files named blå.txt, one with the name in composed normal form and one in the decomposed form: They cannot coexist in the same directory. If I move one file into a directory containing the other, the other file will be replaced, even though a bytewise comparison of the filenames indicates they are different.

So here be dragons …
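
To make the blå.txt example concrete, here is a small sketch of the two byte sequences involved. It assumes the names are encoded as UTF-8; the constant and function names are made up for illustration, and the code only compares bytes, it does not perform any Unicode normalization.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* "blå.txt" with å as the single code point U+00E5 (composed form). */
static const char NAME_COMPOSED[] = "bl\xC3\xA5.txt";

/* "blå.txt" with å as 'a' followed by U+030A COMBINING RING ABOVE
 * (decomposed form). */
static const char NAME_DECOMPOSED[] = "bla\xCC\x8A.txt";

/* Plain bytewise comparison, which is all strcmp() knows about. */
bool same_bytes(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

A bytewise comparison reports the two names as different (8 bytes versus 9), which is why a program that compares names this way can disagree with a filesystem that compares them modulo normal form.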

(6) By anonymous on 2021-12-13 07:33:31 in reply to 4

The core Windows components do not validate UTF-16 strings, but treat them as opaque sequences of 16-bit code units.

For example:

  U+1F4BE FLOPPY DISK 💾 = D83D + DCBE

It's possible to create a file like this:

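  /* Use a writable array: modifying a string literal through a pointer is undefined behavior. */
  WCHAR FileName[] = L"\U0001F4BE";
  FileName[1] = L'\0';  /* cut the string after the high surrogate */
  CreateFileW(FileName, ...);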

This code clips FileName after the first code unit, and passes the resulting illegal string, now consisting of a lone leading (high) surrogate that is no longer followed by a trailing (low) surrogate, to the CreateFileW() API function. This creates a file with a garbage name, and converting the lone leading surrogate to/from UTF-8 yields U+FFFD REPLACEMENT CHARACTER, so it's not possible to open this file via a UTF-8 file name (though it would be possible via WTF-8, for example).
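
The arithmetic behind the D83D + DCBE pair, and a check for the kind of unpaired surrogate the truncation produces, can be sketched as follows. The function names are hypothetical and this is plain portable C, not a Windows API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Split a supplementary-plane code point (U+10000..U+10FFFF) into its
 * UTF-16 surrogate pair. */
void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(hi != NULL && lo != NULL);
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;               /* 20 significant bits remain */
    *hi = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *lo = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}

/* True if the n code units at `s` contain a surrogate without its partner. */
bool has_lone_surrogate(const uint16_t *s, size_t n)
{
    assert(s != NULL);
    for (size_t i = 0; i < n; i++) {
        if (is_high_surrogate(s[i])) {
            if (i + 1 < n && is_low_surrogate(s[i + 1])) {
                i++;                         /* well-formed pair, skip both */
                continue;
            }
            return true;                     /* high surrogate with no low */
        }
        if (is_low_surrogate(s[i])) {
            return true;                     /* low surrogate with no high */
        }
    }
    return false;
}
```

For U+1F4BE this yields the pair D83D, DCBE; keeping only the first unit, as the snippet above does, leaves a string that has_lone_surrogate() flags.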

Unix/Linux (and other UTF-8-oriented systems) probably have to defend themselves against the security concerns raised by overlong UTF-8 sequences, as for example overlong forms of the path separator / could escape naive path name sanity checks by applications, and forge access to restricted directories, so they probably need to treat UTF-8 strings as sequences of code points, and not just bytes?

(7) By Rowan Worth (sqweek) on 2021-12-13 09:20:00 in reply to 6

> Unix/Linux (and other UTF-8-oriented systems) probably have to defend themselves against the security concerns raised by overlong UTF-8 sequences, as for example overlong forms of the path separator /

UTF-8 mandates the shortest encoding, i.e. overlong forms are erroneous (and you will no doubt end up with one or more U+FFFD code points trying to decode them). It is also quite particular about the valid bit patterns at each byte position: based on a lone byte anywhere in a UTF-8 stream you can tell whether you are (a) looking at a code point in the 7-bit ASCII range, (b) at the start of a new code point (and also how many bytes long the code point is), or (c) in the middle of a code point.
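
A minimal sketch of a strict decoder along those lines, assuming the well-formedness rules of the Unicode standard (shortest form only, no surrogates, nothing above U+10FFFF). The function names are invented for illustration and are not taken from any particular library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequence length implied by a lead byte. Returns 0 for a continuation byte
 * (10xxxxxx) and for bytes that can never start a well-formed sequence
 * (0xC0, 0xC1, 0xF5..0xFF). */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;                  /* 7-bit ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Decode one code point from s[0..n). Returns the number of bytes consumed,
 * or 0 if the input is truncated, malformed, overlong, a surrogate, or
 * above U+10FFFF. */
size_t utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };

    assert(s != NULL && cp != NULL);
    if (n == 0) return 0;
    int len = utf8_seq_len(s[0]);
    if (len == 0 || (size_t)len > n) return 0;
    if (len == 1) { *cp = s[0]; return 1; }

    uint32_t v = s[0] & (0xFF >> (len + 1)); /* payload bits of the lead byte */
    for (int i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0; /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    if (v < min_cp[len]) return 0;            /* overlong form */
    if (v >= 0xD800 && v <= 0xDFFF) return 0; /* surrogate code point */
    if (v > 0x10FFFF) return 0;               /* beyond Unicode range */
    *cp = v;
    return (size_t)len;
}
```

With these rules both overlong spellings of '/', C0 AF and E0 80 AF, are rejected, while the single byte 2F decodes to U+002F.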

Most UTF-8 APIs in the Linux world pass strings around as byte streams, with specific functions or iterators to deal with code points only when necessary.

(10) By anonymous on 2021-12-13 13:11:58 in reply to 7

Glyphs in font directories are indexed by UTF-32, so the decoding happens all the time, even for a simple printf() statement to the terminal! Even if the kernel isn't tricked by overlong directory slashes, accepting them (as part of a wacky path component name) can be risky, because a broken glyph lookup function (with too relaxed decoding) may cause overlong slashes to look like normal ones, misleading users.

That's what I meant by "defend themselves" with more work, and not just blindly deal with bytes. Windows can do this because UTF-16 has no overlong forms.

(9) By anonymous on 2021-12-13 12:51:51 in reply to 4

I agree that this is how the library works, and I asked the question after looking at the code. My problem was with the documentation, which says UTF-8 regardless of the platform. Perhaps they assumed this will happen automatically on Posix systems (which is mostly true, but doesn't have to be so).

(11) By Simon Slavin (slavin) on 2021-12-13 13:52:44 in reply to 1

The open-file API can do its own re-encoding. You may pass it a parameter in UTF-16, but that doesn't mean it will use that form of the string when it calls the file system.

So yes, use UTF-8 as the documentation says. If you can show us that you did this and got the wrong result, then we'll take a look at it.
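
If an application holds the name in some other codeset, one possible way to re-encode it before calling sqlite3_open_v2 is POSIX iconv. The sketch below is only an illustration under that assumption: to_utf8 is a hypothetical helper, the source codeset is passed in by the caller (on POSIX systems it could come from nl_langinfo(CODESET)), and whether re-encoding is the right choice on a given Unix system is exactly what this thread debates.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: re-encode the NUL-terminated string `in`, held in
 * codeset `from` (for example "ISO-8859-1"), as UTF-8 into out[0..outsz).
 * Returns 0 on success, -1 on failure (unknown codeset, invalid input, or
 * output buffer too small). */
int to_utf8(const char *from, const char *in, char *out, size_t outsz)
{
    assert(from != NULL && in != NULL && out != NULL);
    if (outsz == 0) return -1;

    iconv_t cd = iconv_open("UTF-8", from);  /* (to, from) argument order */
    if (cd == (iconv_t)-1) return -1;

    char *src = (char *)in;          /* iconv() wants char **, input is only read */
    size_t srcleft = strlen(in);
    char *dst = out;
    size_t dstleft = outsz - 1;      /* keep one byte for the terminator */

    size_t rc = iconv(cd, &src, &srcleft, &dst, &dstleft);
    iconv_close(cd);
    if (rc == (size_t)-1) return -1;

    *dst = '\0';
    return 0;
}
```

The resulting buffer is what would then be handed to sqlite3_open_v2 as the file name. Note that on a Unix system where the name on disk is stored in the original codeset, the re-encoded bytes name a different file, which is the concern raised at the start of the thread.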