Select only UTF8 / Unicode characters

(1) By Trudge on 2022-06-22 15:24:51 [link] [source]

Running macOS Monterey on an M1. I'm trying to find a way to select filenames with ONLY UTF8 or Unicode characters in them. I have several musicians with accented names and want to select just those artist names. For example, I have

Amon Düül II
Esbjörn Svensson Trio
Zoltán Lantos' Mirrorworld

and running sqlite: -- Loading resources from /Users/trudge/.sqliterc SQLite version 3.38.2 2022-03-26 13:51:10

How can I select only those (and any other) artist names that contain UTF8 or Unicode characters? I understand this may be a complex situation, but I'm sure it is a common one.

(2) By Stephan Beal (stephan) on 2022-06-22 15:35:13 in reply to 1 [link] [source]

I understand this may be a complex situation, but I'm sure it is a common one.

Though i strongly beg to differ on the second point, seeing as every non-English language uses non-English characters and almost never needs to differentiate between "English-only" and "foreign" characters, here's one solution:

sqlite> create table t(a); insert into t(a) values('aaa'),('aäa'),('bbb'),('bäb');
sqlite> select * from t where length(cast(a as text)) <> length(cast(a as blob));
aäa
bäb

Why that works is described at:

https://sqlite.org/lang_corefunc.html#length

(3) By Trudge on 2022-06-22 15:56:23 in reply to 2 [link] [source]

Wow! Thank you so much for the explanation and link. Now I can see:

Ali Farka Touré                            
Ali Farka Touré & Toumani Diabaté           
Amon Düül II                              
Esbjörn Svensson Trio                      
Leszek Możdżer                            
Leszek Możdżer & Friends                  
Lucky Wüthrich                             
Mikael Ögren & Johan Agebjörn             
Nils Wülker                                
Saint-Saëns                                 
Stanley Clarke,Biréli Lagrène,Jean-Luc Ponty
Stéphane Grappelli                          
Stéphane Grappelli & Michel Petrucciani     
Stéphane Grappelli, Joe Pass, Niels Pedersen
Stéphane Kerecki Quartet                    
Till Brönner & Bob James                    
Zoltán Lantos' Mirrorworld

(5) By Harald Hanche-Olsen (hanche) on 2022-06-22 16:07:15 in reply to 3 [link] [source]

It seems that we share a taste in music.

But I can't quite resist the temptation to “be that guy” and point out that ASCII is a subset of UTF8 and Unicode, so your request as stated should select every text. It's a good thing that you clarified your request with an example.

(6) By Larry Brasfield (larrybr) on 2022-06-22 16:20:08 in reply to 5 [source]

Now that temptation prevails, and overlooking that (quite valid) subset point, I would also note that
"filenames with ONLY UTF8 or Unicode characters in them"
is not the same as
"ONLY filenames with UTF8 or Unicode characters in them".

Stephan's solution is good for the latter criterion, but not the former.

(7) By Stephan Beal (stephan) on 2022-06-22 16:46:09 in reply to 6 [link] [source]

Stephan's solution is good for the latter criterion, but not the former.

Challenge accepted ;).

sqlite> create table t(a); insert into t(a) values('aaa'),('aäa'),('bbb'),('bäb'),('äää'),('äöü');
sqlite> select * from t where length(cast(a as blob)) >= 2 * length(cast(a as text));
äää
äöü

With the caveats that (A) UTF-8 only and (B) Long Live UTF-8.

Obviously, artist's names are likely to contain ASCII spaces in them, but handling those is left as an Exercise for Someone Who Cares ;).

(8) By Larry Brasfield (larrybr) on 2022-06-22 16:56:04 in reply to 7 [link] [source]

I (almost) hate to break this to you, but ...

That 2*length() trick only works if UTF-8 encodings consisting of more than 2 octets are excluded from the input. Otherwise, a 1-octet and 3-octet character pair would masquerade as a pair of 2-octet characters.

handling those is left as an Exercise for Someone Who Cares

An excellent solution!

(9) By Stephan Beal (stephan) on 2022-06-22 17:12:03 in reply to 8 [link] [source]

Otherwise, a 1-octet and 3-octet character pair would masquerade as a pair of 2-octet characters.

Doh, indeed! Okay, that's the limit of my SQL-fu aspirations for today, but if someone is interested pursuing, it could potentially be solved by using a WITH which substr()'s the input into individual characters and evals to true if all of the characters have a byte length() of 2+.

(10) By jchd (jchd18) on 2022-06-22 19:37:28 in reply to 9 [link] [source]

Another more efficient way would be to use the regex() extension (mine uses an implementation of PCRE).

As an example, create two DBs, one using UTF8 and the other UTF16.

CREATE TABLE t (s CHAR);
INSERT INTO t VALUES('abc');
INSERT INTO t VALUES('äbc');
INSERT INTO t VALUES('1€');
INSERT INTO t VALUES('👨🏻‍👩🏿‍👦🏽');

The last row contains a Unicode string of 3 people having different Fitzpatrick modifiers linked by ZERO WIDTH JOINERs, making the string a familly displayed as a single glyph.

Consider using this query:

select case length(cast('a' as blob)) when 1 then 'UTF8' else 'UTF16' end UTF, s String, length(cast(s as char)) "#Chars", length(cast(s as blob)) "#Bytes", regexp('[^\x00-\x7f]', s) nonASCII, hex(cast(s as blob)) Hex from t;

The result are:

UTF	String	#Chars	#Bytes	nonASCII	Hex
UTF8	abc	3	3	0	616263
UTF8	äbc	3	4	1	C3A46263
UTF8	1€	2	4	1	31E282AC
UTF8	👨🏻‍👩🏿‍👦🏽	8	30	1	F09F91A8F09F8FBBE2808DF09F91A9F09F8FBFE2808DF09F91A6F09F8FBD

UTF	String	#Chars	#Bytes	nonASCII	Hex
UTF16	abc	3	6	0	610062006300
UTF16	äbc	3	6	1	E40062006300
UTF16	1€	2	4	1	3100AC20
UTF16	👨🏻‍👩🏿‍👦🏽	8	28	1	3DD868DC3CD8FBDF0D203DD869DC3CD8FFDF0D203DD866DC3CD8FDDF

(13) By Trudge on 2022-06-22 21:05:49 in reply to 6 [link] [source]

Yes, you are correct. And that is actually what I should have said. Either way, Stephan's solution worked for me. What a great resource this forum is. It seems I may have stirred up the pot a bit.

(4.1) By jchd (jchd18) on 2022-06-22 16:06:06 edited from 4.0 in reply to 2 [link] [source]

This only works for UTF8-encoded databases.

With UTF16 byte length is always at least twice as much as char length. For these, only way I see is an extension function. For instance I have one with PCRE regexes, making it easy (but slow) to check.

Adding an extra column to store a Non_ASCII (> 0x7F) flag at inset/update time with app code would be a good solution too.

(11) By Keith Medcalf (kmedcalf) on 2022-06-22 20:15:36 in reply to 1 [link] [source]

I would use an extension for handling unicode to do this. Example, unifuzz, has a UDF called UNACCENT, which removes accents (that is, folds to ascii):

create table x(x text);
insert into x values
('Ali Farka Touré'),
('Ali Farka Touré & Toumani Diabaté'),
('Amon Düül II'),
('Esbjörn Svensson Trio'),
('Leszek Możdżer'),
('Leszek Możdżer & Friends'),
('Lucky Wüthrich'),
('Mikael Ögren & Johan Agebjörn'),
('Nils Wülker'),
('Saint-Saëns'),
('Stanley Clarke,Biréli Lagrène,Jean-Luc Ponty'),
('Stéphane Grappelli'),
('Stéphane Grappelli & Michel Petrucciani'),
('Stéphane Grappelli, Joe Pass, Niels Pedersen'),
('Stéphane Kerecki Quartet'),
('Till Brönner & Bob James'),
('Zoltán Lantos'' Mirrorworld'),
('Unaccented Example');
select x, x == unaccent(x) as NoAccent from x;
┌────────────────────────────────────────────────┬──────────┐
│                       x                        │ NoAccent │
├────────────────────────────────────────────────┼──────────┤
│ 'Ali Farka Touré'                              │ 0        │
│ 'Ali Farka Touré & Toumani Diabaté'            │ 0        │
│ 'Amon Düül II'                                 │ 0        │
│ 'Esbjörn Svensson Trio'                        │ 0        │
│ 'Leszek Możdżer'                               │ 0        │
│ 'Leszek Możdżer & Friends'                     │ 0        │
│ 'Lucky Wüthrich'                               │ 0        │
│ 'Mikael Ögren & Johan Agebjörn'                │ 0        │
│ 'Nils Wülker'                                  │ 0        │
│ 'Saint-Saëns'                                  │ 0        │
│ 'Stanley Clarke,Biréli Lagrène,Jean-Luc Ponty' │ 0        │
│ 'Stéphane Grappelli'                           │ 0        │
│ 'Stéphane Grappelli & Michel Petrucciani'      │ 0        │
│ 'Stéphane Grappelli, Joe Pass, Niels Pedersen' │ 0        │
│ 'Stéphane Kerecki Quartet'                     │ 0        │
│ 'Till Brönner & Bob James'                     │ 0        │
│ 'Zoltán Lantos'' Mirrorworld'                  │ 0        │
│ 'Unaccented Example'                           │ 1        │
└────────────────────────────────────────────────┴──────────┘

(12) By jchd (jchd18) on 2022-06-22 20:25:08 in reply to 11 [link] [source]

Keith,

It's fine that you appreciate and mention my Unifuzz extension, but UnAccent does only act on Unicode codepoints having the attribute "Letter". Even more: it relies on an oldish version of the Unicode standard (5.1 if memory serves).

While artist names are unlikely to contain non-letters, but for dealing with the general case something more strict can be required, like the regex() solution I mentionned above.