Non-breaking space character breaks regexp

(1) By bslsql on 2024-03-06 00:28:40 [link] [source]

An upstream application, over which I have no control, is insreting the non-breaking space character (u00A0) into data that is imported into various tables of my database. I can find no way to find these characters using the regexp operator or string comparison with the affected fields.

Previously this upstream application used the space character.

How can I find occurrences of NBSP? If I can do that then I can replace NBSP with the expected space character.

Note I am using the CLI ratehr than hackung embedded code in to some programming language.

(2) By Bo Lindbergh (_blgl_) on 2024-03-06 00:52:49 in reply to 1 [link] [source]

Both regexp and glob find non-breaking spaces just fine for me.

sqlite> create table t (id integer primary key, val text not null);
sqlite> insert into t (val) values ('good space');
sqlite> insert into t (val) values ('bad'||char(0xA0)||'space');
sqlite> select * from t where val regexp char(0xA0);
+----+-----------+
| id |    val    |
+----+-----------+
| 2  | bad space |
+----+-----------+
sqlite> select * from t where val glob '*'||char(0xA0)||'*';
+----+-----------+
| id |    val    |
+----+-----------+
| 2  | bad space |
+----+-----------+
sqlite> update t set val = replace(val, char(0xA0), ' ') where val glob '*'||char(0xA0)||'*';
sqlite> select count(*) from t where val glob '*'||char(0xA0)||'*';
+----------+
| count(*) |
+----------+
| 0        |
+----------+

(3) By bslsql on 2024-03-06 14:42:32 in reply to 2 [link] [source]

Your minimal test case works however my test case does not.

I have data of the form 'SL1 1NJ', which happens to be the UK postcode for Windsor Castle, the problem is that what appears to be a space is randomly 0x40 or 0xA0 namely either a space or an NBSP. A regexp on 'SL1 1NJ' will fail for those with the other character.

The originating organisation duplicates the postcode in two spearate fields but the use either of these space characters is not consistent in each field. (Why two fields? NO idea but their entry staff have made mistakes with similarly duplicated fields being entered incorrectly so I retain them all for data quality checks.)

A constructed regexp of 'SL1('		char(0xA0)		'	)1JN' does not retrieve anything.

By the way there are hundreds of distinct postcodes in my database and I need to be able to pull out records from multiple of them at a time essentially for delivery rounds. And they are also not of the simplistic form of '[:letter:]{2}[:digit:]{1} [:digit:][:letter:]{2}'. The first component my have single letters or multiple digits.

(4.1) By Stephan Beal (stephan) on 2024-03-06 14:57:36 edited from 4.0 in reply to 3 [link] [source]

'SL1(' || char(0xA0) || '| )1JN'

0xA0 by itself is not a valid UTF-8 character (but it is in UTF-16). The UTF-8 encoding of an nbsp is 0xC2 0xA0. How you might represent that in SQL, other than by using the literal UTF-8 character in your SQL, is unfortunately beyond me but perhaps that clue will point you in a productive direction.

Edit: OTOH, the docs for char() seem to suggest that each argument is a Unicode Code Point, and 0xA0 is the Unicode Code Point for nbsp.

(5) By Gunter Hick (gunter_hick) on 2024-03-06 15:13:49 in reply to 3 [link] [source]

You need to SHOW YOUR WORK. That means ACTUAL INPUT and ACTUAL OUTPUT. Not a description of what you think you did. For example, you are not even showing how you are doing regexp processing. Note that LIKE uses '%' and '_' wildcards, while GLOB supports unix file globbing syntax, including the [] character class operator.

The previous poster has already shown you HOW to SANITZE YOUR DATA.

(6) By Larry Brasfield (larrybr) on 2024-03-06 15:32:59 in reply to 4.1 [link] [source]

The UTF-8 encoding of an nbsp is 0xC2 0xA0. How you might represent that in SQL, other than by using the literal UTF-8 character in your SQL ...

I have often used something like this:
cast(x'C2A0' as text)
, which is clear in the source and undergoes no additional translation by the SQLite library. The blob is merely interpreted as UTF-8 text when
PRAGMA encoding = 'UTF-8';
is in effect.

(7) By Tim Streater (Clothears) on 2024-03-06 15:49:45 in reply to 5 [link] [source]

> The previous poster has already shown you HOW to SANITZE YOUR DATA.

Yes. On writing the postcodes to the database, convert any run of char(0xA0) and spaces to a single space.

(8) By jchd (jchd18) on 2024-03-06 16:04:36 in reply to 4.1 [link] [source]

Yes, select hex(char(0xA0)) returns C2A0.

'SL1(' || char(0xA0) || '| )1JN'

contains extra characters. Maybe try

'SL1' || char(0xA0) || '1JN'

(9.1) By bslsql on 2024-03-06 16:57:28 edited from 9.0 in reply to 5 [source]

Sample CSV of the data

CompanyName,CompanyOPCSRef,DistrictRef,StreetName,PropertyAddress1,PropertyAddress2,PropertyAddress3,PropertyAddress4,PropertyAddress5,PropertyPostCode,PropertyAddressPrefix,PropertyAddressNumber,ElectorNumber,ElectorForename,ElectorSurname,ElectorDOB,ElectorCreatedMonth,ElectorChangedMonth,ElectorDeletedMonth,MarkersRegisterText,RecordCount,RecordTotal,CurrentRegisterMonth Windsor and Maidenhead Borough Council,00ME,WEGP,Windsor Great Park,Frogmore Cottage,Windsor Great Park,Windsor,Berkshire,SL4 1NJ,SL4 1NJ,Frogmore Cottage,,WEGP 10000/1,Henry Charles Albert David,Windsor,,1,0,0,L,1,2,January Windsor and Maidenhead Borough Council,00ME,WEGP,Windsor Great Park,Frogmore Cottage,Windsor Great Park,Windsor,Berkshire,SL4 1NJ,SL4 1NJ,Frogmore Cottage,,WEGP 10000/2,Rachel Megan,Merkle-Windsor,,1,0,0,G,2,2,January Windsor and Maidenhead Borough Council,00ME,WEGP,,,,,,,,,,WEGP 60000,Andrew,Windsor,,1,0,0,F,1,1,January Westminster City Council,00BK,FFS,Broadcasting House,Room 101,Broadcasting House,2-22 Portland Place, London,W1A 1AD,W1A 1AA,Room 101,,FFS 1948,Eric A,Blair,,2,0,0,Z,1,1,February Westminster City Council,00BK,FFS,Broadcasting House,Room 101,Broadcasting House,2-22 Portland Place, London,W1A 1AD,W1A 1AA,Room 101,,FFS 1948,George,Orwell,,0,2,0,Z,1,1,February

sqlite> .import --csv "Royal Dummy.csv" BlanketyBlank sqlite> .schema blanketyblank

CREATE TABLE IF NOT EXISTS "BlanketyBlank"( "CompanyName" TEXT, "CompanyOPCSRef" TEXT, "DistrictRef" TEXT, "StreetName" TEXT, "PropertyAddress1" TEXT, "PropertyAddress2" TEXT, "PropertyAddress3" TEXT, "PropertyAddress4" TEXT, "PropertyAddress5" TEXT, "PropertyPostCode" TEXT, "PropertyAddressPrefix" TEXT, "PropertyAddressNumber" TEXT, "ElectorNumber" TEXT, "ElectorForename" TEXT, "ElectorSurname" TEXT, "ElectorDOB" TEXT, "ElectorCreatedMonth" TEXT, "ElectorChangedMonth" TEXT, "ElectorDeletedMonth" TEXT, "MarkersRegisterText" TEXT, "RecordCount" TEXT, "RecordTotal" TEXT, "CurrentRegisterMonth" TEXT);

Now the test

sqlite> select * from blanketyblank where propertypostcode regexp 'SL4 1NJ'; CompanyName|CompanyOPCSRef|DistrictRef|StreetName|PropertyAddress1|PropertyAddress2|PropertyAddress3|PropertyAddress4|PropertyAddress5|PropertyPostCode|PropertyAddressPrefix|PropertyAddressNumber|ElectorNumber|ElectorForename|ElectorSurname|ElectorDOB|ElectorCreatedMonth|ElectorChangedMonth|ElectorDeletedMonth|MarkersRegisterText|RecordCount|RecordTotal|CurrentRegisterMonth Windsor and Maidenhead Borough Council|00ME|WEGP|Windsor Great Park|Frogmore Cottage|Windsor Great Park|Windsor|Berkshire|SL4 1NJ|SL4 1NJ|Frogmore Cottage||WEGP 10000/1|Henry Charles Albert David|Windsor||1|0|0|L|1|2|January

Note there is only ONE record but there should be two.

Test it the other way

sqlite> select * from blanketyblank where propertyaddress5 regexp 'SL4 1NJ'; CompanyName|CompanyOPCSRef|DistrictRef|StreetName|PropertyAddress1|PropertyAddress2|PropertyAddress3|PropertyAddress4|PropertyAddress5|PropertyPostCode|PropertyAddressPrefix|PropertyAddressNumber|ElectorNumber|ElectorForename|ElectorSurname|ElectorDOB|ElectorCreatedMonth|ElectorChangedMonth|ElectorDeletedMonth|MarkersRegisterText|RecordCount|RecordTotal|CurrentRegisterMonth Windsor and Maidenhead Borough Council|00ME|WEGP|Windsor Great Park|Frogmore Cottage|Windsor Great Park|Windsor|Berkshire|SL4 1NJ|SL4 1NJ|Frogmore Cottage||WEGP 10000/2|Rachel Megan|Merkle-Windsor||1|0|0|G|2|2|January

Again only ONE record when there should be two.

These two records have one field with NBSP and the other with space.

By the way a compatriot posted the same issue to the DuckDB discord and their developers acknowledge the difference between NBSP and space as a bug.

(10) By Bo Lindbergh (_blgl_) on 2024-03-06 18:42:51 in reply to 9.1 [link] [source]

The sample text you've pasted contains no occurrences of U+000A NO-BREAK SPACE. However, there are a few occurrences of U+202F NARROW NO-BREAK SPACE.

These are different characters; searching for one will not find the other.