SQLite Forum

Efficiently checking if a TEXT column exclusively contains integers or floating point numbers (as strings)
Login

Efficiently checking if a TEXT column exclusively contains integers or floating point numbers (as strings)

(1) By Simon Willison (simonw) on 2020-09-26 17:43:07 [link] [source]

I often import data from CSV files (usually from government open data portal websites).

This results in a table where every column is of type TEXT - but I'd like to convert columns that appear to exclusively contain integers or floating models to INTEGER or REAL column.

So I'm writing code to detect the suggested type for a column, and I want to do it as efficiently as possible/

I've worked out the following recipe for detecting a text column which is full of integers:

select
  'contains_non_integer' as result
from
  mytable
where
  cast(cast(mycolumn AS INTEGER) AS TEXT) != mycolumn
limit
  1

This query will scan through the table in (I think) primary key order and will terminate and return a single row with a single result in it the moment it runs across the first row that contains a text value that doesn't represent an integer.

If the column does contain exclusively integers, it will scan the entire column and eventually return zero rows at the end.

So two questions:

  1. Is this the most efficient way to do this? Is there another query I can use that will be more effective?
  2. What's an equivalent recipe for floating point numbers? I've not yet figured one out.

(2) By Keith Medcalf (kmedcalf) on 2020-09-26 19:54:13 in reply to 1 [link] [source]

SQLite3 does not have "column types" -- it has column affinities. Each "value" has a storage type and that storage type "prefers" to be the type of that columns "affinity", so a column affinity of "text" will always result in everything being converted to "text" storage for that column (on storage/insert) unless it is a binding that cannot be converted to text (such as a blob). The csv importer (either the one in the CLI or the CSV extension) does not perform any conversions and the result is always returned as "text" even if it is not.

https://sqlite.org/datatype3.html

The "correct" way to force the conversion is to create another table with the same number of columns where the "affinity" of the columns that you think should be numeric or real or integer are declared that way, and then execute an insert into that table (INSERT INTO t2 SELECT * FROM t1). SQLite3 will "convert" the data into the specified column affinity if it can losslessly do so.

The other alternative (assuming you are using the CLI to perform the import) is to create the table you are importing into BEFORE importing the data with the correct column affinity definitions when you create the table. This will result in the data being converted from text to the declared affinity on insert if this can be done losslessly.

>type t.csv
a,b,c,d
1,"", ," "
2,3 or 4,4.5,"5"
3,"testµtest",5,
4,,6.1,7
5
6,5503599627370495,5503599627370495.0,
7,1e3,1.e3,.1e4
8,a,b,c,excess column
9
SQLite version 3.34.0 2020-09-22 02:06:09
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> create table x(a numeric, b numeric, c numeric, d numeric);
sqlite> .mode csv
sqlite> .headers on
sqlite> .import t.csv x
t.csv:6: expected 4 columns but found 1 - filling the rest with NULL
t.csv:9: expected 4 columns but found 5 - extras ignored
t.csv:10: expected 4 columns but found 1 - filling the rest with NULL
sqlite> select typeof(a),a,typeof(b),b,typeof(c),c,typeof(d),d from x;
typeof(a)|a|typeof(b)|b|typeof(c)|c|typeof(d)|d
text|a|text|b|text|c|text|d
integer|1|text||text| |text|
integer|2|text|3 or 4|real|4.5|integer|5
integer|3|text|test?test|integer|5|text|
integer|4|text||real|6.1|integer|7
integer|5|null||null||null|
integer|6|integer|5503599627370495|integer|5503599627370495|text|
integer|7|integer|1000|integer|1000|integer|1000
integer|8|text|a|text|b|text|c
integer|9|null||null||null|
sqlite>

Another possible method is to use my VSV extension which is an enhanced CSV extension. It does not (anymore) set a column affinity for the data though you may do so with the schema= and columns= parameters (though this does nothing other than affect the virtual table definition if, for example, you do a "create table t2 as select * from t1" those column affinities will be carried forth into the new table). This will not have any effect on the actual returned data values, however, unless you also specify that you want "affinity conversions" applied to the source data by specifying an affinity= parameter. Specification of the full set of conversions (validatetext,affinity=numeric,nulls) will result in each "value" being returned in its bestest representation (text/blob/integer/real/null) assuming that you are using a compiler that supports LONG DOUBLE (if your compiler does not support LONG DOUBLE as an 80-bit IEEE extended double then the conversion is limited somewhat -- examples of broken compilers are all Microsoft x64 compilers).

sqlite> create virtual table x using vsv(filename=t.csv,header,nulls,validatetext,affinity=numeric);
sqlite> .schema x
CREATE VIRTUAL TABLE x using vsv(filename=t.csv,header,nulls,validatetext,affinity=numeric)
/* x(a,b,c,d) */;
sqlite> select typeof(a),a,typeof(b),b,typeof(c),c,typeof(d),d from x;
┌───────────┬───┬───────────┬──────────────────┬───────────┬──────────────────┬───────────┬──────┐
│ typeof(a) │ a │ typeof(b) │        b         │ typeof(c) │        c         │ typeof(d) │  d   │
├───────────┼───┼───────────┼──────────────────┼───────────┼──────────────────┼───────────┼──────┤
│ integer   │ 1 │ text      │                  │ text      │                  │ text      │      │
│ integer   │ 2 │ text      │ 3 or 4           │ real      │ 4.5              │ integer   │ 5    │
│ integer   │ 3 │ blob      │ test?test        │ integer   │ 5                │ null      │      │
│ integer   │ 4 │ null      │                  │ real      │ 6.1              │ integer   │ 7    │
│ integer   │ 5 │ null      │                  │ null      │                  │ null      │      │
│ integer   │ 6 │ integer   │ 5503599627370495 │ integer   │ 5503599627370495 │ null      │      │
│ integer   │ 7 │ integer   │ 1000             │ integer   │ 1000             │ integer   │ 1000 │
│ integer   │ 8 │ text      │ a                │ text      │ b                │ text      │ c    │
│ integer   │ 9 │ null      │                  │ null      │                  │ null      │      │
└───────────┴───┴───────────┴──────────────────┴───────────┴──────────────────┴───────────┴──────┘

You can find VSV.C here:
http://www.dessus.com/files/vsv.c

and the whole collection of extensions here:
http://www.dessus.com/files/sqlite3extensions.zip

(3.1) By Simon Willison (simonw) on 2020-09-27 05:02:58 edited from 3.0 in reply to 2 [link] [source]

Yup, I understand column affinities.

The problem I'm trying to solve is, given a CSV file produced by someone else entirely, how can I efficiently detect "every single value in the category column consists of a string of integer characters" - so I can automatically create a new table with the correct affinities and then execute that conversion via in insert (using the transform tool I wrote about here: https://simonwillison.net/2020/Sep/23/sqlite-advanced-alter-table/ )

So for example say I have a table called people that looks like this, where every column is of TEXT affinity:

name category score
Joan 1 3.5
Rahul 2 4.5
Amanda 1 1.1

I plan to run SQL queries that detect that the category column contains only strings that are valid integers, and the score column contains only strings that are floating point numbers - so I can then automatically create a new table with the correct affinities and copy the data across.

This query is the best I've come up with for efficiently answering the question "does the category column contain any strings that do NOT work as integers" - but I'm not sure if it's the best approach, and I'm also interested in figuring out an equivalent mechanism that deals with floating point numbers.

select
  'contains_non_integer' as result
from
  people
where
  cast(cast(category AS INTEGER) AS TEXT) != category
limit
  1

(4) By Keith Medcalf (kmedcalf) on 2020-09-27 06:22:54 in reply to 3.1 [link] [source]

You have to retrieve all the data and do the conversion/identification in your application and from that determine the applicable affinity to set in the new table, then after you create the table use INSERT INTO ... SELECT * FROM ... to actually convert the data.

(5) By anonymous on 2020-09-27 07:52:32 in reply to 1 [link] [source]

I use the ABS() function. It will return zero on anything non-numeric. The only caveat is zero.

Check like:

iif(abs(field) > 0 or field = 0,...,...)

(6) By anonymous on 2020-09-27 08:00:38 in reply to 1 [link] [source]

CORRECTION:

I use the ABS() function. It will return zero on anything non-numeric. The only caveat is zero.

Check like:

iif(abs(field) > 0 or field*1 = 0,...,...)

(7.1) By Simon Willison (simonw) on 2020-09-27 20:20:01 edited from 7.0 in reply to 4 [link] [source]

I don't want to retrieve all of the data into my programs memory if I can at all avoid it, because it could be millions of rows. That's why I'm looking for a solution that executes in SQLite as much as possible.

My integer detecting query seems to work fine in that respect, but I'm checking to see if there are even more efficient queries I can use.

(8) By Simon Willison (simonw) on 2020-09-27 20:22:44 in reply to 5 [link] [source]

Unfortunately it looks like abs() gets confused by strings that start with an integer but then contain non-integer text:

select abs('101'), abs('3.2'), abs('dog'), abs('1 dog');

Returns:

abs('101') abs('3.2') abs('dog') abs('1 dog')
101.0 3.2 0.0 1.0

https://latest.datasette.io/fixtures?sql=select+abs%28%27101%27%29%2C+abs%28%273.2%27%29%2C+abs%28%27dog%27%29%2C+abs%28%271+dog%27%29%3B

(9) By Simon Willison (simonw) on 2020-09-27 20:29:41 in reply to 1 [link] [source]

I think I've figured out the floating point equivalent:

select
  cast(cast(:value AS REAL) AS TEXT) in (:value, :value || '.0')

I tried this first:

select
  cast(cast(:value AS REAL) AS TEXT) = :value

But that failed on values like 1 because they cast turned them into 1.0.

Adding that comparison against :value || '.0' seems to catch that case nicely. Demo:

select
  value,
  cast(cast(value AS REAL) AS TEXT) in (value, value || '.0') as is_valid_float
from
  (
    select
      '1' as value
    union
    select
      '1.1' as value
    union
    select
      'dog' as value
    union
    select
      null as value
  )

https://latest.datasette.io/fixtures?sql=select%0D%0A++value%2C%0D%0A++cast%28cast%28value+AS+REAL%29+AS+TEXT%29+in+%28value%2C+value+%7C%7C+%27.0%27%29+as+is_valid_float%0D%0Afrom%0D%0A++%28%0D%0A++++select%0D%0A++++++%271%27+as+value%0D%0A++++union%0D%0A++++select%0D%0A++++++%271.1%27+as+value%0D%0A++++union%0D%0A++++select%0D%0A++++++%27dog%27+as+value%0D%0A++++union%0D%0A++++select%0D%0A++++++null+as+value%0D%0A++%29

Results:

value	is_valid_float
  	 
1 	1
1.1 	1
dog 	0

(10) By Simon Willison (simonw) on 2020-09-27 20:43:05 in reply to 9 [link] [source]

(11) By Richard Damon (RichardDamon) on 2020-09-27 20:44:37 in reply to 7.1 [source]

Don't forget that SQLite isn't some Client-Server database, but runs in your userspace. Anything that SQLite reads, HAS come into your application, it just stayed inside the subroutine call. The cost to move the results from inside SQLite to 'your app' is small, and if SQLite doesn't have something built in to do the operation, it is likely faster to read the data and decide for yourself.

(12) By Keith Medcalf (kmedcalf) on 2020-09-27 21:23:26 in reply to 3.1 [link] [source]

I basically use the equivalent of a regex to find out if a whup of text is an integer or real.

**  integer             if the field data looks like an integer,
**                      (regex "^ *(\+|-)?\d+ *$"),
**                      then an integer will be returned as
**                      provided by the compiler and platform
**                      runtime strtoll function
**                      otherwise the field will be processed as
**                      text as defined above
**
**  real                if the field data looks like a number,
**                      (regex "^ *(\+|-)?(\d+\.?\d*|\d*\.?\d+)([eE](\+|-)?\d+)? *$")
**                      then a double will be returned as
**                      provided by the compiler and platform
**                      runtime strtold function otherwise the
**                      field will be processed as text as
**                      defined above
**

There is a regex extension in the standard extensions place in the SQLite3 repository that can do this, but I did it by intelli-force in the following C code:

/*
**
** Determine affinity of field
**
** ignore leading space
** then may have + or -
** then may have digits or . (if . found then type=real)
** then may have digits (if another . then not number)
** then may have e (if found then type=real)
** then may have + or -
** then may have digits
** then may have trailing space
*/
static int vsv_isValidNumber(char *arg)
{
    char *start;
    char *stop;
    int isValid = 0;
    int hasDigit = 0;

    start = arg;
    stop = arg + strlen(arg) - 1;
    while (start <= stop && *start==' ')                // strip spaces from begining
    {
        start++;
    }
    while (start <= stop && *stop==' ')                 // strip spaces from end
    {
        stop--;
    }
    if (start > stop)
    {
        goto vsv_end_isValidNumber;
    }
    if (start <= stop && (*start=='+' || *start=='-'))  // may have + or -
    {
        start++;
    }
    if (start <= stop && isdigit(*start))               // must have a digit to be valid
    {
        hasDigit = 1;
        isValid = 1;
    }
    while (start <= stop && isdigit(*start))            // bunch of digits
    {
        start++;
    }
    if (start <= stop && *start=='.')                   // may have .
    {
        isValid = 2;
        start++;
    }
    if (start <= stop && isdigit(*start))
    {
        hasDigit = 1;
    }
    while (start <= stop && isdigit(*start))            // bunch of digits
    {
        start++;
    }
    if (!hasDigit)                                      // no digits then invalid
    {
        isValid = 0;
        goto vsv_end_isValidNumber;
    }
    if (start <= stop && (*start=='e' || *start=='E'))  // may have 'e' or 'E'
    {
        isValid = 3;
        start++;
    }
    if (start <= stop && isValid == 3 && (*start == '+' || *start == '-'))
    {
        start++;
    }
    if (start <= stop && isValid == 3 && isdigit(*start))
    {
        isValid = 2;
    }
    while (start <= stop && isdigit(*start))            // bunch of digits
    {
        start++;
    }
    if (isValid == 3)
    {
        isValid = 0;
    }
vsv_end_isValidNumber:
    if (start <= stop)
    {
        isValid = 0;
    }
    return isValid;
}

This will return 0, 1, or 2. 0 means not NUMERIC, 1 means matches the integer regex above, and 2 means matches the float regex above. Whether the number is actually an integer or a real must be determined "by other means" because the number 1.1e2 "looks like" a float but is really just the integer 110. (So really, if the result is NOT 0 then the affinity is NUMERIC).

(13) By Simon Willison (simonw) on 2020-09-28 01:45:01 in reply to 12 [link] [source]

That's really useful, thank you for the example code!

(14.1) By Simon Slavin (slavin) on 2020-09-28 13:55:52 edited from 14.0 in reply to 9 [link] [source]

Add to your test suite

0 -0.01 -1 -1.01 1e12 -1e12 1e-12 -1e-12 1.1e-12 -1.1e-12 1brick -1brick

(15) By anonymous on 2020-09-28 15:39:36 in reply to 10 [link] [source]

Tried the link, changed one of the values to 1E1, but it wasn't recognised as a number.

(16) By Simon Willison (simonw) on 2020-09-28 16:33:25 in reply to 15 [link] [source]

Yes, it has MANY limitations with things like that syntax, NaN etc.

(17) By Keith Medcalf (kmedcalf) on 2020-09-28 16:56:26 in reply to 16 [link] [source]

See http://www.dessus.com/files/sqlnumeric.c

Implements a function ToBestType(x) which converts the value x passed as a parameter to its Best Type.

  • TEXT or BLOB parameter that "looks like a number" returns the number as Integer or Real as appropriate.
  • TEXT or BLOB parameter that is zero length returns NULL.
  • INTEGER or REAL parameter is converted to text and returns the same number in its most appropriate format (Integer or Real).
  • Everything else returns the parameter given.

(18) By Keith Medcalf (kmedcalf) on 2020-09-28 17:06:33 in reply to 14.1 [link] [source]

>gcc -s -mdll -m64 -I/source/bld sqlnumeric.c -o sqlnumeric.dll

>sqlite
SQLite version 3.34.0 2020-09-26 18:58:45
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load sqlnumeric
sqlite> create table x(x);
sqlite> insert into x values ('0');
sqlite> insert into x values ('-0.01');
sqlite> insert into x values ('-1');
sqlite> insert into x values ('-1.01');
sqlite> insert into x values ('1e12');
sqlite> insert into x values ('-1e12');
sqlite> insert into x values ('1e-12');
sqlite> insert into x values ('-1e-12');
sqlite> insert into x values ('1.1e-12');
sqlite> insert into x values ('-1.1e-12');
sqlite> insert into x values ('1brick');
sqlite> insert into x values ('-1brick');
sqlite> select x, typeof(y), y
   ...>   from (
   ...>         select x,
   ...>                ToBestType(x) as y
   ...>           from x
   ...>        )
   ...> ;
┌──────────┬───────────┬────────────────┐
│    x     │ typeof(y) │       y        │
├──────────┼───────────┼────────────────┤
│ 0        │ integer   │ 0              │
│ -0.01    │ real      │ -0.01          │
│ -1       │ integer   │ -1             │
│ -1.01    │ real      │ -1.01          │
│ 1e12     │ integer   │ 1000000000000  │
│ -1e12    │ integer   │ -1000000000000 │
│ 1e-12    │ real      │ 1.0e-12        │
│ -1e-12   │ real      │ -1.0e-12       │
│ 1.1e-12  │ real      │ 1.1e-12        │
│ -1.1e-12 │ real      │ -1.1e-12       │
│ 1brick   │ text      │ 1brick         │
│ -1brick  │ text      │ -1brick        │
└──────────┴───────────┴────────────────┘
sqlite> create table y(x, y numeric);
sqlite> insert into y select x,x from x;
sqlite> select x, typeof(y), y from y;
┌──────────┬───────────┬────────────────┐
│    x     │ typeof(y) │       y        │
├──────────┼───────────┼────────────────┤
│ 0        │ integer   │ 0              │
│ -0.01    │ real      │ -0.01          │
│ -1       │ integer   │ -1             │
│ -1.01    │ real      │ -1.01          │
│ 1e12     │ integer   │ 1000000000000  │
│ -1e12    │ integer   │ -1000000000000 │
│ 1e-12    │ real      │ 1.0e-12        │
│ -1e-12   │ real      │ -1.0e-12       │
│ 1.1e-12  │ real      │ 1.1e-12        │
│ -1.1e-12 │ real      │ -1.1e-12       │
│ 1brick   │ text      │ 1brick         │
│ -1brick  │ text      │ -1brick        │
└──────────┴───────────┴────────────────┘
sqlite>

(19) By Simon Willison (simonw) on 2020-10-08 03:06:08 in reply to 17 [link] [source]

That's brilliant!